On Wednesday night, we rolled out an upgrade across our data centers. The upgrade had been tested thoroughly beforehand, but it contained an issue that was triggered only when a specific set of conditions was met, and unfortunately our testing did not cover that case.
While engineering worked on the root cause of the issue, the technical support department began rolling out temporary remediation steps to all impacted customers. We had restored service for approximately 75% of impacted customers when engineering discovered the cause of the issue and implemented a global fix.
After reviewing this incident from the pre-upgrade testing phase through its resolution, we have identified several areas where we can improve:
We will improve our testing methods, including adding unit tests that cover every edge and corner case we can identify. This will help us catch issues like this one before they reach customers. These tests will also be run before and after every future upgrade, allowing us to compare pre- and post-upgrade results.
We didn’t do the best job of communicating the incident and its status through the incident lifecycle. In future incidents, we will assign a Communications Lead whose sole responsibility will be to send regular updates to customers and keep the status page updated as frequently as possible.
We have discussed this issue with engineering and emphasized how important it is to communicate any potentially breaking code changes included in an upgrade. Just because a change shouldn’t break anything doesn’t mean that it won’t, and if it’s not documented somewhere accessible, the remediation process slows down.
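The pre- and post-upgrade comparison mentioned in the testing item above could look something like the following. This is only a minimal sketch: the test names and the pass/fail dictionaries are hypothetical and do not reflect our actual test suite.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two test runs, each mapping a test name to True (pass) or False (fail)."""
    # Tests that passed before the upgrade but fail afterwards.
    regressions = sorted(
        name for name, ok in before.items() if ok and after.get(name) is False
    )
    # Tests that ran before the upgrade but are absent from the later run.
    missing = sorted(name for name in before if name not in after)
    return {"regressions": regressions, "missing": missing}


# Hypothetical results: one scenario per device combination.
pre_upgrade = {
    "register_polycom_via_edgemarc": True,
    "register_yealink_via_edgemarc": True,
    "register_yealink_direct": True,
}
post_upgrade = {
    "register_polycom_via_edgemarc": True,
    "register_yealink_via_edgemarc": False,
    "register_yealink_direct": True,
}

report = compare_runs(pre_upgrade, post_upgrade)
print(report["regressions"])  # ['register_yealink_via_edgemarc']
```

A comparison like this only helps if the scenario that breaks is in the suite to begin with, which is why the edge and corner cases matter as much as the before-and-after run.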
This particular issue only impacted customers who use both an EdgeMarc Session Border Controller and Yealink phones. The SIP registration servers were rejecting registrations when these two conditions were met, reporting “Duplicate Headers”. Why this was happening was a little more difficult to determine. There were two questions that needed to be answered: which headers were duplicated, and what changed during the upgrade?
A tcpdump quickly showed that the duplicate header was an Allow-Events header that was being inserted by the EdgeMarc. After further research, it appeared that there was a default setting on the EdgeMarcs to help with local call survivability by requesting a certain class of events: Allow-Events: BroadWorksSubscriberData. This only seemed to be an issue with Yealink phones. Polycom phones seem to have their headers appended properly with the BroadWorksSubscriberData events class.
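For anyone troubleshooting something similar, repeated header names can be spotted in a captured SIP message with a few lines of Python. This is a rough sketch rather than the tooling we used. The sample REGISTER below is made up for illustration: the addresses, tags, and the first Allow-Events value are hypothetical, and only the BroadWorksSubscriberData value comes from the incident.

```python
from collections import Counter


def duplicate_headers(sip_message: str) -> dict[str, int]:
    """Return header names that appear more than once, with their counts."""
    # The header section ends at the first blank line.
    head = sip_message.replace("\r\n", "\n").split("\n\n", 1)[0]
    names = []
    # Skip the first line, which is the request or status line.
    for line in head.split("\n")[1:]:
        # Ignore empty lines, folded continuation lines, and anything without a colon.
        if not line or line[0] in " \t" or ":" not in line:
            continue
        # Header names are case-insensitive, so normalize before counting.
        names.append(line.split(":", 1)[0].strip().lower())
    return {name: n for name, n in Counter(names).items() if n > 1}


# Hypothetical capture: everything here is illustrative except the
# BroadWorksSubscriberData event class named in the incident.
SAMPLE_REGISTER = (
    "REGISTER sip:example.invalid SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK-hypothetical\r\n"
    "From: <sip:1001@example.invalid>;tag=abc\r\n"
    "To: <sip:1001@example.invalid>\r\n"
    "Call-ID: hypothetical-call-id\r\n"
    "CSeq: 1 REGISTER\r\n"
    "Allow-Events: talk,hold\r\n"  # hypothetical value sent by the phone
    "Allow-Events: BroadWorksSubscriberData\r\n"  # added along the path
    "Content-Length: 0\r\n"
    "\r\n"
)

print(duplicate_headers(SAMPLE_REGISTER))  # {'allow-events': 2}
```

Some SIP headers, such as Via, can legitimately appear more than once, so a check like this is a diagnostic aid and not a validity test.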
Once we discovered what was being duplicated and how to disable that default option, we began rolling out the mitigation by changing this setting for all affected customers.
At the same time, we were working with engineering to answer the second question: what changed? We discovered that engineering had changed SIP packet processing in an attempt to tighten up security and filter out non-RFC data. Combined with the EdgeMarcs duplicating one of the SIP headers, this caused registrations that had previously been accepted to be rejected. We finally determined that this functionality was new and enabled by default. Using the information provided by engineering, we disabled it, and the problem was resolved.