A routine procedure in a back-end administrative interface caused all device authentication keys to be reset due to an input validation issue. The root cause was determined quickly; however, recovering the correct authentication keys took longer than anticipated. During this period, the following devices were unable to make or receive calls:
* Desktop UC phone hardware
* Non-dialtone branded softphones (e.g., Bria, X-Lite)
* SIP-enabled devices, such as door phones, paging systems, etc.
The following services were not impacted:
* Inbound calls. All inbound calls were handled by the Level365 UC platform and our partner carriers; in the worst case, calls terminated in the user's voicemail
* Dialtone Web from UC Portal
* Dialtone for iOS or Android, if the user logged out and back in
* Fax services
* SMS services, both inbound and outbound
Root cause: Lack of input validation, causing underlying SQL actions to be truncated
Trigger: Import of a device-addition template file containing invalid characters
Resolution: Reimporting all devices in the system with their previous valid authentication keys
* Work with network hardware cluster vendor to add input validation on import process
* Work with network hardware cluster vendor to improve import process speed
* Streamline internal process for communicating incidents to customers
* Backups were successful
* Root problem identified quickly
* Service restoration import process unacceptably long
* Better communication to impacted customers
* Decrease time between initial problem identification and declaration of Incident
* Decrease time to resolution
* Add additional input validation to avoid this problem in the future
* Speed up import process time
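The input-validation action item above can be sketched as a pre-import check that rejects any key able to carry SQL-significant characters before the template file reaches the import process. This is a minimal sketch: the character whitelist, length bounds, and row layout are illustrative assumptions, not the platform's actual policy.

```python
import re

# Hypothetical whitelist for generated authentication keys: mixed-case
# letters, digits, and special characters, deliberately excluding quotes,
# semicolons, backslashes, and hyphens (the characters most likely to be
# significant to SQL). The 16-64 length bound is illustrative.
SAFE_KEY = re.compile(r"^[A-Za-z0-9!@#%^*_+=.]{16,64}$")

def validate_rows(rows):
    """Split device rows of (device_id, auth_key) into valid and rejected."""
    valid, rejected = [], []
    for device_id, auth_key in rows:
        (valid if SAFE_KEY.fullmatch(auth_key) else rejected).append((device_id, auth_key))
    return valid, rejected

valid, rejected = validate_rows([
    ("0004f2aabbcc", "Xy9!mK2#qL8%vN4t"),  # conforms to the whitelist
    ("0004f2ddeeff", "Xy9!';--badkeyval"),  # quote, semicolon, comment marker
])
```

Rejecting the entire import whenever `rejected` is non-empty keeps a single malformed key from ever reaching the underlying SQL.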
Time - Description
9:15am - Initial reports of registration problems
9:30am - Identification of root cause. Begin recovery plan.
9:56am - Incident declared
~10:00am - Began steps to restore service using temporary methods
10:20am - Processing of known good backup into proper format begins
11:06am - Import file build complete
1:30pm - Restoration of service complete. Devices now able to register
Late Monday we ran a procedure from a back-end administrative interface. This procedure is performed regularly and involves importing a new device manually rather than adding it via the portal. The authentication key was generated following best practice, using mixed-case alphabetic, numeric, and special characters. Unfortunately, the generated key contained a combination of characters that broke the underlying database update: instead of updating only the single imported device, the statement applied the same password to all devices. This wasn't noticed immediately because it was close to the end of the day, and devices did not immediately unregister despite the changed authentication keys.
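The failure mode described above can be reproduced in miniature: when a generated value is spliced directly into SQL text, a quote and comment sequence inside it can truncate the statement so the `WHERE` clause is lost. This is a hypothetical illustration using SQLite; the table name, columns, and the exact bad key are assumptions, not the vendor's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, auth_key TEXT)")
conn.executemany("INSERT INTO devices VALUES (?, ?)",
                 [(1, "key-one"), (2, "key-two"), (3, "key-three")])

# A generated key whose special characters happen to close the string
# literal early and comment out the rest of the statement.
bad_key = "Xy9!';--"

# Splicing the key directly into the SQL text truncates the UPDATE:
# everything after the injected semicolon becomes a comment, the WHERE
# clause is lost, and every row is updated.
unsafe_sql = f"UPDATE devices SET auth_key = '{bad_key}' WHERE id = 2"
conn.executescript(unsafe_sql)

hit = conn.execute(
    "SELECT COUNT(*) FROM devices WHERE auth_key = 'Xy9!'").fetchone()[0]
# hit == 3: all devices now share the same password

# The fix: bind the key as a parameter so its characters are always
# treated as data, never as SQL syntax.
conn.execute("UPDATE devices SET auth_key = ? WHERE id = 2", (bad_key,))
```

With the parameterized form, only device 2 receives the new key, however unusual its characters.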
We first became aware of the problem about 9:15am on Tuesday and began to investigate. We quickly determined the cause and began work on a recovery plan. This required building a new device import file from the last backup run before the problem occurred (Sunday night/Monday morning), which meant converting the backup file into the appropriate format. While this conversion was taking place, we pushed out a change that would allow devices to connect with the incorrect password. Unfortunately, this step took longer than anticipated due to the large number of devices that needed to be modified and the time required to update each device. When this step was completed, devices began to register again.
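The conversion step can be sketched as pulling each device's last known-good key out of the backup and emitting it in the layout the import expects. Both file layouts here are hypothetical; the real backup and import formats belong to the platform vendor.

```python
import csv
import io

# Last known-good backup (Sunday night), in an assumed CSV layout.
backup = io.StringIO(
    "device_id,mac,auth_key,extension\n"
    "1,0004f2aabbcc,Xy9mK2qL8vN4tP7r,1001\n"
    "2,0004f2ddeeff,Qr7pT5wZ3sB6nM1k,1002\n"
)

# The (assumed) import template needs only the device identity and its
# last valid authentication key.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["mac", "auth_key"])
writer.writeheader()
for row in csv.DictReader(backup):
    writer.writerow({"mac": row["mac"], "auth_key": row["auth_key"]})

import_file = out.getvalue()
```

At scale, the slow part of the incident was not this conversion but the per-device update on import, which is why import process speed is one of the action items above.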
Due to the nature of the changes made, some devices needed to be rebooted after the final step was completed. We worked directly with anyone still having issues after that; all services are completely restored and all devices are functional at this time.
Level365 maintains 6 geographically diverse network hardware clusters, which allows us to avoid connectivity and hardware availability issues. Due to the way our network hardware clusters synchronize data via the underlying API, this flaw distributed the change to all 6 clusters. Because this synchronization doesn't happen at the SQL level, a SQL restore would not have addressed all 6 clusters.
We will be working closely with our vendor to resolve the issues discovered during this incident.