Problems with Device Registration
Incident Report for Level365
Postmortem

Summary

A routine procedure in a back-end administrative interface reset all device authentication keys due to an input validation issue. The root cause was determined quickly; however, recovering the correct authentication keys took longer than anticipated. During this period, devices were unable to make or receive calls.

Impact

What was Impacted

* Desktop UC phone hardware
* Non-dialtone branded softphones (e.g., Bria, X-Lite)
* SIP-enabled devices, such as door phones, paging systems, etc.

What wasn't Impacted

* Inbound calls. All inbound calls were handled by the Level365 UC platform and our partner carriers, and in the worst case terminated to the user's voicemail
* Dialtone Web from UC Portal
* Dialtone for iOS or Android, if the user logged out and back in
* Fax services
* SMS services, both inbound and outbound

Root Cause

Lack of input validation, causing underlying SQL actions to be truncated

Trigger

Import of a device addition template file containing invalid characters

Resolution

Reimporting all devices in the system with their previous valid authentication keys

Action Items

* Work with network hardware cluster vendor to add input validation to the import process (see the sketch after this list)
* Work with network hardware cluster vendor to improve import process speed
* Streamline internal process for communicating incidents to customers
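The first action item above is about rejecting bad input before it ever reaches the database. Below is a minimal sketch of the kind of pre-import check we have in mind; the file format, column name, and allowed-character policy are assumptions for illustration, not the vendor's actual rules.

```python
import csv
import re

# Hypothetical policy: mixed-case letters, digits, and a limited set of special
# characters known to be safe for the downstream import.
ALLOWED_KEY = re.compile(r"^[A-Za-z0-9!@#%^*_\-+=.]{16,64}$")

def validate_import_file(path):
    """Return a list of (line_number, problem) pairs; an empty list means the file is safe to import."""
    problems = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.DictReader(f), start=2):  # header is line 1
            key = row.get("authentication_key", "")
            if not ALLOWED_KEY.match(key):
                problems.append((lineno, f"disallowed characters or bad length in key {key!r}"))
    return problems

if __name__ == "__main__":
    issues = validate_import_file("device_addition_template.csv")
    for lineno, problem in issues:
        print(f"line {lineno}: {problem}")
    if issues:
        raise SystemExit("Import aborted: template contains invalid entries")
```

The key idea is that a template containing any character outside an agreed-upon set is rejected before any database update runs.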

Lessons Learned

What went well

* Backups were successful
* Root problem identified quickly

What went wrong

* Service restoration import process took unacceptably long

What we can do better

* Better communication to impacted customers
* Decrease time between initial problem identification and declaration of Incident
* Decrease time to resolution
* Add additional input validation to avoid this problem in the future
* Speed up import process time

Timeline

Time - Description
9:15am - Initial reports of registration problems
9:30am - Identification of root cause. Begin recovery plan.
9:56am - Incident declared
~10:00am - Steps being taken to restore service using temporary methods
10:20am - Processing of known good backup into proper format begins
11:06am - Import file complete
1:30pm - Restoration of service complete. Devices now able to register

Details

Late Monday we ran a procedure from a back-end administrative interface. This procedure is performed regularly and involves importing a new device manually rather than adding it via the portal. The authentication key was generated following best practice, using mixed-case alphabetic, numeric, and special characters. Unfortunately, the generated key contained a combination of characters that inadvertently broke an underlying database update: instead of updating only that single imported device, the update applied the same password to all devices. This wasn't noticed immediately, as it was close to the end of the day and devices did not immediately unregister after the authentication key changed.
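For illustration only: the failure mode described above is what can happen when a credential string is spliced directly into a SQL statement rather than passed as a bound parameter. The sketch below uses SQLite and invented table and column names; it is not the vendor's actual schema or update path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (device_id TEXT PRIMARY KEY, auth_key TEXT)")
conn.executemany("INSERT INTO devices VALUES (?, ?)",
                 [("dev-001", "oldKey1"), ("dev-002", "oldKey2")])

new_key = "Abc'123!"  # a generated key containing a character the update did not expect

# Risky pattern (shown, not executed): splicing the key into the statement lets a
# quote or other special character change or truncate the statement, so the
# update may no longer be limited to the intended device.
risky_sql = f"UPDATE devices SET auth_key = '{new_key}' WHERE device_id = 'dev-001'"

# Safe pattern: bound parameters keep the key as data, never as SQL syntax, so
# the WHERE clause is preserved and only the intended device is touched.
conn.execute("UPDATE devices SET auth_key = ? WHERE device_id = ?",
             (new_key, "dev-001"))
conn.commit()
print(conn.execute("SELECT device_id, auth_key FROM devices ORDER BY device_id").fetchall())
```

Parameterized updates, or strict validation before the import as noted in the action items, prevent a single bad key from rewriting every row.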

We first became aware of the problem at about 9:15am on Tuesday and began to investigate. We quickly determined the cause and began to work on a recovery plan. This meant building a new device import file from the last backup run before the problem occurred (Sunday night/Monday morning), which required converting the backup file into the appropriate format. While this conversion was taking place, we pushed out a change that would allow devices to connect with the incorrect password. Unfortunately, this step took longer than we had anticipated due to the large number of devices that needed to be modified and the time required to update each device. When this step was completed, devices began to register again.
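As a rough illustration of the recovery step (not our actual tooling), rebuilding an import file from a backup export could look like the following; the backup layout, import layout, and column names are all hypothetical.

```python
import csv

def build_import_file(backup_path, import_path):
    # Map known-good authentication keys from the backup export into the layout
    # the device import process expects. Both layouts here are assumptions.
    with open(backup_path, newline="") as src, open(import_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["device_id", "authentication_key"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "device_id": row["device"],            # column names are invented
                "authentication_key": row["auth_key"],
            })

if __name__ == "__main__":
    build_import_file("backup_sun_night.csv", "device_reimport.csv")
```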

Due to the nature of the changes made, some devices needed to be rebooted after the final step was completed. We worked directly with anyone still having issues after that step, and all services are completely restored and all devices are functional at this time.

Level365 maintains 6 geographically-diverse network hardware clusters. This allows us to avoid connectivity and hardware availability issues. Due to the way our network hardware clusters synchronize data via the underlying API, this flaw distributed the change to all 6 network hardware clusters. Because this synchronization doesn't happen at the SQL level, a SQL restore wouldn't address all 6 network hardware clusters.
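The practical consequence is that corrections have to be applied through the same API layer that synchronizes the clusters, so the fix propagates the same way the flaw did. The sketch below is purely illustrative; the endpoint, payload, and authentication are placeholders, not the vendor's actual API.

```python
import requests

def restore_device_key(base_url, device_id, auth_key, token):
    # Hypothetical endpoint: pushing the known-good key through the API layer
    # lets the clusters' own synchronization carry the change to all 6 sites,
    # which a direct SQL restore on one cluster's database would not do.
    resp = requests.put(
        f"{base_url}/api/devices/{device_id}",
        json={"authentication_key": auth_key},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
```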

Final Word

We will be working closely with our vendor to resolve the issues discovered during this incident.

Posted Jan 30, 2019 - 15:54 EST

Resolved
This issue has been resolved. If you continue to have problems, please reach out to support for assistance.
Posted Jan 29, 2019 - 17:20 EST
Update
We are continuing to work on a fix for this issue.
Posted Jan 29, 2019 - 16:00 EST
Update
Temporary fix is in place and permanent solution is being rolled out at this time.
Posted Jan 29, 2019 - 14:22 EST
Identified
We have identified the issue and are working on a resolution.
Posted Jan 29, 2019 - 10:22 EST
Investigating
We are currently investigating an issue with devices registering after a reboot.
Posted Jan 29, 2019 - 09:56 EST
This incident affected: UCaaS (Core Services).