Disruption with Voice Server

Incident Report for Level365

Postmortem

On the morning of October 14, 2020, we installed several minor updates to our platform. These updates were performed at midnight EST in accordance with our published maintenance notification and maintenance window, and went as expected. By 12:05am the updates were complete, and all systems tested normal.

Starting around 7:20am, we began to notice packet loss to one of the core servers. We immediately dispatched an engineer to that data center while the rest of engineering continued to troubleshoot remotely. Once on site, the engineer determined that the server's networking stack had failed, making the server inaccessible. The system was restarted and the server began responding properly.

Although nothing in the minor updates should have caused an issue with the networking stack, the timing of the issue is hard to ignore. We believe the issue resulted from the long uptime of that particular server. We pride ourselves on our uptime, and that server had been up for over 2 years without a reboot. We are now reconsidering the strategy of long continuous periods of uptime and are instead working on a strategy of regular system restarts in a controlled environment.

Posted Oct 26, 2020 - 08:06 EDT

Resolved

System is stable and all systems are restored. This incident is considered resolved.

Posted Oct 14, 2020 - 10:25 EDT

Monitoring

Server in question was restarted and service appears to be restored. Monitoring the situation.

Posted Oct 14, 2020 - 08:19 EDT

Investigating

We are currently investigating a disruption on one of our UC servers.

Posted Oct 14, 2020 - 07:20 EDT

This incident affected: UCaaS (Core Services).