On the morning of October 14, 2020 we installed several minor updates to our platform. These upgrades were performed at midnight EST, within our published maintenance window, and went as expected. By 12:05am the updates were complete and all systems tested normal.
Starting around 7:20am, we began to notice packet loss to one of our core servers. We immediately dispatched an engineer to that data center while the rest of engineering continued to troubleshoot remotely. Once the engineer arrived on site, we determined that the server's networking stack had failed, rendering it inaccessible. The system was restarted and the server began responding properly.
Although nothing in the minor updates should have affected the networking stack, the coincidental timing of the issue is hard to ignore. We believe the failure resulted from the extended uptime of that particular server. We pride ourselves on our uptime, and that server had been up for over 2 years without a reboot. We are now reconsidering the strategy of long continuous periods of uptime, and are instead moving to a strategy of regular, controlled system restarts.
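As one illustration of what a controlled-restart strategy might look like, a scheduled reboot can be driven by cron. The schedule, maintenance window, and shutdown command below are assumptions for the sake of example, not our actual configuration:

```shell
# Illustrative crontab entry (assumed schedule, not our production config):
# reboot this node at 00:30 on the first Sunday of each month, inside a
# published maintenance window. The day-of-month/day-of-week test keeps
# the reboot to the *first* Sunday only.
30 0 1-7 * * [ "$(date +\%u)" -eq 7 ] && /sbin/shutdown -r now "Scheduled maintenance reboot"
```

In practice such a schedule would be staggered across redundant nodes so that no two members of a cluster restart in the same window.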