Service Incident February 3rd 2016

 

11:28 UTC | 03:28 PT
We are happy to report that the performance issues are now resolved. Post-mortem to follow.

11:18 UTC | 03:18 PT
We are seeing performance improve, but we continue to monitor and will provide another update shortly.

10:46 UTC | 02:46 PT
We are investigating performance issues affecting some customers. More information to follow shortly.

POST-MORTEM

Due to a bug, views stopped executing and returned 500 errors, which meant some customers had trouble loading views and saw related performance problems for around 50 minutes. To resolve the UI issues for Zendesk customers, the service responsible for rule execution had to be restarted in two of our data centres. 

The cause was identified as a configuration bug in a web server: we had switched off per-process request queuing so that fast rule executions would not get queued behind slow count calls. Unfortunately, because of this bug, we would slowly lose these services under heavy traffic, one by one, over roughly 24 hours, and for some reason the watching process never restarted any of the failed services. Once the last one was overwhelmed, rule execution fell over: the load balancer marked that server dead, and eventually all of our servers were marked dead. At that point, view executions failed completely for the affected 20% of accounts. To recover, we rolled back rule execution's reliance on the affected services. 
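
To make the watchdog gap concrete, here is a minimal sketch, in Go, of what a watching process is generally expected to do: poll each worker's health endpoint and restart any worker that stops responding, so failures cannot quietly accumulate until the whole pool is gone. This is purely illustrative and not our actual tooling; the worker names, ports, health URLs, and binary path are all assumptions made for the example.

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// worker describes one service process the watchdog is responsible for.
// All names, URLs, and paths here are hypothetical.
type worker struct {
	name      string
	healthURL string
	startCmd  []string
}

// healthy reports whether the worker's health endpoint answers 200 in time.
func healthy(client *http.Client, w worker) bool {
	resp, err := client.Get(w.healthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// restart launches a replacement process for a failed worker.
func restart(w worker) {
	log.Printf("worker %s unhealthy, restarting", w.name)
	cmd := exec.Command(w.startCmd[0], w.startCmd[1:]...)
	if err := cmd.Start(); err != nil {
		log.Printf("could not restart %s: %v", w.name, err)
	}
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	workers := []worker{
		{name: "rules-1", healthURL: "http://localhost:9001/health", startCmd: []string{"/usr/local/bin/rules-worker", "--port=9001"}},
		{name: "rules-2", healthURL: "http://localhost:9002/health", startCmd: []string{"/usr/local/bin/rules-worker", "--port=9002"}},
	}
	// Poll every worker forever; a worker that stops answering gets
	// restarted instead of being silently dropped from the pool.
	for {
		for _, w := range workers {
			if !healthy(client, w) {
				restart(w)
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```

In this incident the equivalent restart step never fired, which is why capacity drained away until the load balancer had nothing healthy left to route to.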

We have a number of remediation actions planned to prevent a recurrence of this incident. First, we have reset the web server configuration to fix the bug that caused it. Second, we are improving the logging and monitoring of the service that failed here, so we can keep a closer eye on its performance. Finally, we are adding fallback routes to ensure service continuity if a similar failure occurs in the future. 
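
As a rough illustration of what we mean by a fallback route, the sketch below (again in Go, with hypothetical hostnames, ports, and endpoints) forwards rule-execution requests to a primary backend and retries against a standby when the primary errors or returns a 5xx, rather than passing the failure straight through to the browser. It is a simplified example for idempotent, body-less requests, not a description of our production routing layer.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// Hypothetical backends; hostnames and ports are illustrative only.
const (
	primary = "http://rules-primary.internal:8080"
	standby = "http://rules-standby.internal:8080"
)

var client = &http.Client{Timeout: 3 * time.Second}

// forward replays an incoming (idempotent, body-less) request against base.
func forward(base string, r *http.Request) (*http.Response, error) {
	req, err := http.NewRequest(r.Method, base+r.URL.RequestURI(), nil)
	if err != nil {
		return nil, err
	}
	req.Header = r.Header.Clone()
	return client.Do(req)
}

// handler tries the primary backend first and only falls back to the
// standby when the primary errors out or returns a 5xx, so a backend
// failure does not surface to the customer as a 500.
func handler(w http.ResponseWriter, r *http.Request) {
	resp, err := forward(primary, r)
	if err != nil || resp.StatusCode >= 500 {
		if err == nil {
			resp.Body.Close() // discard the failed primary response
		}
		log.Printf("primary failed (err=%v), trying standby", err)
		resp, err = forward(standby, r)
		if err != nil {
			http.Error(w, "rule execution unavailable", http.StatusServiceUnavailable)
			return
		}
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

However the routing is actually implemented, the principle is the same: a request should have somewhere else to go before it turns into a 500 for the customer.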

FOR MORE INFORMATION