Service Incident March 29th 2016

23:31 UTC | 16:31 PT
Service has stabilized. Investigations continue to determine root cause.

22:24 UTC | 15:24 PT
Progress has been made in resolving availability issues for some customers. We appreciate your patience while we work towards stability.

21:32 UTC | 14:32 PT
Efforts continue to remediate performance-related issues affecting some customers. We appreciate your patience.

20:43 UTC | 13:43 PT
We are investigating isolated performance issues affecting our voice platform and performance degradation for some customers.

19:45 UTC | 12:45 PT
Efforts are still ongoing to investigate and remedy the intermittent performance issues some customers are seeing.

19:02 UTC | 12:02 PT
We are investigating the performance-impacting issues, which appear to be intermittent, while we continue working to remediate them.

17:40 UTC | 10:40 PT
We continue to work with a vendor to fix the performance-impacting issues. More information to come when available.

17:03 UTC | 10:03 PT
Efforts to implement a fix for the performance-impacting issue affecting some users are ongoing.

16:28 UTC | 09:28 PT
We are currently working on a fix for the ongoing performance issues. More information to follow when available.

14:50 UTC | 07:50 PT
We have identified the source of the performance issues affecting some customers and continue to work on a fix.

POST-MORTEM

In the last week, the Zendesk East Coast platform hosted in Virginia, United States, experienced four network-related service impacts. One of these (on March 23) was a malicious DDoS incident that accounted for 13 minutes of service impact. The other three incidents were all related to a capacity issue that we have identified in our Virginia load balancer infrastructure.

These capacity issues occurred on March 23rd, 24th, and 29th. Our investigation into the incidents extended long after the main incident. Initial indications pointed to a potential bug in the hardware load balancer firmware. After significant work among our teams and in concert with our network component vendors, we determined that we had exceeded the capacity limits of our load balancer modules. The investigation into a bug masked the fact that the issue was a straightforward capacity problem.

In response to this issue, we have adjusted our configuration of this component to allow significant capacity headroom. We have also purchased upgraded load balancer modules, which will provide 4x the current capacity. If future expansion is required, there are still many larger modules that we can use. We have also adjusted our methodology for capacity planning of load balancer components so that future increases in demand can be accommodated with a timely upgrade of these systems.

We regret the impact these events have had on your business. The Zendesk Network Engineering team will continue working to provide a reliable and performant network operating environment for delivery of Zendesk’s services.  

FOR MORE INFORMATION

Please subscribe to this article for regular updates until the issue is resolved. If you aren't subscribed to our Twitter feed, we encourage you to do so in order to get the most current information about any service issues. We also record all site outages on our system status page where you can see the past 12 months of service uptime. If you have questions about this issue, please open a ticket with us by sending a note to support@zendesk.com.