Service Incident January 22nd 2016

 

17:47 UTC | 09:47 PT
The backlog has been cleared and performance has returned to normal in our EU Data Center.

16:14 UTC | 08:14 PT
We are continuing to monitor and clear the backlog in the EU data center. Updates to follow when resolved.

14:55 UTC | 06:55 PT
Work continues on clearing the backlog in order to fully mitigate today's performance issues in the EU data center. More info once fixed.

14:10 UTC | 06:10 PT
Our Operations Team has found the root cause of the performance issue and we continue to work on cleaning up the backlog.

13:49 UTC | 05:49 PT
We're starting to see improvements to our service. Our Operations team continues to troubleshoot toward full resolution. More to follow.

13:39 UTC | 05:39 PT
We are continuing to investigate performance issues in our EU data center that are impacting our services.

13:20 UTC | 05:20 PT
We are investigating further performance issues in our EU data center, with significant impact on the Zendesk Voice service. More to follow.

POST-MORTEM

This incident was the result of an unexpected resource constraint in one of our backend key/value storage systems. This system began to run out of memory after changes caused a large increase in stored value size.
 
The resulting performance problem impacted the following activities:
- Zendesk Voice call initiation
- Delays for asynchronous background operations such as ticket creation from email, Twitter, and Facebook
- Authentication failures for accounts using SAML for authentication
- Timeouts for some API calls that depend on asynchronous background operations. For example, use of our update_many API endpoint was impacted.
 
In response to this incident we have addressed some issues, and are continuing to work on others.  Here’s our list:
- Fixed the application changes that led to the large increase in the size of the values stored in this system. This fix alone has led to a 50% decrease in overall storage needs.
- Investigating each of these clusters to identify where we should increase capacity.
- Changing the SAML authentication process so that it no longer depends on this backend system. During investigation of this impact, we found that SAML authentication did not actually need the dependency.
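
As an illustration of the first item above, here is a minimal sketch of how a write path could guard against unexpectedly large values. It is not the fix that was deployed: the dict-backed store, the class and error names, and the 64 KiB limit are all hypothetical and chosen only to show the idea of rejecting oversized values while tracking total stored bytes.

```python
# Hypothetical per-value size budget; the real limit is not stated in this report.
MAX_VALUE_BYTES = 64 * 1024


class ValueTooLargeError(ValueError):
    """Raised when a value exceeds the configured size budget."""


class SizeGuardedStore:
    """Dict-backed stand-in for a key/value store with a per-value size cap."""

    def __init__(self, max_value_bytes=MAX_VALUE_BYTES):
        self._data = {}
        self._max_value_bytes = max_value_bytes
        self.total_bytes = 0  # running total of stored value sizes

    def set(self, key, value: bytes):
        size = len(value)
        if size > self._max_value_bytes:
            # Refuse the write so one change cannot silently inflate memory use.
            raise ValueTooLargeError(
                f"value for {key!r} is {size} bytes, limit is {self._max_value_bytes}"
            )
        previous = self._data.get(key)
        if previous is not None:
            # Overwrites replace the old value, so subtract its size first.
            self.total_bytes -= len(previous)
        self._data[key] = value
        self.total_bytes += size

    def get(self, key):
        return self._data.get(key)
```

Tracking the running total alongside the cap is one way to make a sudden growth in stored value size visible in monitoring before memory is exhausted.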
 

FOR MORE INFORMATION