Service Incident November 11th 2015

SUMMARY

As of 15:45 GMT / 07:45 AM PST we are working to resolve the following service incident:

We are receiving reports of latency in various Zendesk services. We are investigating the issue.

16:20 GMT / 08:20 AM

We are still investigating the issue causing latency in various Zendesk services for some of our customers. More information to come.

16:55 GMT / 08:55 AM

Latency has been localized to our East Coast data center. Customers are reporting intermittent voice and access issues.

17:29 GMT / 09:29 AM

Still investigating performance issues at the East Coast data center, including some reports of email delays.

18:08 GMT / 10:08 AM

Still investigating a fix for the performance issues at the East Coast data center. Next update in 30 minutes.

18:50 GMT / 10:50 AM

Following changes intended to resolve the issue, we are seeing improved performance at the East Coast data center.

19:12 GMT / 11:12 AM

We have confirmation from both our system monitoring and our customers that the performance issue has been resolved. A post-mortem will be posted here shortly.

POST-MORTEM

An issue in communication between clusters in a web service and one of our US-based data centers caused service discovery for our ELK stack endpoint (which provides the syslog destination for our syslog-ng configurations) to fail intermittently on that data center's servers.

With the service discovery DNS record failing intermittently, syslog-ng entered a failure state in which it exhausted resources on the servers that were unable to resolve the DNS record for the ELK endpoint. This, in turn, made many services on our internal network unavailable, resulting in green screens and generally slow performance for our customers.

To fix this problem, a workaround was put in place: DNS service discovery was removed from the syslog-ng configuration and replaced with a plain DNS record in our internal zone.
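
As a rough illustration of what such a change could look like in a syslog-ng destination block (the hostnames below are hypothetical, not our actual configuration):

    # Workaround sketch: the ELK endpoint is addressed by a plain A record
    # in the internal zone instead of the service discovery name that was
    # intermittently failing to resolve. Hostnames are hypothetical.
    destination d_elk {
        syslog(
            "elk.logs.internal.example"   # plain internal DNS record
            transport("tcp")
            port(514)
        );
    };

    log {
        source(s_local);        # assumes an existing local log source
        destination(d_elk);
    };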

Better monitoring of cluster health and inter-cluster communication can be added to our monitoring/metrics system to improve detection of this failure scenario.
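
For example, a simple resolution check along the following lines could feed into that system (a minimal sketch; the hostname and exit-code convention are assumptions, not our actual setup):

    #!/usr/bin/env python3
    """Minimal sketch of a DNS health check for the ELK endpoint.

    The hostname is hypothetical; in practice the result would be reported
    to the existing monitoring/metrics system rather than printed.
    """
    import socket
    import sys

    ELK_ENDPOINT = "elk.logs.internal.example"  # hypothetical internal record


    def resolves(hostname):
        """Return True if the hostname currently resolves."""
        try:
            socket.getaddrinfo(hostname, None)
            return True
        except socket.gaierror:
            return False


    if __name__ == "__main__":
        if resolves(ELK_ENDPOINT):
            print("OK: %s resolves" % ELK_ENDPOINT)
            sys.exit(0)
        print("CRITICAL: %s failed to resolve" % ELK_ENDPOINT)
        sys.exit(2)  # Nagios-style "critical" exit code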