As of 15:45 GMT / 07:45 AM PST we are working to resolve the following service incident:
We are receiving reports of latency across various Zendesk services and are investigating.
16:20 GMT / 08:20 AM
We are still investigating the issue creating latency in various Zendesk services for some of our customers. More info to come.
16:55 GMT / 08:55 AM
Latency has been localized to our East Coast data center; customers are reporting intermittent voice and access issues.
17:29 GMT / 09:29 AM
Still investigating performance issues at East Coast data center, including some reports of email delays.
18:08 GMT / 10:08 AM
Still investigating fix for performance issues at East Coast Data Center. Next update in 30 mins.
18:50 GMT / 10:50 AM
Following changes intended to resolve the issue, we are seeing improved performance in our East Coast data center.
19:12 GMT / 11:12 AM
We have confirmation from our system monitoring and from customers that the performance issue has been resolved. A post-mortem will be posted here shortly.
An issue in communication between clusters in a web service and one of our US-based data centers caused service discovery for our ELK stack endpoint (which provides the syslog destination for our syslog-ng configurations) to fail intermittently on the data center's servers.
With the service-discovery DNS record failing intermittently, syslog-ng entered a failure state in which it exhausted resources on the servers that could not resolve the DNS record for the ELK endpoint. This, in turn, made many services on our internal network unavailable, resulting in green screens and generally slow functionality for our customers.
To fix this problem, we put a workaround in place: the DNS service discovery was removed from the syslog-ng configuration and replaced with a plain DNS record in our internal zone.
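As a rough illustration of that workaround, the change amounts to swapping the service-discovery hostname in the syslog destination for a static internal record. All names below are hypothetical; the actual destination driver, port, and zone names were not disclosed:

```
# Before (hypothetical): service-discovery name that resolved intermittently
# destination d_elk { syslog("elk.service-discovery.internal" port(514)); };

# After (hypothetical): plain A record in the internal DNS zone
destination d_elk {
    syslog("elk.internal.example.com" transport("tcp") port(514));
};

# s_local is assumed to be a previously defined log source
log { source(s_local); destination(d_elk); };
```

The key point is that syslog-ng now resolves a stable record served by ordinary internal DNS, removing the dependency on the cluster-to-cluster communication path that was failing.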
We can add better monitoring of cluster health and communication to our monitoring/metrics system to improve detection of this failure scenario.
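One minimal form such monitoring could take is a periodic check that the syslog destination's DNS record actually resolves, alerting before syslog-ng reaches its resource-exhausting retry state. This is a sketch only; the hostname and the function name `dns_record_resolves` are illustrative, not part of our actual monitoring system:

```python
# Hypothetical DNS health check for the ELK endpoint record.
# A monitoring agent could run this on an interval and raise an
# alert after N consecutive failures.
import socket


def dns_record_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # "localhost" stands in for the internal ELK endpoint name.
    print(dns_record_resolves("localhost"))
```

Alerting on resolution failures directly targets the observed failure mode, rather than waiting for downstream symptoms such as server resource exhaustion.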