Service Incident, August 26th, 2016: Performance issues on Pods 4 and 8

18:28 UTC | 11:28 PT
The issue affecting access for instances on Pods 5, 6, 9, and 13 has been resolved.

18:12 UTC | 11:12 PT
The issue affecting access for instances on Pods 5, 6, 9, and 13 has stabilized as we continue to monitor performance.

17:38 UTC | 10:38 PT
We are mitigating access issues and beginning to see improvements across Pods 5, 6, and 9.

15:49 UTC | 08:49 PT
We are seeing improvement in slowness and access issues for Pods 5, 6, and 13, but we are continuing to investigate.

POST-MORTEM

A hardware failure and a redundancy failure in our West Coast data center impacted site-wide services on accounts housed in Pods 4 and 8. Initial alerts and attempts to resolve the hardware failure centered on one of our two intrusion detection systems. These devices are designed to "fail open," meaning that traffic should continue to pass through them in the event of any catastrophic failure, including total loss of power.

Several days before the service-impacting event, one of the devices failed and triggered alerts. A brief investigation found the appliance to be frozen. Several hours before the service-impacting event, an engineer issued a restart command; the appliance responded positively but remained frozen. We then submitted a ticket with our data center remote-hands team to physically power-cycle the appliance, as part of the vendor's suggested remediation steps. When the device was power-cycled, a service disruption occurred at our data center. We restored services by manually switching to our secondary load balancer, where the physical connection is shared between the second security device and passthrough load balancer traffic.

We currently have open tickets with our vendor covering the following:

  • Fail-open failure: We've requested forensic sampling of the event.
  • Hardware replacement for the affected device after logs and forensics are recovered.
  • Testing procedures post-fix.

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.