Service Incident, August 24th 2016: Performance issues on Pods 4 and 8

05:16 UTC | 22:16 PT
Accessibility issues for our Pod 4 and Pod 8 accounts are now resolved. Post-mortem to follow.

04:42 UTC | 21:42 PT 

POST-MORTEM

A hardware failure and a redundancy failure in our West Coast data center impacted site-wide services on accounts housed in Pods 4 and 8. The initial alerts and remediation attempts centered on one of our two intrusion detection systems. These devices are designed to "fail open," meaning that passthrough traffic should continue to flow during any catastrophic failure, including total loss of power.

Several days before the service-impacting event, one of the devices failed and triggered alerts. A brief investigation found the appliance to be frozen. Several hours before the service-impacting event, an engineer issued a restart command; the appliance acknowledged the command but remained frozen. We then submitted a ticket to our data center remote-hands team to physically power-cycle the appliance, as part of the vendor's suggested remediation steps.

When the device was power-cycled, it did not fail open as designed, and a service disruption occurred at our data center. We restored services by manually switching to our secondary load balancer, where the physical connection is shared between the second security device and passthrough load balancer traffic.

We currently have open tickets with our vendor covering the following:

  • Fail-open failure: We've requested forensic sampling of the event.
  • Hardware replacement: The affected device will be replaced after logs and forensics are recovered.
  • Testing procedures: To be carried out once the fix is in place.

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.