Service Incident February 5th 2016

 

20:50 UTC | 12:50 PT
We have fixed the issues causing 0 results for native reports. Please refresh your reports and the information will appear within 24 hours.

19:27 UTC | 11:27 PT
Investigation continues into issues with native reporting for West Coast data center customers. Additional updates will be provided as they become available.

18:45 UTC | 10:45 PT
Work continues to investigate and address issues with native reporting, impacting some West Coast data center customers.

18:10 UTC | 10:10 PT
We are seeing improvements in native reporting for some West Coast customers and are continuing to investigate.

17:36 UTC | 09:36 PT
Work continues to resolve the problems with West Coast data center dashboard reporting.

17:04 UTC | 09:04 PT
Work continues to mitigate dashboard reporting issues for some of our West Coast data center customers.

16:46 UTC | 08:46 PT
Mitigation efforts are underway for dashboard reporting issues affecting West Coast data center customers.

16:30 UTC | 08:30 PT
We are aware of issues some West Coast data center customers are having with dashboard stats not loading and are working toward resolution.

POST-MORTEM

In the course of this incident, native reporting (Admin->Reports->Manage) was showing 0 metrics for all customers hosted in the US West Coast data center whose MySQL servers had recently been upgraded. The MySQL upgrade cleared the timezone tables (system tables in the mysql database) on those servers, causing the reporting queries to fail because they use a function that requires those tables to be present. This was an intermittent problem, with the first incidents reported on January 14.
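By way of illustration, named-timezone functions such as CONVERT_TZ depend on those system tables and silently return NULL when the tables are empty, which is enough to zero out any report grouped by local time. The sketch below shows the symptom; it does not claim CONVERT_TZ is the exact function in our queries, and the host and credentials are placeholders:

    # Minimal sketch (hypothetical connection details): a named-timezone
    # conversion silently returns NULL when MySQL's timezone tables have been
    # cleared, so queries that depend on it produce no usable metrics.
    import pymysql

    conn = pymysql.connect(host="db.example.internal", user="reporting",
                           password="secret")
    with conn.cursor() as cur:
        # With populated timezone tables this returns a localized timestamp;
        # after the upgrade wiped the timezone tables it returns NULL instead.
        cur.execute("SELECT CONVERT_TZ(UTC_TIMESTAMP(), 'UTC', 'America/Los_Angeles')")
        (converted,) = cur.fetchone()
    conn.close()

    print("converted:", converted)  # None here is the symptom seen in this incident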

In addition, reports available at https://#{subdomain}.zendesk.com/agent/reporting were also showing 0 metrics for all customers hosted in the US West Coast data center. In this case the cause was replication lag in one of our services, which caused our processing queue to become backlogged to the point where we had to clear the queue and restart the servers. These issues started on Thursday, February 04 and lasted through Friday, February 05, with a total duration of around 22 hours. During this time, very few accounts would have seen their statistics dashboards in the reporting tab updating; dashboards were either missing entirely or updating very slowly.
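For context, replication lag of this kind typically shows up on a MySQL replica as a growing Seconds_Behind_Master value. A minimal monitoring sketch follows; the host, credentials, and threshold are assumptions, not our actual tooling:

    # Minimal sketch (hypothetical host and threshold): poll a replica for
    # Seconds_Behind_Master and flag it when lag grows large enough to back up
    # downstream processing queues.
    import pymysql

    LAG_THRESHOLD_SECONDS = 300  # assumed alerting threshold

    conn = pymysql.connect(host="replica.example.internal", user="monitor",
                           password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone() or {}
    conn.close()

    lag = status.get("Seconds_Behind_Master")
    if lag is None or lag > LAG_THRESHOLD_SECONDS:
        print(f"ALERT: replica lag is {lag}; downstream queues may back up")
    else:
        print(f"replica lag OK: {lag}s")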

To resolve the issues with native reporting, our database admin team re-populated the timezone tables on the upgraded MySQL servers; refreshing the reports then kicked off the jobs that re-populated the metrics for those reports.
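On Linux hosts, MySQL's timezone tables are commonly repopulated from the operating system's zoneinfo database with the mysql_tzinfo_to_sql utility. The sketch below shows that general step; the paths, host, and authentication setup are assumptions rather than our exact procedure:

    # Sketch of repopulating MySQL timezone tables from the OS zoneinfo database.
    # Paths and host are placeholders; the real procedure was run by our DBAs.
    import subprocess

    # Generate SQL for the timezone tables from the system zoneinfo files.
    tzinfo = subprocess.run(
        ["mysql_tzinfo_to_sql", "/usr/share/zoneinfo"],
        check=True, capture_output=True,
    )

    # Load the generated SQL into the `mysql` system schema on the upgraded
    # server (credentials assumed to come from an option file).
    subprocess.run(
        ["mysql", "--host", "db.example.internal", "--user", "root", "mysql"],
        input=tzinfo.stdout, check=True,
    )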

As for the reports tab, initial efforts focused on restarting the queue-processing jobs. It was later discovered that replication slowness, likely due to a missing index, was causing significant delays in the delivery of data to the service. Removing an affected host resolved the delays and allowed replication to be restarted from scratch, restoring the service.

We've since tweaked our process to populate timezone tables as part of the main data load step, rather than as a standalone step. Furthermore, we've implemented a new alert on missing data in MySQL's timezone tables.
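The new alert amounts to verifying that the timezone tables actually contain rows. A minimal sketch of such a check follows; the host and credentials are placeholders and this is not our monitoring stack:

    # Minimal sketch of the kind of check behind the new alert: verify that the
    # MySQL timezone tables are populated, and raise if they have been cleared.
    import pymysql

    conn = pymysql.connect(host="db.example.internal", user="monitor",
                           password="secret")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM mysql.time_zone_name")
        (rows,) = cur.fetchone()
    conn.close()

    if rows == 0:
        raise RuntimeError("mysql.time_zone_name is empty: timezone tables were cleared")
    print(f"timezone tables OK ({rows} named zones)")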

FOR MORE INFORMATION