Service Incident January 27th 2016

 

15:48 UTC | 07:48 PT
We are happy to report that the performance issues affecting some customers are now resolved. A postmortem will be posted below shortly.

15:26 UTC | 07:26 PT
Performance has improved at our US East Coast Data Center. We will continue to monitor and update shortly.

14:54 UTC | 06:54 PT
We are investigating performance issues at our US East Coast Data Center. More information shortly.

POST-MORTEM

This incident began when one of our services, which is responsible for rule execution, stopped operating properly because of an increase in bad queries hitting it. That increase was, in turn, caused by an error in the configuration of the query killer: it requires rule queries to have a specific SQL format, and that format was misconfigured, so the query killer did not act when expected.

No QA tests specifically covered query killing. These are extreme situations that get tested out in production, though we did add a test that locks in the format of the comments. We also performed a manual test to confirm that the queries get killed.
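
As an illustration only, a test that locks in the comment format could look roughly like the sketch below. The marker text, the 30-second limit, and the names tag_rule_query and is_killable are all hypothetical; this post-mortem does not state the real format or how the query killer is implemented.

```python
import re

# Hypothetical marker format. The real comment format is not given in the
# post-mortem; this one is an assumption made for illustration.
RULE_QUERY_TAG = "/* rule_query id={rule_id} */"
RULE_QUERY_PATTERN = re.compile(r"^/\* rule_query id=(\d+) \*/")


def tag_rule_query(sql: str, rule_id: int) -> str:
    """Prefix a rule query with the marker comment the killer looks for."""
    return RULE_QUERY_TAG.format(rule_id=rule_id) + " " + sql


def is_killable(sql: str, runtime_s: float, limit_s: float = 30.0) -> bool:
    """Return True only for tagged rule queries that exceed the time limit."""
    return RULE_QUERY_PATTERN.match(sql) is not None and runtime_s > limit_s


def test_tag_format_is_recognised() -> None:
    """Fail if the producer's tag and the killer's pattern drift apart."""
    tagged = tag_rule_query("SELECT 1", rule_id=42)
    assert tagged == "/* rule_query id=42 */ SELECT 1"
    assert is_killable(tagged, runtime_s=60.0)
    # Untagged queries and fast queries must be left alone.
    assert not is_killable("SELECT 1", runtime_s=60.0)
    assert not is_killable(tagged, runtime_s=1.0)
```

The point of the sketch is that the code writing the comment and the code matching it are exercised together, so a change to either side alone breaks the test before it reaches production.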

A restart and redeploy of the service fully resolved all issues affecting customers based in our US East Coast data center, as well as those affecting some trial signups.
