Service Incident November 11th 2015

As of 12:20am UTC / 16:20 PST, we're investigating the following service incident:

We're currently investigating reports of an issue impacting our Search Functionality. More information to follow. 

07:26am UTC / 23:26 PST:

We are happy to report that the Search Functionality issue affecting customers hosted in our west coast data center is now resolved. A post-mortem will be available shortly.

POST-MORTEM

This incident was caused by a misbehaving search service node in one of our US data centres. The node was not handling requests properly, but it still appeared healthy enough to remain a viable node for chained execution. As a result, healthy nodes continued to rely on it for fetching results, which it served very slowly or eventually answered with an error. Some queries therefore consumed a large amount of resources before ultimately failing.
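
To illustrate the failure mode, here is a minimal sketch of a scatter-gather style query fan-out, assuming a coordinator that waits on every node before returning. The node names, latencies, and timings are hypothetical and not our actual topology; the point is that a single slow-but-reachable node sets the latency of the whole query.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: node names and latencies are hypothetical, not our topology.
NODE_LATENCY_SECONDS = {
    "search-node-1": 0.05,  # healthy
    "search-node-2": 0.04,  # healthy
    "search-node-3": 8.0,   # misbehaving: accepts work but responds very slowly
}

def query_node(node):
    """Simulate sending a shard query to one node of the cluster."""
    time.sleep(NODE_LATENCY_SECONDS[node])
    return (node, "partial results")

def scatter_gather(query):
    """Fan the query out to every node and wait for all partial results.

    Because the coordinator blocks until every node answers, one slow node
    dominates the latency (and resource usage) of the whole query.
    """
    with ThreadPoolExecutor(max_workers=len(NODE_LATENCY_SECONDS)) as pool:
        return list(pool.map(query_node, NODE_LATENCY_SECONDS))

if __name__ == "__main__":
    start = time.monotonic()
    scatter_gather("example query")
    print(f"query took {time.monotonic() - start:.1f}s")  # ~8s, set by node-3
```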

Our monitoring software had detected an issue with this data node, but it suppressed the alarm because the check was flapping. A larger network issue in the same data centre at the time may also have masked other signals.
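
For context, flap suppression in many monitoring tools works roughly like the sketch below: notifications are held back while a check's state changes too often within a rolling window. The window size and threshold here are hypothetical and not our monitoring configuration.

```python
from collections import deque

class FlapDetector:
    """Minimal sketch of flap detection, in the style of common monitors.

    Thresholds and window size are hypothetical, not our monitoring config.
    """

    def __init__(self, window=21, flap_threshold=0.5):
        self.history = deque(maxlen=window)  # recent check states (True = OK)
        self.flap_threshold = flap_threshold

    def record(self, ok):
        self.history.append(ok)

    def is_flapping(self):
        if len(self.history) < 2:
            return False
        states = list(self.history)
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return changes / (len(states) - 1) >= self.flap_threshold

    def should_alert(self, ok):
        self.record(ok)
        # Alerts are held back while the check keeps changing state, which is
        # how a genuinely failing node can go unreported for a while.
        return (not ok) and not self.is_flapping()
```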

To recover, we isolated the search issue first to the query planning and execution stage and then to a specific data node. Restarting that node allowed the cluster to right itself. Our Operations team is still investigating permanent remediation actions to prevent search-related incidents from recurring.
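
One way a slow node can be pinpointed during this kind of triage is by comparing per-node query latencies against the cluster median, as in the hypothetical diagnostic sketch below. The node names and figures are illustrative, not measurements taken from the incident.

```python
import statistics

def find_outlier_nodes(per_node_latency_ms, factor=5.0):
    """Flag nodes whose recent query latency is far above the cluster median.

    A hypothetical diagnostic sketch, not our actual tooling: the latency
    figures would come from per-node execution metrics.
    """
    median = statistics.median(per_node_latency_ms.values())
    return [
        node for node, latency in per_node_latency_ms.items()
        if latency > factor * median
    ]

# Example: the misbehaving node stands out against otherwise healthy peers.
print(find_outlier_nodes({
    "search-node-1": 45,
    "search-node-2": 38,
    "search-node-3": 7900,  # the node that was later restarted
}))
```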

FOR MORE INFORMATION