As of 13:52 UTC+1 / 05:52 PDT, we're investigating the following service incident:
We're actively investigating problems with bulk ticket actions affecting our EU data centre. Updates to follow as we know more.
14:41 UTC+1 / 06:41 PDT
We're actively working to mitigate issues for customers in our EU data centre. We apologise for the inconvenience caused.
15:42 UTC+1 / 07:42 PDT
Our Operations team is hard at work resolving the problems affecting customers in our EU data centre. More to follow as we proceed.
17:13 UTC+1 / 09:13 PDT
Remediation is still ongoing for the issues impacting customers in our EU data centre. Additional updates to follow as we progress towards full resolution.
19:34 UTC+1 / 11:34 PDT
The issues with bulk updates and ticket sharing that were impacting some EU customers are now resolved. A post-mortem will follow.
We have completed our review of this incident. The root cause was processing delays in part of our job processing infrastructure: a customer mistakenly submitted 140,000 jobs of a particular type in a short period of time, a significant increase over our overall volume for that job type. The resulting backlog of jobs delayed sharing actions as well as bulk ticket updates. The delayed jobs eventually cleared, but took much longer to do so than under normal operations.
Our response to this issue included contacting the customer, blocking their new job submissions, and working with them to address the bug in their system. We also devised a change plan to delete the invalid jobs from the queue, restoring normal processing times more quickly.
As a result of the post-mortem review, we also identified that this job type did not have sufficient rate-limiting and validation protections. We have opened an incident remediation story with one of our engineering teams to address this as a priority. We want to ensure that load issues from one customer do not impact performance for any others.
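To illustrate the kind of per-customer protection described above, here is a minimal token-bucket sketch in Python. All names, capacities, and refill rates are illustrative assumptions for this write-up, not our production implementation: each customer gets a bucket of tokens refilled at a steady rate, and a job submission is rejected once that customer's bucket is empty, so one customer's burst cannot starve the shared queue.

```python
import time


class PerCustomerRateLimiter:
    """Token-bucket rate limiter keyed by customer ID (illustrative sketch).

    Each customer holds up to `capacity` tokens, refilled at `refill_rate`
    tokens per second. A job submission consumes one token and is rejected
    when the customer's bucket is empty.
    """

    def __init__(self, capacity=100, refill_rate=10.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock  # injectable for deterministic testing
        self._buckets = {}  # customer_id -> (tokens, last_refill_timestamp)

    def allow(self, customer_id):
        """Return True if this customer's submission may proceed."""
        now = self.clock()
        tokens, last = self._buckets.get(customer_id, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self._buckets[customer_id] = (tokens - 1, now)
            return True
        self._buckets[customer_id] = (tokens, now)
        return False
```

Because each customer has an independent bucket, a runaway integration exhausts only its own allowance while other customers continue to submit jobs normally; an injectable clock keeps the behaviour testable without real delays.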