Between March 20 08:49pm UTC and March 21 09:03am UTC, Xplenty experienced elevated latency in job status updates, which caused jobs to appear pending or long running. We accept full responsibility for any downtime arising from this issue and apologize for the negative effects our customers experienced.
Here is some additional detail about what happened and steps we are taking to mitigate future outages of a similar nature.
Who was affected?
Customers running jobs in any cloud region during this period.
What happened?
One of our message brokers experienced connectivity issues, which caused job status updates to queue up. When our engineering team received the alerts, we immediately began taking the steps necessary to resolve the issue, and our status page was updated. We scaled up the services we use to handle the increased volume of cluster provisioning requests, and by 9am UTC on March 21 our service was again operating at full capacity.
A small portion of job status messages for workflow task jobs was lost, but we were able to recover those jobs' final states.
What have we learned from the incident?
We have identified that additional alerts on the status of that specific message broker would have enabled us to respond more quickly, which likely would have reduced the outage duration. We implemented these additional alerts today.
We know that our customers trust us with their data integration tasks and their analytics, and we continually take the steps required to ensure our service is reliable and trustworthy. However, incidents like this one may still happen, and we have tried to learn from it and grow as a business and as a service.
Node hour usage during the outage will not be charged, regardless of whether your jobs were affected.