Elevated Job Status Updates Latency

Incident Report for Xplenty

Postmortem

Between March 20 08:49pm UTC and March 21 09:03am, Xplenty experienced elevated latency in job status updates, which caused jobs to appear pending or long running. We accept full responsibility for any downtime arising from this issue and apologize for any negative effects our customers experienced.

Here is some additional detail about what happened and the steps we are taking to mitigate future outages of a similar nature.

Who was affected?

Customers running jobs in any cloud region during this period.

What Happened?

One of our message brokers experienced connectivity issues, which caused job status updates to queue up. When our engineering team received the alerts, we immediately began taking the steps necessary to resolve the issue and updated our status page. We scaled up the services we use in order to handle the increasing volume of cluster provisioning requests, and by 9pm UTC our service was again operating at full capacity.

A small portion of the job status messages for workflow task jobs was lost, but we managed to recover the jobs' final states.

What have we learned from the incident?

We have identified that additional alerts on the status of that specific message broker could have enabled us to respond more quickly, which would probably have reduced the outage duration. We have implemented these additional alerts today.

What's next?

We know that our customers trust us with their data integration tasks and their analytics, and we are constantly taking the steps required to ensure our service is reliable and trustworthy. However, incidents like the one that occurred yesterday may still happen, and we have tried to learn from it and grow as a business and as a service.

Node hour usage during the outage will not be charged, regardless of whether your jobs were affected.

Posted Mar 21, 2018 - 11:44 UTC

Resolved

All job status data recovered and updated. The issue is resolved.

Posted Mar 21, 2018 - 11:33 UTC

Update

The job updates backlog is now clear. We are still monitoring the system to verify that things are back to normal.

Posted Mar 21, 2018 - 06:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 21, 2018 - 05:01 UTC

Update

A fix has been pushed and the system was scaled to handle the backlog.

Posted Mar 21, 2018 - 04:50 UTC

Identified

Our engineers identified the culprit and are working on a fix. The issue is related to updating job statuses.

Posted Mar 21, 2018 - 04:24 UTC

Investigating

We're investigating an issue that's causing delayed job status updates.

Posted Mar 21, 2018 - 04:00 UTC