API and Processing Engine Outage
Incident Report for Xplenty
Postmortem

Earlier this week we experienced an outage in our API and processing engine (which manages customer clusters and customer jobs). Today we are providing an incident report that details the nature of the outage and our response.

The following is an incident report for the Xplenty API and processing engine outage that occurred on Friday, May 30th, 2020. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.

Issue Summary

From 4:12 AM to 10:29 AM Pacific time, the Xplenty API and processing engine were down. At its peak, the outage prevented any new customer jobs from being started on Xplenty’s infrastructure. The root cause was a TLS certificate update by CloudAMQP (our RabbitMQ vendor) that our systems could not validate.

Timeline (all times Pacific)

  • 04:12 AM: Clusters cannot be created, outage begins
  • 04:12 AM: PagerDuty alerts the team and the investigation begins
  • 06:21 AM: Xplenty contacts CloudAMQP after discovering TLS errors
  • 09:45 AM: CloudAMQP publishes a new certificate chain for backward compatibility
  • 10:29 AM: Job & cluster processing engines are operational
  • 11:30 AM: Cleanup & restoration of jobs impacted by the outage begins
  • 2:56 PM: 100% of service is restored and operational

Root Cause

At around 4:00 AM Pacific, CloudAMQP, our RabbitMQ vendor, updated their TLS certificate without prior notification. Due to an outdated RabbitMQ client library, Xplenty was unable to automatically validate the new TLS trust chain. Our API and processing engine were blocked from sending and receiving the RabbitMQ messages needed to perform their operations. Here is CloudAMQP’s incident report.
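
To illustrate the failure mode, the sketch below configures an AMQPS connection that verifies the broker’s certificate against an explicit CA bundle. It assumes a Bunny-style Ruby client, a CLOUDAMQP_URL environment variable, and a system CA path; none of these specifics are confirmed by this report.

    # Minimal sketch: connect to CloudAMQP over AMQPS with peer verification.
    # The Bunny gem, the environment variable, and the CA path are assumptions,
    # not Xplenty's actual setup.
    require "bunny"

    connection = Bunny.new(
      ENV.fetch("CLOUDAMQP_URL"),       # e.g. amqps://user:pass@host/vhost
      tls: true,
      verify_peer: true,                # reject certificate chains that cannot be validated
      tls_ca_certificates: ["/etc/ssl/certs/ca-certificates.crt"]
    )

    # With an outdated client library or stale CA bundle, validation of a newly
    # issued vendor certificate chain fails here and messaging stops.
    connection.start
    channel = connection.create_channel

Keeping the client library and trusted CA bundle current is what would allow a vendor-side certificate rotation like this one to pass validation transparently.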

Resolution and recovery

At 9:45 AM CloudAMQP published a new certificate chain for backward compatibility. Xplenty restarted its systems and customer clusters resumed processing their jobs. Some job states were incorrect due to missing RabbitMQ messages, and these were adjusted manually. By 10:29 AM all services were fully operational, and manual job restoration was completed by 2:56 PM.

Corrective and Preventative Measures

In the last few days we have conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying cause of the issue and to help prevent recurrence:

  • Improved CloudAMQP change management
  • Automation of Ruby dependency auditing using GitHub’s Dependabot (a sample configuration follows this list)
  • More timely updating of Ruby libraries and core infrastructure
  • Improved monitoring and faster error resolution
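
As a sketch of the Dependabot automation mentioned above, a configuration along these lines in .github/dependabot.yml asks GitHub to audit the project’s Ruby (Bundler) dependencies on a regular schedule. The directory and interval shown are illustrative assumptions, not Xplenty’s actual configuration.

    # .github/dependabot.yml (illustrative)
    version: 2
    updates:
      - package-ecosystem: "bundler"   # audit the Ruby gem dependencies
        directory: "/"                 # location of the Gemfile
        schedule:
          interval: "weekly"           # open dependency update PRs once a week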

Xplenty is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
Xplenty Engineering

Posted Jun 03, 2020 - 07:23 UTC

Resolved
The incident has been resolved. The root cause was that one of our third-party vendors updated an expired SSL certificate, and the update caused our system to break.

A backward-compatibility mechanism has been put in place by the vendor, and things should now be back to normal.
Posted May 30, 2020 - 22:00 UTC
Identified
The API and processing engine are offline due to an issue with a third-party vendor; it is currently being addressed.
Posted May 30, 2020 - 11:00 UTC