Earlier this week we experienced an outage in our API and processing engine, which manage customer clusters and customer jobs. Today we are providing an incident report that details the nature of the outage and our response.
The following is an incident report for the Xplenty API and processing engine outage that occurred on Friday, May 30th, 2020. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.
From 4:12 AM to 10:29 AM Pacific time, the Xplenty API and processing engine were down. At its peak, no new customer jobs could be started on Xplenty’s infrastructure. The root cause of this outage was an unannounced TLS certificate update by CloudAMQP, our RabbitMQ vendor.
At around 4:00 AM Pacific, CloudAMQP, our RabbitMQ vendor, updated their TLS certificate without prior notification. Due to an outdated RabbitMQ client library, Xplenty was unable to validate the new TLS trust chain automatically. Our API and processing engine were blocked from sending and receiving the RabbitMQ messages needed to perform their operations. Here is CloudAMQP’s incident report.
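To illustrate the failure mode (this is a minimal sketch using Python’s standard `ssl` module, not Xplenty’s actual client code): an AMQPS client is expected to validate the broker’s full certificate chain against an up-to-date CA store. A client library that pins an old intermediate certificate, or ships a stale CA bundle, fails the moment the vendor rotates its chain.

```python
import ssl

# A properly configured TLS client requires certificate validation
# and hostname checking; create_default_context() enables both.
context = ssl.create_default_context()
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# Loading the operating system's CA store (rather than bundling a
# fixed chain inside the client library) keeps validation working
# across vendor-side certificate rotations.
context.load_default_certs()
```

A client built this way survives a chain rotation as long as the new chain terminates in a root the OS trust store already contains; a library with a hard-coded chain does not.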
At 9:45 AM, CloudAMQP published a new certificate chain for backward compatibility. Xplenty restarted its systems, and customer clusters began processing their jobs again. Some job states were incorrect due to the missing RabbitMQ messages, and these were adjusted manually. By 10:29 AM all services were fully operational, and the manual job restoration was completed by 2:56 PM.
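The manual adjustment step can be sketched as follows. This is a hypothetical illustration (the function, state names, and window bounds are assumptions, not Xplenty’s code): any non-terminal job whose state was last updated inside the outage window may have missed a terminal RabbitMQ message and is flagged for review.

```python
from datetime import datetime

# Outage window from the timeline above (Pacific time).
OUTAGE_START = datetime(2020, 5, 30, 4, 12)
OUTAGE_END = datetime(2020, 5, 30, 10, 29)

def needs_review(state: str, last_update: datetime) -> bool:
    """Flag jobs whose recorded state may be stale.

    A job still in a non-terminal state whose last update falls inside
    the outage window may have missed its completion/failure message.
    (State names here are illustrative.)
    """
    return (state not in ("completed", "failed")
            and OUTAGE_START <= last_update <= OUTAGE_END)
```

Jobs flagged this way can then be cross-checked against the processing engine’s actual cluster state and corrected.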
In the last few days we have conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying cause of the issue and to help prevent recurrence:
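One common preventive measure for this class of incident is monitoring the broker’s certificate proactively. The helper below is a hypothetical sketch (not a committed Xplenty tool): it parses the `notAfter` field in the format returned by Python’s `ssl.SSLSocket.getpeercert()` and reports how many days remain, so an alert can fire well before any rotation or expiry.

```python
import ssl  # format reference: ssl.SSLSocket.getpeercert()['notAfter']
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Return whole days until a certificate expires.

    `not_after` uses the OpenSSL text format that getpeercert()
    returns, e.g. 'Jun  1 12:00:00 2030 GMT'.
    """
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days
```

Wired into a periodic check against the broker endpoint, this turns a surprise vendor-side certificate change from an outage into an early warning.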
Xplenty is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.
Sincerely,
Xplenty Engineering