Intermittent Job Failures and Clusters Stuck on Pending

Incident Report for Xplenty

Postmortem

Issue Summary

From 7:11 AM UTC to 8:58 AM UTC, a number of jobs and clusters intermittently failed with errors or remained stuck in the pending state.

Root Cause

This outage was caused by our Redis component reaching 100% memory usage, which led to the intermittent issues. Redis serves as a caching mechanism for our application.

Resolution and recovery

We are taking the following steps to prevent this incident from happening again:

Vertically scaled Redis to provide more memory.
Improve monitoring so we can quickly detect Redis-related memory issues.
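
As an illustration of the kind of memory check the monitoring step refers to, here is a minimal sketch in Python. It is not our production monitoring. It assumes the text output of the Redis INFO memory command is already available as a string, and reads the used_memory and maxmemory fields from it. The function names and the 80% alert threshold are hypothetical choices made for this example.

```python
from typing import Dict, Optional


def parse_info(raw: str) -> Dict[str, str]:
    """Parse the key:value lines of Redis INFO output into a dict."""
    fields: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def memory_usage_ratio(raw_info: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is set."""
    fields = parse_info(raw_info)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    # In Redis, maxmemory of 0 means no configured limit.
    if limit == 0:
        return None
    return used / limit


def should_alert(raw_info: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: alert once usage reaches the given threshold."""
    ratio = memory_usage_ratio(raw_info)
    return ratio is not None and ratio >= threshold


if __name__ == "__main__":
    sample = "# Memory\r\nused_memory:900\r\nmaxmemory:1000\r\n"
    print(memory_usage_ratio(sample), should_alert(sample))
```

Alerting well below 100% leaves time to scale up or evict keys before the cache runs out of memory.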

We appreciate your patience and again apologize for the impact on you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
Integrate.io Engineering

Posted Sep 01, 2022 - 05:05 UTC

Resolved

From approximately 12:40 AM to 2:30 AM UTC, an issue in one of our infrastructure components used for caching affected clusters and jobs provisioned during that period. Our engineers have now fixed the issue.

Posted Sep 01, 2022 - 04:57 UTC