Intermittent Jobs and Clusters Stuck on Pending
Incident Report for Xplenty
Postmortem

Root Cause

During our recent database maintenance on September 16th at 6 AM UTC, we encountered resource limitations from our upstream provider, which resulted in some worker tasks being missed.

Resolution and Mitigation

  • Immediate Actions Taken:
    We immediately stabilized the environment by restarting affected services and applications to minimize disruption.
  • Long-Term Measures:
    To prevent this issue from happening again:

    • Implemented automatic termination of long-idle connections to free up resources.
    • Enhanced our monitoring for pending jobs, ensuring that any long-running tasks are promptly identified and addressed.
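
As an illustration of the first measure above, the following is a minimal sketch of how long-idle connections might be selected for termination. It is not our production implementation: the `Connection` record, the `find_idle_connections` helper, and the 30-minute threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Connection:
    """Hypothetical snapshot of one database connection."""
    conn_id: int
    state: str               # "idle" or "active"
    last_activity: datetime  # timezone-aware timestamp of the last query


def find_idle_connections(connections, now, max_idle=timedelta(minutes=30)):
    """Return ids of connections that have been idle longer than max_idle.

    The 30-minute default is an assumed value, not a documented setting.
    Active connections are never selected, however old their last activity.
    """
    return [
        c.conn_id
        for c in connections
        if c.state == "idle" and now - c.last_activity > max_idle
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    snapshot = [
        Connection(1, "idle", now - timedelta(hours=2)),
        Connection(2, "active", now - timedelta(hours=2)),
        Connection(3, "idle", now - timedelta(minutes=5)),
    ]
    # Only connection 1 is both idle and past the threshold.
    print(find_idle_connections(snapshot, now))
```

In practice the returned ids would be passed to whatever termination mechanism the database offers; that step is left out here because it depends on the database in use.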

Preventive Actions

  • Monitoring Improvements:
    We have implemented monitoring for jobs stuck in a pending state, so that we can identify long-running tasks and respond before they impact operations.
  • Additional Measures:
    We have increased the resources allocated to our database and are working closely with our upstream provider to ensure resource availability.
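
The pending-state monitoring described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual monitoring code: the job fields (`id`, `status`, `submitted_at`), the `find_stuck_jobs` helper, and the 15-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_jobs(jobs, now, max_pending=timedelta(minutes=15)):
    """Return ids of jobs that have been pending longer than max_pending.

    `jobs` is an iterable of dicts with the assumed keys 'id', 'status',
    and 'submitted_at' (a timezone-aware datetime). The 15-minute default
    is an assumed value for the example.
    """
    return [
        job["id"]
        for job in jobs
        if job["status"] == "pending" and now - job["submitted_at"] > max_pending
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    jobs = [
        {"id": "job-a", "status": "pending", "submitted_at": now - timedelta(hours=1)},
        {"id": "job-b", "status": "running", "submitted_at": now - timedelta(hours=1)},
        {"id": "job-c", "status": "pending", "submitted_at": now - timedelta(minutes=2)},
    ]
    # Only job-a is pending and past the threshold, so it would raise an alert.
    for job_id in find_stuck_jobs(jobs, now):
        print(f"ALERT: {job_id} has been pending too long")
```

A check like this would typically run on a schedule, with the threshold tuned so that normal queueing delays do not trigger alerts.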

Next Steps

We will continue to monitor the situation closely and make adjustments to workflows or settings as needed. Our team is committed to preventing future incidents of this nature, and we sincerely apologize for any inconvenience caused by this issue.

Posted Oct 01, 2024 - 08:45 UTC

Resolved
During our recent database maintenance, we encountered intermittent resource limitations from our upstream provider, which resulted in some worker tasks being missed. We have put measures in place, and this issue is now resolved.
Posted Oct 01, 2024 - 08:31 UTC