API is degraded

TL;DR

The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service.

Timeline

2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified.
3:42AM: The database was restarted and traffic was rerouted, leading to improved service health.
3:50AM: The database’s capacity was increased and the service stabilized fully.

Impact

The API experienced multiple intermittent outages.
Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state.

Root Cause

Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously.
The database ran out of memory—not because of sheer volume alone, but due to a misconfiguration (insufficient maxlocksper_transaction), which was exacerbated by a thundering herd of requests.

Changes we've made

Increase Capacity: Boost the database’s capacity.
Adjust Configuration: Raise the maxlocksper_transaction setting.
Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment.
Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes.

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5