Degraded

API is degraded

Jan 30 at 03:30am PST
Affected services
Vapi API

Resolved
Jan 30 at 03:44am PST

TL;DR

The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service.

Timeline

2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified.
3:42AM: The database was restarted and traffic was rerouted, leading to improved service health.
3:50AM: The database’s capacity was increased and the service stabilized fully.

Impact

The API experienced multiple intermittent outages.
Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state.

Root Cause

Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously.
The database ran out of memory, not because of sheer volume alone, but because of a misconfiguration (an insufficient max_locks_per_transaction setting), which was exacerbated by a thundering herd of requests.
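
One common way to soften a thundering herd like the one described above is to have clients retry with exponential backoff and random jitter, so that reconnecting pods do not all hit the database at the same moment. The following is only a minimal sketch of that general technique, not a description of how Vapi's services retry; the function name backoff_delays and the base and cap values are illustrative assumptions.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**attempt)], so clients that fail at the
    same time spread their retries out instead of retrying in lockstep.
    The base and cap defaults are hypothetical, for illustration only.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles each attempt, but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the delays a single client would wait before each of 6 retries.
    for attempt, delay in enumerate(backoff_delays(6)):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

A caller would sleep for each delay before the corresponding reconnect attempt. The cap keeps the worst-case wait bounded, while the jitter is what breaks up the synchronized burst.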

Changes we've made

Increase Capacity: Boost the database’s capacity.
Adjust Configuration: Raise the max_locks_per_transaction setting.
Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment.
Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes.
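
For context on the configuration change above: in PostgreSQL, the shared lock table is sized for roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) lockable objects, so raising max_locks_per_transaction increases how many locks all sessions can hold at once. The sketch below simply evaluates that formula. The defaults shown are PostgreSQL's stock defaults, and the raised value of 256 is a hypothetical example, since the values actually used in this incident are not stated.

```python
def lock_table_slots(max_locks_per_transaction=64,
                     max_connections=100,
                     max_prepared_transactions=0):
    """Approximate number of object locks PostgreSQL's shared lock table can track.

    Defaults are PostgreSQL's stock defaults, not the values from this incident.
    """
    return max_locks_per_transaction * (max_connections + max_prepared_transactions)


if __name__ == "__main__":
    before = lock_table_slots()                               # stock defaults
    after = lock_table_slots(max_locks_per_transaction=256)   # hypothetical raised value
    print(f"lock slots with defaults: {before}")
    print(f"lock slots after raising the setting: {after}")
```

Because the table is shared, a burst of transactions that each touch many tables or partitions can exhaust it even when no single transaction looks unusual, which is consistent with a request spike making the misconfiguration visible.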

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Created
Jan 30 at 03:30am PST

We're suspecting another Supabase DB issue, remediating ASAP.