API is down
Resolved
Jan 29 at 09:24am PST
TL;DR
A failed Supabase deployment of its connection pooler, Supavisor, in one region caused all database connections to fail. Because API pods require a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect to the database directly, which restored service.
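For illustration only, a minimal sketch of what the workaround could look like, assuming a Node/TypeScript backend using the node-postgres (pg) client; the environment variable names, hosts, and bypass flag are hypothetical, not Vapi's actual configuration:

```typescript
import { Pool } from "pg";

// Hypothetical connection strings: the pooled URL goes through Supavisor
// (conventionally port 6543), the direct URL targets Postgres itself (port 5432).
const POOLER_URL = process.env.SUPABASE_POOLER_URL;
const DIRECT_URL = process.env.SUPABASE_DIRECT_URL;

// Workaround: a flag lets the API bypass the pooler, so a Supavisor outage
// no longer takes every database call down with it.
export const pool = new Pool({
  connectionString: process.env.BYPASS_POOLER === "true" ? DIRECT_URL : POOLER_URL,
});
```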
Timeline
8:08am PST, Jan 29: Monitoring detects Postgres errors.
8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to a status-page subscription issue, the team wasn’t notified immediately.)
8:18am: The API goes down.
8:22am: Temporary API recovery occurs as some non-pooler-dependent requests succeed.
8:25am: The API fails again; the incident response team assembles.
8:28am: Investigation reveals API pods are repeatedly restarting.
8:30am: It’s determined that database call failures are triggering the pod restarts.
8:36am: Support confirms that a connection pooler outage in the region is affecting service.
8:38am: A call with support leads to the decision to use direct database connections.
8:44am: A change is deployed to bypass the pooler.
9:12am: The API begins to recover as calls start succeeding.
9:19am: Full service is restored.
Impact
The API was down for 54 minutes, with all calls failing due to reliance on the provider’s system for tracking and organization data.
While some API requests that don’t depend on the pooler continued working, new API pods entered crash loops because their health checks, which query the database, failed (sketched below).
Failed database operations caused call processing to hang, producing errors that prevented jobs from closing properly.
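To make the crash-loop mechanism concrete, here is a minimal sketch of a health check that queries the database, assuming an Express service and the pg client; the route, port, and variable names are illustrative:

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Health endpoint hit by the orchestrator's probes. When the pooler outage made
// every query fail, this returned 503, the probes failed, and the pod was
// restarted repeatedly -- the crash loop described above.
app.get("/health", async (_req, res) => {
  try {
    await pool.query("SELECT 1");
    res.status(200).send("ok");
  } catch {
    res.status(503).send("database unreachable");
  }
});

app.listen(3000);
```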
Root Cause
A failed connection pooler deployment disrupted all database connections.
This affected API operations that depended on those connections, leading to cascading failures and hanging processes.
Changes we've made
Reduce Deployment Time: Shorten backend update runtimes to under five minutes.
Switch to Direct Connections: Use direct database connections exclusively to avoid pooler issues.
Increase Connection Capacity: Boost the number of direct connections available to handle higher loads.
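A minimal sketch of what the direct-connection and capacity changes could look like with the pg client; the pool size and timeout are placeholders, not the production values:

```typescript
import { Pool } from "pg";

// With the pooler out of the path, each pod's own pool must absorb the load
// Supavisor used to multiplex, and Postgres's max_connections has to cover
// pods * max. The numbers here are placeholders.
export const pool = new Pool({
  connectionString: process.env.SUPABASE_DIRECT_URL, // direct, not via Supavisor
  max: Number(process.env.PG_POOL_MAX ?? 50),        // per-pod connection cap
  idleTimeoutMillis: 30_000,
});
```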
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
Affected services
Vapi API
Updated
Jan 29 at 09:05am PST
We've rolled out a direct connection to the database for now. Calls are going through. We're waiting on Supabase to confirm a fix to resolve the outage.
Affected services
Vapi API
Updated
Jan 29 at 08:35am PST
We are impacted by a Supabase outage: https://status.supabase.com
We're working with their team to restore service ASAP.
Affected services
Vapi API
Created
Jan 29 at 08:28am PST
API is down. We're investigating. Updates to follow.
Affected services
Vapi API