Previous incidents

Nov 2024 to Jan 2025

January 2025

Jan 30, 2025

1 incident

API is degraded

Resolved Jan 30 at 03:44am PST

TL;DR

The API experienced intermittent downtime because the database ran out of memory, which choked database connections and caused subsequent call failures. A forced deployment that used direct connections, together with capacity adjustments, restored service.

Timeline

2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM...


Jan 29, 2025

1 incident

API is down

Downtime

Resolved Jan 29 at 09:24am PST

TL;DR

Supabase's failed deployment of its connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, which restored service.
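The startup dependency described above can be illustrated with a small sketch. It is not the team's implementation: the endpoint names `pooler` and `direct`, the helper `first_healthy`, and the two stub checks are all hypothetical, and the report describes the bypass as a manual workaround rather than an automatic fallback.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical (name, health_check) pairs; real connection details are not in the report.
Endpoint = Tuple[str, Callable[[], bool]]


def first_healthy(endpoints: Sequence[Endpoint]) -> Optional[str]:
    """Return the name of the first endpoint whose health check passes, else None."""
    for name, check in endpoints:
        try:
            if check():
                return name
        except Exception:
            # A check that raises (for example, connection refused) counts as unhealthy.
            continue
    return None


def pooler_check() -> bool:
    # Stub simulating the failed pooler deployment.
    raise ConnectionError("pooler unavailable")


def direct_check() -> bool:
    # Stub simulating a healthy direct database connection.
    return True


chosen = first_healthy([("pooler", pooler_check), ("direct", direct_check)])
print(chosen)
```

With a single pooled endpoint the function returns `None` and the pod cannot start, which matches the failure described. Listing a direct connection as a second candidate is one possible design choice; direct connections bypass pooling, so it would need its own capacity limits.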

Timeline

8:08am PST, Jan 29: Monitoring detects Postgres errors.
8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscri...


Jan 21, 2025

1 incident

Updates to DB are failing

Degraded

Resolved Jan 21 at 05:23am PST

TL;DR

A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service.
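A read-only state like this can be detected before writes start failing. The sketch below assumes the database is PostgreSQL (the report for this incident does not say so explicitly) and uses the `SHOW transaction_read_only` setting, which PostgreSQL reports as `on` or `off`. `FakeCursor` is a hypothetical stand-in for a DB-API cursor so the example runs without a database.

```python
class FakeCursor:
    """Hypothetical stand-in for a DB-API cursor; returns a fixed setting value."""

    def __init__(self, value: str) -> None:
        self._value = value
        self.last_sql = None

    def execute(self, sql: str) -> None:
        # Record the statement instead of sending it to a server.
        self.last_sql = sql

    def fetchone(self):
        return (self._value,)


def is_read_only(cursor) -> bool:
    """Return True if the current session would reject writes."""
    cursor.execute("SHOW transaction_read_only")
    (value,) = cursor.fetchone()
    return value == "on"


print(is_read_only(FakeCursor("on")))
print(is_read_only(FakeCursor("off")))
```

Run periodically against the production database, a check of this kind could raise an alert during the window between the setting changing and the API going down.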

Timeline

5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only.
5:05am: Write operations began failing.
5:18am: The API went down due to accumulated errors.
~5:23am: The team initiated a database restart.
5:...


Jan 13, 2025

1 incident

Calls not connecting for `weekly` channel

Degraded

Resolved Jan 13 at 08:49am PST

TL;DR: Scaler failed and we didn't have enough workers

Root Cause

During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment.
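The failure mode above can be sketched with a hypothetical queue-depth scaler. This is not the team's scaling system: the function name, its parameters and the numbers are assumptions chosen for illustration. A queue depth of `None` models the scaler being unable to reach the queue.

```python
import math
from typing import Optional


def desired_workers(
    queue_depth: Optional[int],
    jobs_per_worker: int,
    min_workers: int,
    max_workers: int,
    current: int,
) -> int:
    """Return the target worker count for the observed queue depth."""
    if queue_depth is None:
        # Queue unreachable: hold the current count, clamped to the allowed range.
        return max(min_workers, min(current, max_workers))
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(needed, max_workers))


# Queue visible: 95 queued jobs at 10 jobs per worker needs 10 workers.
print(desired_workers(95, 10, min_workers=2, max_workers=20, current=2))
# Queue not visible: the count stays where it was, regardless of real load.
print(desired_workers(None, 10, min_workers=2, max_workers=20, current=2))
```

The second call shows why the worker count stayed fixed: without a queue reading there is no scale-up signal. Treating an unreachable queue as an alert condition, rather than silently holding the count, is one way to surface this earlier.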

Timeline

Jan 11, 5:12 PM: Deploy started
Jan 13, 6:00 AM: Calls started failing due to scaling issues
Jan 13, 8:45 AM: Resolved by moving traffic to daily
Ja...


December 2024

Dec 11, 2024

1 incident

OpenAI API is degraded

Downtime

Resolved Dec 11 at 08:00pm PST

Resolved: https://status.openai.com/


November 2024

Nov 14, 2024

1 incident

ElevenLabs is degraded

Degraded

Resolved Nov 14 at 01:08pm PST

Service should be back to normal now, per ElevenLabs.
https://status.elevenlabs.io/


Nov 12, 2024

1 incident

API is degraded

Degraded

Resolved Nov 12 at 02:15pm PST

TL;DR: API pods were choked. Our probes missed it.

Root Cause

Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods.
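A probe that is sensitive to database contention might look like the sketch below. It is an illustration under assumptions, not the probe the team runs: `probe_ok`, `db_ping` and the latency threshold are hypothetical names and values. The idea is that the probe fails not only when the database call errors, but also when it is too slow.

```python
import time
from typing import Callable


def probe_ok(db_ping: Callable[[], None], max_latency_s: float) -> bool:
    """Return True only if db_ping succeeds within max_latency_s seconds."""
    start = time.monotonic()
    try:
        db_ping()
    except Exception:
        # Any database error fails the probe.
        return False
    elapsed = time.monotonic() - start
    return elapsed <= max_latency_s


def fast_ping() -> None:
    # Stub for a healthy database round trip.
    return None


def slow_ping() -> None:
    # Stub simulating a round trip slowed by contention.
    time.sleep(0.05)


print(probe_ok(fast_ping, max_latency_s=0.5))
print(probe_ok(slow_ping, max_latency_s=0.01))
```

A probe that only checks whether the process responds would pass in both cases. The latency threshold is what lets the orchestrator notice a choked pod and restart it.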

Timeline

November 12th 2:00pm PT - Customer reports of API failures
November 12th 2:05pm PT - On-call team determined the cause, then scaled and restarted pods
November 12th 2:10pm PT - Full functionality restored.

Changes we've implemented

Restored higher sensitivity thresholds fo...


Nov 11, 2024

1 incident

Phone calls are degraded

Degraded

Resolved Nov 11 at 05:03pm PST

TL;DR: API gateway rejected WebSocket requests

Summary

On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters.

Impact

Duration: 43 minutes
Affected services: WebSocket-based phone calls
System returned 404 errors for affected connections
Service was fully restored by routing traffic to our backup clu...


Nov 07, 2024

1 incident

API is down

Downtime

Resolved Nov 07 at 06:11pm PST

A misconfiguration in the networking cluster caused the outage. It has been resolved.

Here's what happened:

Summary

On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected.

Impact

Duration: 11 minutes
Service returned 521 errors for new inbound API calls
Existing API calls remained stable
Service was fully restored at 6...
