Previous incidents
Signups and credential creation are not working
Resolved Feb 27, 2025 at 05:08am UTC
Root Cause Analysis (RCA) for the Incident – Timeline in PT
TL;DR
A recent security fix by Supabase impacted database projects using pg_net 0.8.0, causing failures in the POST /credential endpoint and new user signups.
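Projects can check whether they are on the affected extension version via the standard Postgres catalog. A minimal sketch (the SQL is ordinary `pg_extension` catalog access; the affected-version set is taken from the summary above, and all function names are illustrative):

```python
# Sketch of a guard for the root cause above: flag projects running the
# affected pg_net version. Names here are illustrative, not Vapi's code.

AFFECTED_PG_NET_VERSIONS = {"0.8.0"}

# Query a real deployment would run against Postgres to get the version:
VERSION_QUERY = "SELECT extversion FROM pg_extension WHERE extname = 'pg_net';"

def is_affected(extversion: str) -> bool:
    """True if the reported pg_net version is in the affected set."""
    return extversion in AFFECTED_PG_NET_VERSIONS
```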
Timeline
March 5th, 3:00 AM PT: Failures in POST /credential endpoint and new user signups begin.
3:26 AM PT: On-call engineer observes a surge in errors related to POST /credential, including an unusual PostgresError.
3:32 AM PT: Team...

Assembly AI transcriber calls are experiencing degradation.
Resolved Feb 22, 2025 at 02:17pm UTC
It is resolved now. The cause was an account-related problem, which has been fixed. We will take steps to make sure it doesn't happen again.
API returning 413 (payload too large) due to networking misconfiguration
Resolved Feb 21, 2025 at 07:24pm UTC
TL;DR
A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors. API requests to POST /call, /assistant, and /file were impacted.
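The failure mode above is the gateway rejecting requests before they reach the API: a body-size limit in the router's networking filter returns 413 when `Content-Length` exceeds the configured cap, so a misconfigured cap rejects valid uploads. A minimal sketch of that check (the limit value and names are illustrative, not the actual cluster-router configuration):

```python
# Sketch of a request-size guard like the one a router's networking filter
# applies. The limit and names are illustrative, not Vapi's actual config.

MAX_BODY_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB cap

def check_body_size(headers: dict) -> int:
    """Return the HTTP status to emit before reading the body.

    413 (Request Entity Too Large) when Content-Length exceeds the cap,
    200 otherwise. Setting the cap too low rejects legitimate requests.
    """
    declared = int(headers.get("content-length", 0))
    return 413 if declared > MAX_BODY_BYTES else 200
```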
Timeline
- February 20th 9:54pm PST: A change to the cluster-router is released and traffic is cut over to prod1.
- 10:19pm PST: 413 responses from Cloudflare begin appearing at an increased rate in Datadog logs.
- February 21st ~8:50am: Users in Discord flag requests failing with 413 errors.
- ...
Deepgram is failing to send transcription intermittently
Resolved Feb 21, 2025 at 08:57am UTC
Deepgram has resolved the incident on their side. Back to normal.
https://status.deepgram.com/incidents/wr5whbzk45mg
Elevenlabs rate limiting and high latency
Resolved Feb 20, 2025 at 05:11pm UTC
ElevenLabs has confirmed that the problem has been fixed. No failures in the last 10 minutes. Resolving the incident.
Here is the ElevenLabs report on the incident: https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4
ElevenLabs Rate Limiting
Resolved Feb 19, 2025 at 07:43pm UTC
ElevenLabs is imposing rate limits, which will impact Vapi users who have it configured as their voice provider. We are working to resolve this issue; in the meantime, users can restore service by switching to Cartesia or by using their own ElevenLabs API key.
API is degraded
Resolved Jan 30, 2025 at 11:44am UTC
TL;DR
The API experienced intermittent downtime: database connections were choked, and calls subsequently failed after the database ran out of memory. A forced deployment using direct connections, together with capacity adjustments, restored service.
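"Choked database connections" usually comes down to capacity arithmetic: every API pod holds a connection pool, and pods × pool size must stay below the database's connection cap, with headroom for admin and replication slots. A sketch of that arithmetic (all numbers are illustrative, not the actual production values):

```python
# Sketch of the capacity arithmetic behind choked database connections.
# Every pod holds a pool; pods x pool_size must stay under max_connections.
# All numbers are illustrative, not the actual production configuration.

def max_safe_pool_size(max_connections: int, pods: int, reserved: int = 10) -> int:
    """Largest per-pod pool size that keeps total connections under the cap,
    keeping `reserved` slots free for admin and replication use."""
    return (max_connections - reserved) // pods

# e.g. a database allowing 500 connections shared by 30 pods:
assert max_safe_pool_size(500, 30) == 16
```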
Timeline
2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM...
API is down
Resolved Jan 29, 2025 at 05:24pm UTC
TL;DR
A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service.
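The startup dependency described above, and the workaround's shape, can be sketched as a health check that prefers the pooler but falls back to a direct connection. This is a simplified illustration, not the actual API bootstrap code; the connector functions are stand-ins for a real Postgres driver:

```python
# Sketch of a startup health check with a direct-connection fallback, the
# shape of the workaround described above. The connect callables are
# stand-ins; real code would use an actual Postgres driver.

def healthy(connect) -> bool:
    """Run `SELECT 1` through the given connector; False on any error."""
    try:
        return connect("SELECT 1") == 1
    except Exception:
        return False

def choose_path(pooler_connect, direct_connect) -> str:
    """Prefer the pooler, but boot on a direct connection if it is down."""
    if healthy(pooler_connect):
        return "pooler"
    if healthy(direct_connect):
        return "direct"
    raise RuntimeError("database unreachable on both paths")

# Simulating the incident: pooler deployment failed, direct path still works.
def broken_pooler(q): raise ConnectionError("pooler deploy failed")
def working_direct(q): return 1
assert choose_path(broken_pooler, working_direct) == "direct"
```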
Timeline
8:08am PST, Jan 29: Monitoring detects Postgres errors.
8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscri...
Updates to DB are failing
Resolved Jan 21, 2025 at 01:23pm UTC
TL;DR
A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service.
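When Postgres is in read-only mode, writes fail with SQLSTATE 25006 (`read_only_sql_transaction`). A monitor that alerts on that specific code would surface this failure mode at the first rejected write rather than after errors accumulate. A minimal sketch (illustrative only, not the actual monitoring setup):

```python
# Sketch of detecting the failure mode above. Writes against a read-only
# Postgres fail with SQLSTATE 25006 (read_only_sql_transaction); alerting
# on that code catches the outage early. Illustrative code, not Vapi's.

READ_ONLY_SQLSTATE = "25006"

def is_read_only_error(sqlstate: str) -> bool:
    return sqlstate == READ_ONLY_SQLSTATE

def classify(sqlstates: list[str]) -> str:
    """'db-read-only' if any write failed with 25006, else 'other'."""
    return "db-read-only" if any(map(is_read_only_error, sqlstates)) else "other"

# A unique-violation (23505) alone is routine; 25006 is the outage signal.
assert classify(["23505", "25006"]) == "db-read-only"
```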
Timeline
5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only.
5:05am: Write operations began failing.
5:18am: The API went down due to accumulated errors.
~5:23am: The team initiated a database restart.
5:...
Calls not connecting for `weekly` channel
Resolved Jan 13, 2025 at 04:49pm UTC
TL;DR: Scaler failed and we didn't have enough workers
Root Cause
During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment.
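The fix this root cause implies is to resolve the Redis hostname on every reconnect instead of caching the IP at startup, so a deployment that rotates IPs cannot strand the scaler. A sketch under that assumption (the resolver is injected so the example runs without real DNS; names are illustrative, not the actual scaler code):

```python
# Sketch of the fix implied by the root cause above: re-resolve the Redis
# hostname before each connection attempt rather than caching an IP at
# startup. The resolver is injected so this runs without real DNS.

def connect_to_queue(hostname: str, resolve, attempts: int = 3) -> str:
    """Do a fresh lookup before each attempt; return the resolved address."""
    last_error = None
    for _ in range(attempts):
        try:
            return resolve(hostname)  # fresh lookup, never a cached address
        except OSError as exc:
            last_error = exc
    raise ConnectionError(f"queue unreachable: {last_error}")

# Simulate an IP rotation mid-deploy: first lookup fails, the retry succeeds.
answers = iter([OSError("stale record"), "10.0.3.7"])
def fake_resolve(host):
    answer = next(answers)
    if isinstance(answer, Exception):
        raise answer
    return answer
assert connect_to_queue("redis.internal", fake_resolve) == "10.0.3.7"
```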
Timeline
Jan 11, 5:12 PM: Deploy started
Jan 13, 6:00 AM: Calls started failing due to scaling issues
Jan 13, 8:45 AM: Resolved by moving traffic to daily
Ja...
OpenAI API is degraded
Resolved Dec 12, 2024 at 04:00am UTC
Resolved: https://status.openai.com/