Previous incidents
Increased 480 Temporarily Unavailable cases for SIP inbound
Resolved Apr 07 at 10:00pm PDT
For the RCA, please check out https://status.vapi.ai/incident/528384?mp=true
3 previous updates
SIP calls failing intermittently
Resolved Apr 07 at 10:00pm PDT
For the RCA, please check https://status.vapi.ai/incident/528384?mp=true
2 previous updates
Some SIP calls have longer reported call duration than reality
Resolved Mar 28 at 03:10pm PDT
Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively.
API degradation
Resolved Mar 24 at 09:33pm PDT
TL;DR
After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (/health/ping_sentinel.sh). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forwar...
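For illustration only, here is a minimal sketch of what a Sentinel liveness probe like /health/ping_sentinel.sh might check, rewritten in TypeScript with ioredis; the host, port, and timeout are placeholder assumptions, not the actual probe's configuration.

```typescript
// Hypothetical re-implementation of a Sentinel liveness probe (the real probe
// is a shell script, /health/ping_sentinel.sh). Assumes ioredis is available.
import Redis from "ioredis";

async function pingSentinel(host = "127.0.0.1", port = 26379): Promise<void> {
  const sentinel = new Redis({ host, port, connectTimeout: 2000, lazyConnect: true });
  try {
    await sentinel.connect();
    const reply = await sentinel.ping(); // Sentinel answers PING like a normal Redis server
    if (reply !== "PONG") throw new Error(`unexpected reply: ${reply}`);
  } finally {
    sentinel.disconnect();
  }
}

// Exit non-zero so the orchestrator treats the pod as unhealthy when the check fails.
pingSentinel().then(
  () => process.exit(0),
  (err) => { console.error("sentinel liveness check failed:", err); process.exit(1); },
);
```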
1 previous update
Call worker degradation
Resolved Mar 24 at 04:45pm PDT
The issue was mitigated via rollback. We're investigating and will update with an RCA
1 previous update
Cloudflare R2 storage is degraded, causing call recording upload failures
Resolved Mar 21 at 03:55pm PDT
Recording upload errors have subsided. We are continuing to monitor
2 previous updates
Google Gemini Voicemail Detection is intermittently failing
Resolved Mar 19 at 04:05pm PDT
TL;DR
It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identifi...
3 previous updates
Intermittent errors when ending calls.
Resolved Mar 18 at 04:36pm PDT
Resolved now.
RCA:
Timeline (in PT)
4:10pm New release went out for a small percentage of users.
4:15pm Our monitoring picked up increased errors in ending calls.
4:34pm Release was auto rolled back due to increased errors and incident was resolved.
Impact
Calls ended with unknown-error
End-of-call reports were missing
Root cause:
A missing DB migration caused failures when fetching data at the end of calls.
Remediation:
Add a CI check to make sure we don't release code when ...
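As a sketch of what such a CI guard could look like (not Vapi's actual check), assuming migration files live in a migrations/ directory and applied migrations are recorded in a schema_migrations table, both of which are assumptions:

```typescript
// Hypothetical CI guard: fail the pipeline if migration files on disk have not
// been applied to the target database. Directory name, table name, and
// connection handling are illustrative assumptions.
import { readdirSync } from "node:fs";
import { Client } from "pg";

async function main(): Promise<void> {
  const onDisk = readdirSync("migrations").filter((f) => f.endsWith(".sql")).sort();

  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  const { rows } = await client.query<{ name: string }>("SELECT name FROM schema_migrations");
  await client.end();

  const applied = new Set(rows.map((r) => r.name));
  const pending = onDisk.filter((f) => !applied.has(f));
  if (pending.length > 0) {
    console.error("Refusing to release: unapplied migrations:", pending);
    process.exit(1);
  }
}

main().catch((err) => { console.error(err); process.exit(1); });
```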
1 previous update
Increased error in calls
Resolved Mar 15 at 12:37pm PDT
The issue has subsided; we experienced a brief spike in call initiations and didn't scale up fast enough.
In the immediate term, we're vertically scaling our call worker instances. In the near term, we're rolling out our new call worker architecture for rapid scaling
1 previous update
SIP call failures to connect
Resolved Apr 07 at 09:56pm PDT
RCA for SIP Degradation for sip.vapi.ai
TL;DR
The Vapi SIP service (sip.vapi.ai) was intermittently throwing errors and failing to connect calls. We had some major flaws in our SIP infrastructure, which were resolved by rearchitecting it from scratch.
Impact
- Calls to the Vapi SIP URI or Vapi phone numbers were failing to connect with 480/487/503 errors
- Inbound calls to Vapi were connecting but producing no audio, eventually causing silence timeouts or customer-did-not...
5 previous updates
Vapi calls not connecting due to lack of workers
Resolved Mar 17 at 08:56pm PDT
TL;DR
Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand
Timeline in PST
- 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack
- 7:30am - The issue is resolved as call workers scaled to meet demand
Root Cause
- Call workers did not scale fast enough on the weekly cluster
Impact
There were 34 instances of vapifault-tr...
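For context on the kind of scaling decision involved, here is a hedged sketch of deriving a desired worker count from queue depth; the queue key, per-worker capacity, and bounds are illustrative assumptions, not the actual scaler's logic.

```typescript
// Hypothetical autoscaling heuristic: size the worker fleet from the number of
// calls waiting in a Redis-backed queue. All names and numbers are assumptions.
import Redis from "ioredis";

const CALLS_PER_WORKER = 5;   // assumed per-worker concurrency
const MIN_WORKERS = 10;       // floor so a burst never starts from zero
const MAX_WORKERS = 500;      // safety ceiling

export async function desiredWorkerCount(redis: Redis, queueKey: string): Promise<number> {
  const pendingCalls = await redis.llen(queueKey); // depth of the pending-call list
  const wanted = Math.ceil(pendingCalls / CALLS_PER_WORKER);
  return Math.min(MAX_WORKERS, Math.max(MIN_WORKERS, wanted));
}
```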
3 previous updates
Investigating GET /call/:id timeouts
Resolved Mar 14 at 05:00pm PDT
We are working with impacted customers to investigate but have not seen this issue occurring regularly.
1 previous update
Calls are intermittently ending abruptly
Resolved Mar 14 at 04:01pm PDT
TL;DR
Calls ended abruptly because call-workers were restarted due to high memory usage (OOMKilled).
Timeline in PST
- March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason.
- 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit.
- 3:29pm: Confirmation received that another customer experienced the same issue.
- 4:30pm: Changes implemented to increase memory request and limit on call-workers.
- March 14th 12:27pm: Changes dep...
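Alongside raising the memory request and limit, an in-process guard can surface memory pressure before the OOM killer fires; here is a sketch using Node's process.memoryUsage, with an assumed 85% threshold (the 2 GiB limit comes from the RCA above).

```typescript
// Hypothetical in-process guard for the OOMKilled failure mode: warn (or drain
// the worker) once resident memory approaches the container limit.
const LIMIT_BYTES = 2 * 1024 ** 3;   // container memory limit from the RCA (2 GiB)
const WARN_AT = 0.85 * LIMIT_BYTES;  // assumed threshold: start draining before the kernel kills us

setInterval(() => {
  const { rss } = process.memoryUsage(); // resident set size of this call-worker
  if (rss > WARN_AT) {
    console.warn(
      `call-worker rss=${(rss / 1024 ** 2).toFixed(0)}MiB, approaching limit; ` +
        "stop accepting new calls and drain",
    );
    // e.g. flip a readiness flag here so new calls are routed elsewhere
  }
}, 10_000);
```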
1 previous update
sip.vapi.ai degradation
Resolved Mar 17 at 09:00pm PDT
RCA: SIP 480 Failures (March 13-14)
Summary
Between March 13-14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment.
Incident Timeline
(All times in PT)
March 13, 2025
07:00 AM – SIP...
2 previous updates
Dashboard is unavailable.
Resolved Mar 12 at 02:00pm PDT
TL;DR
During the response to the concurrent Deepgram bug, it was noticed that an unstable Lodash change committed to main had leaked into the production dashboard.
Timeline in PST
- 12:10am - Breaking changes were introduced to the main branch
- 12:42am - Afterwards, another commit was merged to main
- This merge incorrectly triggered a deployment of the production dashboard
- 12:45am - A rollback in Cloudflare Pages was completed, restoring service
- 12:47am - Shortly afterward, a fix w...
1 previous update
We are seeing degraded service from Deepgram
Resolved Mar 11 at 12:59am PDT
TL;DR
An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.
Timeline in PST
- 12:03am - Rollout to prod1 containing the offending change is started
- 12:13am - Rollout to prod1 is complete
- 12:25am - A huddle in #eng-scale is started
- 12:43am - Rollback to prod3 is started
- 12:55am - Rollback to prod3 is complete
Root Cause
- An application-level bug related...
1 previous update
Increased call start errors due to Vapi fault transport errors + Twilio timeouts
Resolved Mar 10 at 07:18pm PDT
RCA: vapifault-transport-never-connected errors caused call failures
Date: 03/10/2025
Summary:
A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.
Timeline:
1. 5:50am A few calls begin having trouble starting due to vapifau...
2 previous updates
Increased Twilio errors causing 31902 & 31920 websocket connection issues. In...
Resolved Mar 07 at 02:00pm PST
We have rolled back the faulty release which caused this issue. We are monitoring the situation now.
1 previous update
Vonage inbound calling is degraded
Resolved Mar 06 at 02:39pm PST
The issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate the new schema.
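As an illustration of this kind of fix (not the actual Vapi code), here is a sketch of accepting two payload shapes with zod and normalizing them to one internal representation; the field names are hypothetical.

```typescript
// Hypothetical sketch of accepting both the expected and a newly observed
// inbound payload shape instead of rejecting the request outright. Field names
// ("to", "from", "uuid", "conversation_uuid") are illustrative assumptions.
import { z } from "zod";

const LegacyInbound = z.object({
  to: z.string(),
  from: z.string(),
  uuid: z.string(),
});

const NewInbound = z.object({
  to: z.object({ number: z.string() }),
  from: z.object({ number: z.string() }),
  conversation_uuid: z.string(),
});

// Accept either shape, then normalize to one internal representation.
const InboundCall = z.union([LegacyInbound, NewInbound]).transform((p) =>
  "uuid" in p
    ? { to: p.to, from: p.from, callId: p.uuid }
    : { to: p.to.number, from: p.from.number, callId: p.conversation_uuid },
);

export function parseInbound(body: unknown) {
  return InboundCall.parse(body); // throws a descriptive error on truly unknown shapes
}
```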
Signups temporarily unavailable
Resolved Mar 05 at 10:00pm PST
The API bug was reverted and we confirmed service restoration
Weekly cluster at capacity limits
Resolved Mar 05 at 12:04pm PST
We are seeing calls go through fine now, and are still keeping an eye out
1 previous update
Signups and credential creation are not working
Resolved Feb 26 at 09:08pm PST
Root Cause Analysis (RCA) for the Incident – Timeline in PT
TL;DR
A recent security fix by Supabase impacted database projects using pg_net 0.8.0, causing failures in the POST /credential endpoint and new user signups.
Timeline
March 5th, 3:00 AM PT: Failures in the POST /credential endpoint and new user signups begin.
3:26 AM PT: On-call engineer observes a surge in errors related to POST /credential, including an unusual PostgresError.
3:32 AM PT: Team...
4 previous updates
Assembly AI transcriber calls are facing degradation.
Resolved Feb 22 at 06:17am PST
It is resolved now. It was due to an account-related problem, which has been fixed. We will be taking steps to make sure it doesn't happen again.
1 previous update
API returning 413 (payload too large) due to networking misconfiguration
Resolved Feb 21 at 11:24am PST
TL;DR
A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors. API requests to POST /call, /assistant, and /file were impacted.
Timeline
- February 20th 9:54pm PST: A change to the cluster-router is released and traffic is cut over to prod1.
- 10:19pm PST: An increased number of 413 responses from Cloudflare begins appearing in Datadog logs.
- February 21st ~8:50am: Users in Discord flag requests failing with 413 errors.
- **...
Deepgram is failing to send transcription intermittently
Resolved Feb 21 at 12:57am PST
Deepgram has resolved the incident on their side. Back to normal.
https://status.deepgram.com/incidents/wr5whbzk45mg
2 previous updates
Elevenlabs rate limiting and high latency
Resolved Feb 20 at 09:11am PST
ElevenLabs has confirmed that the problem has been fixed. No failures in the last 10 minutes. Resolving the incident.
Here is the ElevenLabs report on the incident: https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4
1 previous update
ElevenLabs Rate Limiting
Resolved Feb 19 at 11:43am PST
ElevenLabs is imposing rate limits, which will impact Vapi users who have it configured as their voice model. We are working to resolve this issue, but users can restore service by switching to Cartesia or using their own API key.
API is degraded
Resolved Jan 30 at 03:44am PST
TL;DR
The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service.
Timeline
2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes.
2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing.
3:27AM...
1 previous update
API is down
Resolved Jan 29 at 09:24am PST
TL;DR
A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service.
Timeline
8:08am PST, Jan 29: Monitoring detects Postgres errors.
8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscri...
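A minimal sketch of the workaround described in the TL;DR, assuming node-postgres and hypothetical environment variable names: point the pool at the database's direct endpoint instead of the pooler when the pooler is unhealthy.

```typescript
// Hypothetical sketch of the "bypass the pooler" workaround. Environment
// variable names and pool sizing are assumptions for illustration.
import { Pool } from "pg";

// DIRECT_DATABASE_URL: direct Postgres endpoint (e.g. port 5432)
// POOLED_DATABASE_URL: connection-pooler endpoint (e.g. Supavisor on port 6543)
const usePooler = process.env.BYPASS_POOLER !== "true";

export const db = new Pool({
  connectionString: usePooler
    ? process.env.POOLED_DATABASE_URL
    : process.env.DIRECT_DATABASE_URL,
  // Keep the pool small when connecting directly: every connection now lands
  // on Postgres itself rather than being multiplexed by the pooler.
  max: usePooler ? 20 : 5,
});

// Startup health check mirroring the one that kept pods from booting:
// a trivial query proves the chosen endpoint actually accepts connections.
export async function assertDatabaseReachable(): Promise<void> {
  await db.query("SELECT 1");
}
```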
3 previous updates
Updates to DB are failing
Resolved Jan 21 at 05:23am PST
TL;DR
A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service.
Timeline
5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only.
5:05am: Write operations began failing.
5:18am: The API went down due to accumulated errors.
~5:23am: The team initiated a database restart.
5:...
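One way such a regression could be caught earlier, sketched with node-postgres (the monitoring shown here is an assumption, not Vapi's actual tooling): periodically check whether the database is answering in read-only mode.

```typescript
// Hypothetical monitor for the failure mode above: alert as soon as the
// database starts answering in read-only mode instead of waiting for write
// errors to pile up. Connection handling and alerting are assumptions.
import { Client } from "pg";

export async function isReadOnly(connectionString: string): Promise<boolean> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const { rows } = await client.query(
      "SELECT current_setting('default_transaction_read_only') AS ro",
    );
    return rows[0].ro === "on";
  } finally {
    await client.end();
  }
}

// Example usage (pageOnCall is a placeholder): poll every 30 seconds.
// setInterval(async () => {
//   if (await isReadOnly(process.env.DATABASE_URL!)) pageOnCall("DB is read-only");
// }, 30_000);
```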
1 previous update
Calls not connecting for `weekly` channel
Resolved Jan 13 at 08:49am PST
TL;DR: Scaler failed and we didn't have enough workers
Root Cause
During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment.
Timeline
Jan 11, 5:12 PM: Deploy started
Jan 13, 6:00 AM: Calls started failing due to scaling issues
Jan 13, 8:45 AM: Resolved by moving traffic to daily
Ja...
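A sketch of the hardening this root cause suggests, assuming the scaler reaches Redis through ioredis (an assumption): connect by a stable DNS name and let the client reconnect, rather than pinning a resolved IP.

```typescript
// Hypothetical sketch: make the scaler resilient to Redis pods coming back
// with new IPs. Connect via a stable service DNS name and let the client
// retry; each reconnect attempt resolves DNS again, so a changed IP is
// picked up automatically.
import Redis from "ioredis";

export function createScalerRedis(): Redis {
  return new Redis({
    host: "redis-queue.internal", // assumed placeholder for a stable DNS name
    port: 6379,
    // Back off and keep retrying instead of giving up on the queue.
    retryStrategy: (attempt) => Math.min(attempt * 200, 5_000),
    maxRetriesPerRequest: null, // queue reads should wait for recovery, not fail fast
  });
}
```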
1 previous update