Previous incidents
Increased errors in calls
Resolved Mar 15 at 12:37pm PDT
The issue has subsided. We experienced a brief spike in call initiations and did not scale up fast enough.
In the immediate term, we are vertically scaling our call worker instances. In the near term, we are rolling out our new call worker architecture for rapid scaling.
Vapi calls not connecting due to a shortage of call workers
Resolved Mar 17 at 08:56pm PDT
TL;DR
Weekly Cluster customers saw vapifault-transport-never-connected errors because call workers did not scale fast enough to meet demand.
Timeline in PDT
- 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack
- 7:30am - The issue is resolved as call workers scaled to meet demand
Root Cause
- Call workers did not scale fast enough on the weekly cluster
Impact
There were 34 instances of vapifault-tr...
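For customers checking whether a specific call was affected, the failure is visible on the call object itself. The sketch below is illustrative only: it assumes the GET https://api.vapi.ai/call/:id endpoint, a Bearer API key, and an endedReason field on the returned call, all of which should be confirmed against the current API reference.

```typescript
// Illustrative sketch: flag calls that failed with a transport error.
// Assumes GET https://api.vapi.ai/call/:id returns a JSON call object
// with an `endedReason` field; verify against the current API docs.

const VAPI_API_KEY = process.env.VAPI_API_KEY!;

interface VapiCall {
  id: string;
  status?: string;
  endedReason?: string;
}

async function getCall(callId: string): Promise<VapiCall> {
  const res = await fetch(`https://api.vapi.ai/call/${callId}`, {
    headers: { Authorization: `Bearer ${VAPI_API_KEY}` },
  });
  if (!res.ok) throw new Error(`GET /call/${callId} failed: ${res.status}`);
  return (await res.json()) as VapiCall;
}

// True when the call ended because the transport never connected,
// i.e. the failure mode described in this incident.
async function neverConnected(callId: string): Promise<boolean> {
  const call = await getCall(callId);
  return call.endedReason === "vapifault-transport-never-connected";
}
```

Calls flagged this way during the incident window can simply be re-created now that capacity has recovered.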
SIP call failures to connect
Updated Mar 17 at 02:30pm PDT
Degrading sip.vapi.ai instead of api.vapi.ai, since only the SIP component is currently impacted.
Investigating GET /call/:id timeouts
Resolved Mar 14 at 05:00pm PDT
We are working with impacted customers to investigate but have not seen this issue occurring regularly.
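For impacted customers, one interim mitigation on the client side is to bound the request yourself and retry once. This is a hedged sketch, assuming Node 18+ (global fetch) and a Bearer API key; tune the timeout and retry budget to your own latency requirements.

```typescript
// Sketch: bound GET /call/:id with a client-side timeout and retry once
// with backoff. Assumes Node 18+ (global fetch) and a Bearer API key.

async function getCallWithTimeout(
  callId: string,
  apiKey: string,
  timeoutMs = 10_000,
  retries = 1,
): Promise<unknown> {
  for (let attempt = 0; ; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetch(`https://api.vapi.ai/call/${callId}`, {
        headers: { Authorization: `Bearer ${apiKey}` },
        signal: controller.signal,
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.json();
    } catch (err) {
      if (attempt >= retries) throw err;
      // simple linear backoff before the retry
      await new Promise((r) => setTimeout(r, 1_000 * (attempt + 1)));
    } finally {
      clearTimeout(timer);
    }
  }
}
```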
Calls are intermittently ending abruptly
Resolved Mar 14 at 04:01pm PDT
TL;DR
Calls ended abruptly because call-workers were restarting due to high memory usage (OOMKilled).
Timeline in PDT
- March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason.
- 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit.
- 3:29pm: Confirmation received that another customer experienced the same issue.
- 4:30pm: Changes implemented to increase memory request and limit on call-workers.
- March 14th 12:27pm: Changes dep...
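As an illustration of the failure mode (not a description of our internal implementation), a Node service can watch its resident set size against its container limit and log a warning before the kernel OOM-kills it. The 2 GiB figure mirrors the limit mentioned in the timeline above.

```typescript
// Illustrative sketch only: log a warning when resident memory approaches
// the container limit, so an approaching OOM kill is visible in logs
// before the process is restarted. The 2 GiB limit mirrors this incident.

const MEMORY_LIMIT_BYTES = 2 * 1024 ** 3; // 2 GiB container limit
const WARN_RATIO = 0.9;                   // warn at 90% of the limit

setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  if (rss > MEMORY_LIMIT_BYTES * WARN_RATIO) {
    console.warn(
      `memory pressure: rss=${(rss / 1024 ** 2).toFixed(0)}MiB ` +
        `heapUsed=${(heapUsed / 1024 ** 2).toFixed(0)}MiB ` +
        `limit=${(MEMORY_LIMIT_BYTES / 1024 ** 2).toFixed(0)}MiB`,
    );
  }
}, 10_000).unref(); // do not keep the process alive just for the watchdog
```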
sip.vapi.ai degradation
Resolved Mar 17 at 09:00pm PDT
RCA: SIP 480 Failures (March 13-14)
Summary
Between March 13-14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment.
Incident Timeline
(All times in PT)
March 13, 2025
07:00 AM – SIP...
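As a hedged illustration of how this class of failure can be detected from the outside (again, not a description of our internal monitoring), a periodic SIP OPTIONS probe can confirm the SBC is still answering. The sketch assumes sip.vapi.ai listens on the default SIP port 5060/UDP; the local host names and tags are placeholders.

```typescript
// Illustrative sketch: send a SIP OPTIONS request over UDP and report
// whether any response arrives within a deadline. Assumes the SBC
// listens on 5060/UDP; local host names and tags are placeholders.

import { createSocket } from "node:dgram";
import { randomBytes } from "node:crypto";

function sipOptionsProbe(host: string, port = 5060, timeoutMs = 5_000): Promise<boolean> {
  const id = randomBytes(6).toString("hex");
  const msg = [
    `OPTIONS sip:${host} SIP/2.0`,
    `Via: SIP/2.0/UDP probe.invalid;branch=z9hG4bK${id}`,
    `Max-Forwards: 70`,
    `From: <sip:probe@probe.invalid>;tag=${id}`,
    `To: <sip:${host}>`,
    `Call-ID: ${id}@probe.invalid`,
    `CSeq: 1 OPTIONS`,
    `Content-Length: 0`,
    ``,
    ``,
  ].join("\r\n");

  return new Promise((resolve) => {
    const socket = createSocket("udp4");
    let settled = false;
    const settle = (ok: boolean) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.close();
      resolve(ok);
    };
    const timer = setTimeout(() => settle(false), timeoutMs);
    socket.once("message", () => settle(true)); // any SIP response counts
    socket.once("error", () => settle(false));
    socket.send(msg, port, host, (err) => {
      if (err) settle(false);
    });
  });
}

// Example: alert if the SBC stops answering OPTIONS.
sipOptionsProbe("sip.vapi.ai").then((ok) => console.log(ok ? "SBC responding" : "no SIP response"));
```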
Dashboard is unavailable.
Resolved Mar 12 at 02:00pm PDT
TL;DR
While responding to the concurrent Deepgram bug, we noticed that an unstable Lodash change committed to main had leaked into the production dashboard.
Timeline in PDT
- 12:10am - Breaking changes were introduced to the main branch
- 12:42am - Afterwards, another commit was merged to main
- This merge incorrectly triggered a deployment of the production dashboard
- 12:45am - A rollback in Cloudflare Pages was completed, restoring service
- 12:47am - Shortly afterward, a fix w...
We are seeing degraded service from Deepgram
Resolved Mar 11 at 12:59am PDT
TL;DR
An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.
Timeline in PDT
- 12:03am - Rollout to prod1 containing the offending change is started
- 12:13am - Rollout to prod1 is complete
- 12:25am - A huddle in #eng-scale is started
- 12:43am - Rollback to prod3 is started
- 12:55am - Rollback to prod3 is complete
Root Cause
- An application-level bug related...
Increased call start errors due to Vapi fault transport errors + Twilio timeouts
Resolved Mar 10 at 07:18pm PDT
RCA: vapifault-transport-never-connected errors caused call failures
Date: 03/10/2025
Summary:
A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.
Timeline:
1. 5:50am - A few calls begin failing to start due to vapifau...
Increased Twilio errors causing 31902 & 31920 websocket connection issues. In...
Resolved Mar 07 at 02:00pm PST
We have rolled back the faulty release that caused this issue and are monitoring the situation.
Vonage inbound calling is degraded
Resolved Mar 06 at 02:39pm PST
The issue was caused by Vonage sending an unexpected payload schema, which made validation fail at the API level. We deployed a fix to accommodate the schema.
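As a general illustration of the fix pattern (a sketch, not our actual validation code), a webhook payload validator can tolerate provider schema drift by validating only the fields it depends on, marking the rest optional, and passing unknown fields through. The example uses zod, and the VonageInboundEvent shape is invented for illustration.

```typescript
// Illustrative sketch using zod: validate only the fields the handler
// actually needs, mark the rest optional, and let unknown fields pass
// through so a provider-side schema change does not reject the call.
// The VonageInboundEvent shape here is invented for illustration.

import { z } from "zod";

const VonageInboundEvent = z
  .object({
    from: z.string(),            // caller number: required by the handler
    to: z.string(),              // callee number: required by the handler
    uuid: z.string().optional(), // useful but not essential
  })
  .passthrough();                // tolerate fields we do not model

export function parseInboundEvent(body: unknown) {
  const result = VonageInboundEvent.safeParse(body);
  if (!result.success) {
    // Reject only when fields the handler truly depends on are missing.
    throw new Error(`invalid inbound payload: ${result.error.message}`);
  }
  return result.data;
}
```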
Signups temporarily unavailable
Resolved Mar 05 at 10:00pm PST
The API bug was reverted, and we confirmed service restoration.
Weekly cluster at capacity limits
Resolved Mar 05 at 12:04pm PST
We are seeing calls go through fine now and are still keeping an eye out.