Previous incidents

May 2025
No incidents reported
April 2025
No incidents reported
March 2025
Mar 15, 2025
1 incident

Increased errors in calls

Degraded

Resolved Mar 15 at 12:37pm PDT

The issue has subsided. We experienced a brief spike in call initiations and didn't scale up fast enough.

In the immediate term, we're vertically scaling our call worker instances. In the near term, we're rolling out our new call worker architecture for rapid scaling.

1 previous update

Mar 14, 2025
4 incidents

Vapi calls not connecting due to lack of workers

Degraded

Resolved Mar 17 at 08:56pm PDT

TL;DR

Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand.

Timeline in PDT

  • 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack
  • 7:30am - The issue is resolved as call workers scaled to meet demand

Root Cause

  • Call workers did not scale fast enough on the weekly cluster

Impact

There were 34 instances of vapifault-tr...

3 previous updates

SIP calls failing to connect

Degraded

Updated Mar 17 at 02:30pm PDT

Marking sip.vapi.ai as degraded instead of api.vapi.ai, since only the SIP portion of the service is currently impacted.

3 previous updates

Investigating GET /call/:id timeouts

Degraded

Resolved Mar 14 at 05:00pm PDT

We are working with impacted customers to investigate but have not seen this issue occurring regularly.

1 previous update

Calls are intermittently ending abruptly

Degraded

Resolved Mar 14 at 04:01pm PDT

TL;DR

Calls ended abruptly because call-workers were restarted due to high memory usage (OOMKilled).

Timeline in PDT

  • March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason.
  • 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit.
  • 3:29pm: Confirmation received that another customer experienced the same issue.
  • 4:30pm: Changes implemented to increase memory request and limit on call-workers.
  • March 14th 12:27pm: Changes dep...

1 previous update

Mar 13, 2025
1 incident

sip.vapi.ai degradation

Degraded

Resolved Mar 17 at 09:00pm PDT

RCA: SIP 480 Failures (March 13-14)

Summary
Between March 13 and 14, SIP calls intermittently failed with recurring 480 errors. The issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. Restarting the SBC service resolved the issue as a temporary mitigation. A long-term fix is planned: transitioning the SBC to a more stable Auto Scaling Group (ASG) deployment.

Incident Timeline
(All times in PT)

March 13, 2025
07:00 AM – SIP...

2 previous updates

Mar 12, 2025
1 incident

Dashboard is unavailable.

Degraded

Resolved Mar 12 at 02:00pm PDT

TL;DR

While responding to the concurrent Deepgram bug, we noticed that an unstable Lodash change committed to main had leaked into the production dashboard.

Timeline in PDT

  • 12:10am - Breaking changes were introduced to the main branch
  • 12:42am - Afterwards, another commit was merged to main
    • This merge incorrectly triggered a deployment of the production dashboard
  • 12:45am - A rollback in Cloudflare Pages was completed, restoring service
  • 12:47am - Shortly afterward, a fix w...

1 previous update

Mar 11, 2025
1 incident

We are seeing degraded service from Deepgram

Degraded

Resolved Mar 11 at 12:59am PDT

TL;DR

An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.

Timeline in PDT

  • 12:03am - Rollout to prod1 containing the offending change is started
  • 12:13am - Rollout to prod1 is complete
  • 12:25am - A huddle in #eng-scale is started
  • 12:43am - Rollback to prod3 is started
  • 12:55am - Rollback to prod3 is complete

Root Cause

  • An application-level bug related...

1 previous update

Mar 10, 2025
1 incident

Increased call start errors due to Vapi fault transport errors + Twilio timeouts

Degraded

Resolved Mar 10 at 07:18pm PDT

RCA: vapifault-transport-never-connected errors caused call failures
Date: 03/10/2025

Summary:
A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.
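
For illustration only, the sketch below shows the general shape of a memory-threshold restart check like the one described above. The threshold value, sampling interval, and restart behavior are assumptions for the sketch, not our actual configuration; raising the threshold reduces how aggressively workers are recycled.

```typescript
// Hypothetical sketch: the 2 GiB threshold and 30s interval are illustrative,
// not our production configuration.
const MEMORY_LIMIT_BYTES = 2 * 1024 * 1024 * 1024; // assumed restart threshold
const CHECK_INTERVAL_MS = 30_000;                   // assumed sampling interval

function checkMemoryAndMaybeRestart(): void {
  const rssBytes = process.memoryUsage().rss; // resident set size of this worker
  if (rssBytes > MEMORY_LIMIT_BYTES) {
    console.error(
      `rss ${rssBytes} exceeded ${MEMORY_LIMIT_BYTES}; exiting so the supervisor restarts this worker`
    );
    process.exit(1); // non-zero exit lets the orchestrator recycle the process
  }
}

setInterval(checkMemoryAndMaybeRestart, CHECK_INTERVAL_MS);
```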

Timeline:
1. 5:50am - A few calls start facing issues starting due to vapifau...

2 previous updates

Mar 07, 2025
1 incident

Increased Twilio errors causing 31902 & 31920 websocket connection issues. In...

Degraded

Resolved Mar 07 at 02:00pm PST

We have rolled back the faulty release which caused this issue. We are monitoring the situation now.

1 previous update

Mar 06, 2025
1 incident

Vonage inbound calling is degraded

Resolved Mar 06 at 02:39pm PST

The issue was caused by Vonage sending a payload with an unexpected schema, which caused validation to fail at the API level. We deployed a fix to accommodate the new schema.
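
For illustration, a minimal sketch of the kind of validation change involved is below; the field names and the nested `call.id` variant are hypothetical and are not Vonage's actual webhook schema.

```typescript
// Hypothetical sketch: field names are illustrative, not Vonage's actual webhook schema.
interface InboundCall {
  from: string;
  to: string;
  callId: string;
}

// Accept both the payload shape we originally validated against and a newer
// variant, normalizing them into one internal type instead of rejecting the call.
function parseInboundCall(payload: unknown): InboundCall | null {
  if (typeof payload !== "object" || payload === null) return null;
  const p = payload as Record<string, unknown>;
  const call = (typeof p.call === "object" && p.call !== null
    ? p.call
    : {}) as Record<string, unknown>;

  const from = typeof p.from === "string" ? p.from : undefined;
  const to = typeof p.to === "string" ? p.to : undefined;
  // Assume older payloads carried the id at the top level and the newer variant nests it.
  const callId =
    typeof p.uuid === "string" ? p.uuid :
    typeof call.id === "string" ? call.id : undefined;

  if (!from || !to || !callId) return null; // still reject truly malformed payloads
  return { from, to, callId };
}
```

Normalizing at the edge like this keeps downstream call handling unchanged when a provider adds or moves fields.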

Mar 05, 2025
2 incidents

Signups temporarily unavailable

Resolved Mar 05 at 10:00pm PST

The API bug was reverted, and we confirmed that service was restored.

Weekly cluster at capacity limits

Degraded

Resolved Mar 05 at 12:04pm PST

We are seeing calls go through fine now and are continuing to keep an eye on things.

1 previous update