Previous incidents

May 2025
May 18, 2025
1 incident

Cartesia voices are degraded

Degraded

Resolved May 18 at 01:56pm PDT

Everything is functional. We're still working with Cartesia to get to the bottom of this. We'll change the status back to degraded if the issue arises again during the investigation.

2 previous updates

May 13, 2025
4 incidents

Vapifault Worker Timeouts

Resolved May 13 at 10:31am PDT

RCA: Vapifault Worker Timeouts

TL;DR

On May 12, approximately 335 concurrent calls were either web-based or longer than 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not absorb the additional call load, and Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeouts. The following day, the issue recurred due to the prescaling limit...
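
For illustration only (not Vapi's actual scaling code): a minimal TypeScript sketch of the kind of capacity check described above, comparing the count of web-based or long-running calls against a prescaled worker limit and flagging when headroom runs out. All names and thresholds are hypothetical; the 250/335 figures come from the RCA.

```ts
// Hypothetical capacity check: compare calls that must stay on dedicated
// call-worker pods (web-based or >15 min) against the prescaled pod count.
interface ActiveCall {
  transport: "web" | "phone";
  startedAt: Date;
}

const PRESCALED_WORKERS = 250;        // prescaled limit cited in the RCA
const LONG_CALL_MS = 15 * 60 * 1000;  // calls longer than 15 minutes

function workerBoundCalls(calls: ActiveCall[], now = new Date()): number {
  return calls.filter(
    (c) => c.transport === "web" || now.getTime() - c.startedAt.getTime() > LONG_CALL_MS
  ).length;
}

export function checkWorkerHeadroom(calls: ActiveCall[]): void {
  const bound = workerBoundCalls(calls);
  if (bound > PRESCALED_WORKERS) {
    // In the May 12 incident, ~335 such calls exceeded the 250 prescaled workers.
    console.warn(`worker-bound calls (${bound}) exceed prescaled limit (${PRESCALED_WORKERS})`);
  }
}
```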

providerfault-transport errors

Resolved May 13 at 10:29am PDT

RCA: Providerfault-transport-never-connected

Summary

During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume.

Timeline of Events (PT)

  • 10:26 AM: Significant spike in inbound call volume.
  • 10:26 – 10:40 AM: Intermittent H...

SIP calls abruptly closing after 30 seconds

Resolved May 13 at 10:27am PDT

RCA: SIP Calls Ending Abruptly

TL;DR

A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts.

Timeline (PT)

  • May 12, ~9:00 pm: SIP node rotated and Elastic IP reassigned, but SI...
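
To illustrate the failure mode only (a sketch with hypothetical names, not the actual SIP service code): a service that resolves its advertised public IP once at startup keeps advertising a stale address after the Elastic IP is reassigned, until the process is restarted. The metadata lookup below assumes an EC2 IMDSv1-style endpoint.

```ts
// Hypothetical sketch: the advertised SIP contact IP is resolved once at
// process start. If the node's Elastic IP is reassigned later, this cached
// value goes stale (or falls back to a private IP) until the service restarts.
let advertisedIp: string | undefined;

async function resolvePublicIp(): Promise<string> {
  // IMDSv1-style lookup; real deployments may require IMDSv2 tokens.
  const res = await fetch("http://169.254.169.254/latest/meta-data/public-ipv4");
  if (!res.ok) throw new Error("no public IPv4 associated with this instance");
  return (await res.text()).trim();
}

export async function startSipService(): Promise<void> {
  advertisedIp = await resolvePublicIp(); // cached for the lifetime of the process
  console.log(`advertising SIP contact IP ${advertisedIp}`);
  // ... bind the SIP listener and build Via/Contact headers from advertisedIp ...
}
```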

Stale data for Weekly users

Resolved May 13 at 10:22am PDT

RCA: Phone Number Caching Error in Weekly Environment

TL;DR

Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers.

Timeline (PT)

  • May 10, 1:26 am: Caching re-enabled for users in daily envir...
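
For illustration (hypothetical names, not the actual caching layer): a cache keyed on organization ID should refuse to cache when the org ID is missing, which is the guard the inbound phone call path lacked in this incident.

```ts
// Hypothetical sketch of an org-scoped cache guard for phone-number lookups.
const phoneNumberCache = new Map<string, { serverUrl: string }>();

export async function getPhoneNumberConfig(
  phoneNumber: string,
  orgId: string | undefined,
  loadFromDb: (phoneNumber: string) => Promise<{ serverUrl: string }>
): Promise<{ serverUrl: string }> {
  // Without an org ID we cannot determine the org's channel (e.g. weekly vs daily),
  // so skip the cache entirely instead of caching under an ambiguous key.
  if (!orgId) {
    return loadFromDb(phoneNumber);
  }

  const key = `${orgId}:${phoneNumber}`;
  const cached = phoneNumberCache.get(key);
  if (cached) return cached;

  const fresh = await loadFromDb(phoneNumber);
  phoneNumberCache.set(key, fresh);
  return fresh;
}
```
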
May 10, 2025
1 incident

Voice issues due to 11labs quota

Resolved May 10 at 10:34am PDT

RCA: 11Labs Voice Issue

TL;DR

Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (vapifault-eleven-labs-quota-exceeded).

Timeline of Events (PT)

  • 12:04 PM: Calls begin failing due to 11Labs quota being exceeded.
  • 12:16 PM: Customer reports the issue as a production outage.
  • 12:24 PM: Contacted 11Labs support regarding quota exhaustion.
  • 12:25 PM: 11Labs support recommends enabling usage-based billing.
  • 12:26 PM: Usag...
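
As a sketch only (the error code comes from the RCA; the helper, status check, and provider call are hypothetical): callers of the TTS provider can map a quota-exceeded response to an explicit error and page on-call, rather than letting calls fail quietly.

```ts
// Hypothetical sketch: surface a TTS provider quota failure as an explicit
// error and page on-call, instead of letting calls fail quietly.
class QuotaExceededError extends Error {}

function alertOnCall(signal: string): void {
  // Stand-in for a real paging integration.
  console.error(`ALERT: ${signal}`);
}

export async function synthesizeWithQuotaGuard(
  callProvider: (text: string) => Promise<Response>,
  text: string
): Promise<ArrayBuffer> {
  const res = await callProvider(text);
  if (res.status === 429) {
    // Assumed quota signal; the provider's actual status code/body should be confirmed.
    alertOnCall("vapifault-eleven-labs-quota-exceeded");
    throw new QuotaExceededError("TTS quota exhausted; consider usage-based billing");
  }
  if (!res.ok) throw new Error(`TTS provider error ${res.status}`);
  return res.arrayBuffer();
}
```
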
May 02, 2025
1 incident

API degradation

Degraded

Resolved May 02 at 06:43pm PDT

RCA for May 2nd: user error in manual rollout

Root cause:

  • User error in kicking off a manual rollout, made while unblocking a release
  • As a result, the load balancer was pointed at an invalid backend cluster

Timeline

  • 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout
  • 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release
  • 5:43pm PT: On-call was paged with issue in rollout manager, e...
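
Process fixes aside, here is a minimal sketch (hypothetical endpoint, helper names, and timeout) of the kind of pre-flight guard a manual rollout could run before repointing the load balancer at a backend cluster:

```ts
// Hypothetical pre-rollout guard: refuse to repoint the load balancer at a
// backend cluster that does not answer its health endpoint.
async function isBackendHealthy(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`, { signal: AbortSignal.timeout(3000) });
    return res.ok;
  } catch {
    return false;
  }
}

export async function repointLoadBalancer(
  targetClusterUrl: string,
  apply: () => Promise<void>
): Promise<void> {
  if (!(await isBackendHealthy(targetClusterUrl))) {
    throw new Error(`refusing rollout: ${targetClusterUrl} failed health check`);
  }
  await apply(); // the actual LB update runs only after the target verifies healthy
}
```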

2 previous updates

May 01, 2025
2 incidents

WebCalls from Dashboard are unavailable

Degraded

Resolved May 01 at 04:14pm PDT

We've applied a fix to the dashboard web call feature. The root cause was a misconfiguration in the authentication strategy used for this feature.

1 previous update

Issues with KYC verification

Degraded

Resolved May 01 at 04:10pm PDT

The team has applied a fix to the frontend issue. The root cause was a change related to authentication handlers. Additionally, our KYC vendors are extending the observability features in their SDK.

3 previous updates

April 2025
Apr 29, 2025
1 incident

Call Recordings May Fail For Some Users

Degraded

Resolved Apr 29 at 11:59pm PDT

We have resolved the issue and will upload an RCA by noon PST on 04/30.

TL;DR: Recordings weren't uploaded to object storage due to invalid credentials. We generated and applied new keys.
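
As an illustrative sketch (bucket name and environment variables are hypothetical), a startup check against Cloudflare R2's S3-compatible API can catch invalid credentials before recordings silently fail to upload:

```ts
// Hypothetical startup probe: fail fast if object-storage credentials are invalid.
import { S3Client, HeadBucketCommand } from "@aws-sdk/client-s3";

const r2 = new S3Client({
  region: "auto",
  endpoint: process.env.R2_ENDPOINT, // e.g. https://<account-id>.r2.cloudflarestorage.com
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID ?? "",
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY ?? "",
  },
});

export async function verifyRecordingBucketAccess(): Promise<void> {
  try {
    await r2.send(new HeadBucketCommand({ Bucket: "call-recordings" }));
  } catch (err) {
    // Invalid or rotated keys surface here instead of as failed uploads mid-call.
    throw new Error(`recording bucket credential check failed: ${String(err)}`);
  }
}
```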

1 previous update

Apr 22, 2025
1 incident

Increased 404 Errors Related to Phone Numbers Found

Degraded

Resolved Apr 22 at 04:39am PDT

We have identified and resolved the issue. We will post an RCA by noon PST.

TL;DR: Adding a new CIDR range to our SIP cluster prevented the servers from discovering each other.

1 previous update

Apr 04, 2025
2 incidents

Degradation in phone calls stuck in queued state.

Degraded

Resolved Apr 04 at 12:00pm PDT

Resolved the issue, blocked the offending user, and reviewed rate limits

1 previous update

Degradation in API

Degraded

Resolved Apr 04 at 09:00am PDT

API rollback completed and errors subsided

1 previous update

Apr 02, 2025
1 incident

Intermittent 503s in api

Degraded

Resolved Apr 03 at 11:00am PDT

The improvements we shipped reliably fixed the issue. The team has begun medium-term scalability work and is investigating long-term improvements.

2 previous updates

Apr 01, 2025
1 incident

Experiencing Anthropic rate limits on model calls

Degraded

Resolved Apr 01 at 08:04pm PDT

Anthropic rate limiting was resolved after our quota was raised

1 previous update

March 2025
Mar 31, 2025
1 incident

Increased 480 Temporarily Unavailable cases for SIP inbound

Degraded

Resolved Apr 07 at 10:00pm PDT

For the RCA, please check out https://status.vapi.ai/incident/528384?mp=true

3 previous updates

Mar 28, 2025
2 incidents

SIP calls failing intermittently

Degraded

Resolved Apr 07 at 10:00pm PDT

For the RCA, please check https://status.vapi.ai/incident/528384?mp=true

2 previous updates

Some SIP calls have longer reported call duration than reality

Resolved Mar 28 at 03:10pm PDT

Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively.

Mar 24, 2025
2 incidents

API degradation

Degraded

Resolved Mar 24 at 09:33pm PDT

TL;DR

After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (/health/ping_sentinel.sh). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forwar...
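
For illustration only (this is not the actual /health/ping_sentinel.sh script, and assumes ioredis): an equivalent liveness probe in TypeScript simply PINGs the local Sentinel port and exits non-zero on failure, which is the check that kept failing after the node cycle.

```ts
// Hypothetical TypeScript equivalent of a Sentinel liveness probe:
// PING the local Sentinel port and exit non-zero if it does not answer.
import Redis from "ioredis";

async function pingSentinel(): Promise<void> {
  const sentinel = new Redis({ host: "127.0.0.1", port: 26379, lazyConnect: true });
  try {
    await sentinel.connect();
    const reply = await sentinel.ping();
    if (reply !== "PONG") throw new Error(`unexpected reply: ${reply}`);
  } finally {
    sentinel.disconnect();
  }
}

pingSentinel().catch((err) => {
  console.error(err);
  process.exit(1); // the kubelet treats a non-zero exit as a failed liveness check
});
```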

1 previous update

Call worker degradation

Degraded

Resolved Mar 24 at 04:45pm PDT

The issue was mitigated via rollback. We're investigating and will update with an RCA

1 previous update

Mar 21, 2025
1 incident

Cloudflare R2 storage is degraded, causing call recording upload failures

Degraded

Resolved Mar 21 at 03:55pm PDT

Recording upload errors have recovered. We are continuing to monitor

2 previous updates

Mar 19, 2025
1 incident

Google Gemini Voicemail Detection is intermittently failing

Degraded

Resolved Mar 19 at 04:05pm PDT

TL;DR

It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identifi...

3 previous updates

Mar 18, 2025
1 incident

Intermittent errors during end calls.

Degraded

Resolved Mar 18 at 04:36pm PDT

Resolved now.

RCA:

Timeline (in PT)

  • 4:10pm: New release went out for a small percentage of users.
  • 4:15pm: Our monitoring picked up increased errors in ending calls.
  • 4:34pm: Release was auto-rolled back due to the increased errors and the incident was resolved.

Impact

  • Calls ended with unknown-error
  • End-of-call reports were missing

Root cause:

A missing DB migration caused issues in fetching data at the end of calls.

Remediation:

Add a CI check to make sure we don't release code when ...
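
To illustrate the remediation direction (a sketch under assumptions: migration files live in ./migrations, applied migrations are recorded by name in a migrations table, and pg is available; this is not the actual CI check), a CI step can fail the build when the target database has not applied migrations present in the repo:

```ts
// Hypothetical CI check: fail if the repo contains migration files that the
// target database has not applied yet (paths and table name are assumptions).
import { readdirSync } from "node:fs";
import { Client } from "pg";

async function assertNoPendingMigrations(): Promise<void> {
  const repoMigrations = readdirSync("./migrations").filter((f) => f.endsWith(".sql"));

  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  const { rows } = await db.query<{ name: string }>("SELECT name FROM migrations");
  await db.end();

  const applied = new Set(rows.map((r) => r.name));
  const pending = repoMigrations.filter((f) => !applied.has(f));
  if (pending.length > 0) {
    throw new Error(`pending migrations would break the release: ${pending.join(", ")}`);
  }
}

assertNoPendingMigrations().catch((err) => {
  console.error(err);
  process.exit(1); // non-zero exit blocks the release pipeline
});
```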

1 previous update

Mar 15, 2025
1 incident

Increased error in calls

Degraded

Resolved Mar 15 at 12:37pm PDT

The issue has subsided; we experienced a brief spike in call initiations and didn't scale up fast enough.

In the immediate term, we're vertically scaling our call worker instances. Near term, we're rolling out our new call worker architecture for rapid scaling.

1 previous update

Mar 14, 2025
4 incidents

SIP call failures to connect

Degraded

Resolved Apr 07 at 09:56pm PDT

RCA for SIP Degradation for sip.vapi.ai

TL;DR
Vapi's SIP service (sip.vapi.ai) was intermittently throwing errors and unable to connect calls. We had some major flaws in our SIP infrastructure, which were resolved by rearchitecting it from scratch.

Impact
- Calls to Vapi SIP URIs or Vapi phone numbers were failing to connect with 480/487/503 errors
- Inbound calls to Vapi were getting connected but no audio came through, eventually causing silence timeouts or customer-did-not...

5 previous updates

Vapi workers not connecting due to lack of workers

Degraded

Resolved Mar 17 at 08:56pm PDT

TL;DR

Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand

Timeline in PST

  • 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack
  • 7:30am - The issue is resolved as call workers scaled to meet demand

Root Cause

  • Call workers did not scale fast enough on the weekly cluster

Impact

There were 34 instances of vapifault-tr...

3 previous updates

Investigating GET /call/:id timeouts

Degraded

Resolved Mar 14 at 05:00pm PDT

We are working with impacted customers to investigate but have not seen this issue occurring regularly.

1 previous update

Calls are intermittently ending abruptly

Degraded

Resolved Mar 14 at 04:01pm PDT

TL;DR

Calls ended abruptly because call-workers were restarting due to high memory usage (OOMKilled).

Timeline in PST

  • March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason.
  • 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit.
  • 3:29pm: Confirmation received that another customer experienced the same issue.
  • 4:30pm: Changes implemented to increase memory request and limit on call-workers.
  • March 14th 12:27pm: Changes dep...

1 previous update

Mar 13, 2025
1 incident

sip.vapi.ai degradation

Degraded

Resolved Mar 17 at 09:00pm PDT

RCA: SIP 480 Failures (March 13-14)

Summary
Between March 13-14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment.

Incident Timeline
(All times in PT)

March 13, 2025
07:00 AM – SIP...

2 previous updates

Mar 12, 2025
1 incident

Dashboard is unavailable.

Degraded

Resolved Mar 12 at 02:00pm PDT

TL;DR

While responding to the concurrent Deepgram bug, we noticed that an unstable Lodash change committed to main had leaked into the production dashboard.

Timeline in PST

  • 12:10am - Breaking changes were introduced to the main branch
  • 12:42am - Afterwards, another commit was merged to main
    • This merge incorrectly triggered a deployment of the production dashboard
  • 12:45am - A rollback in Cloudflare Pages was completed, restoring service
  • 12:47am - Shortly afterward, a fix w...

1 previous update

Mar 11, 2025
1 incident

We are seeing degraded service from Deepgram

Degraded

Resolved Mar 11 at 12:59am PDT

TL;DR

An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls.

Timeline in PST

  • 12:03am - Rollout to prod1 containing the offending change is started
  • 12:13am - Rollout to prod1 is complete
  • 12:25am - A huddle in #eng-scale is started
  • 12:43am - Rollback to prod3 is started
  • 12:55am - Rollback to prod3 is complete

Root Cause

  • An application-level bug related...

1 previous update

Mar 10, 2025
1 incident

Increased call start errors due to Vapi fault transport errors + Twilio timeouts

Degraded

Resolved Mar 10 at 07:18pm PDT

RCA: vapifault-transport-never-connected errors caused call failures
Date: 03/10/2025

Summary:
A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts.

Timeline:
1. 5:50am: A few calls start having issues starting due to vapifau...
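
Purely as a sketch of the mechanism described above (threshold, interval, and restart hook are hypothetical, not Vapi's restart logic): a watchdog that restarts a worker once resident memory crosses a configurable threshold. Setting that threshold too close to normal usage produces exactly this kind of spurious restart.

```ts
// Hypothetical memory watchdog: trigger a graceful restart once RSS crosses a
// configurable threshold. If the threshold sits too close to normal usage,
// routine memory growth causes unnecessary restarts (the failure mode above).
const MEMORY_RESTART_THRESHOLD_BYTES = Number(
  process.env.MEMORY_RESTART_THRESHOLD ?? 1.5 * 1024 ** 3
);

function checkMemory(restart: () => void): void {
  const rss = process.memoryUsage().rss;
  if (rss > MEMORY_RESTART_THRESHOLD_BYTES) {
    console.warn(`rss ${rss} exceeds threshold ${MEMORY_RESTART_THRESHOLD_BYTES}; restarting worker`);
    restart();
  }
}

// Check every 30 seconds; exit(0) stands in for a graceful drain-and-restart.
setInterval(() => checkMemory(() => process.exit(0)), 30_000);
```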

2 previous updates

Mar 07, 2025
1 incident

Increased Twilio errors causing 31902 & 31920 websocket connection issues. In...

Degraded

Resolved Mar 07 at 02:00pm PST

We have rolled back the faulty release that caused this issue and are monitoring the situation.

1 previous update

Mar 06, 2025
1 incident

Vonage inbound calling is degraded

Resolved Mar 06 at 02:39pm PST

The issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate the schema.
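
As an illustration (field names are hypothetical, not Vonage's actual webhook schema, and zod is used only as an example validator): validating inbound webhook payloads with a tolerant schema lets unexpected extra fields pass through instead of rejecting the whole call.

```ts
// Hypothetical tolerant validation for an inbound-call webhook: unknown extra
// fields are passed through rather than failing validation outright.
import { z } from "zod";

const InboundCallWebhook = z
  .object({
    uuid: z.string(),
    from: z.string(),
    to: z.string(),
  })
  .passthrough(); // tolerate fields we don't model yet

export function parseInboundCall(payload: unknown) {
  const result = InboundCallWebhook.safeParse(payload);
  if (!result.success) {
    // Reject only when required fields are genuinely missing.
    throw new Error(`unrecognized inbound call payload: ${result.error.message}`);
  }
  return result.data;
}
```
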

Mar 05, 2025
2 incidents

Signups temporarily unavailable

Resolved Mar 05 at 10:00pm PST

The API bug was reverted and we confirmed service restoration

Weekly cluster at capacity limits

Degraded

Resolved Mar 05 at 12:04pm PST

We are seeing calls go through fine now, and are still keeping an eye out

1 previous update