Incidents | Vapi

SIP Maintenance

Mon, 01 Dec 2025 00:30:00 -0000

We will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause.

Call Logs after Nov 22 not available

Sat, 29 Nov 2025 18:40:00 -0000

We identified the issue as a misconfiguration in the read-API endpoint. The fix has been applied, and all call logs should now display correctly. No data was lost.

Call Logs after Nov 22 not available

Sat, 29 Nov 2025 18:00:00 -0000

We have identified an issue in Dashboard → Call Logs that is preventing call records after November 22 (PST) from appearing in the interface. API access to call logs does not seem affected.

Authentication systems maintenance

Thu, 27 Nov 2025 06:30:25 -0000

Maintenance completed

Authentication systems maintenance

Thu, 27 Nov 2025 06:00:25 -0000

We’re performing a planned resize of our authentication database, which will require a brief restart of the instance. During this window, users may experience elevated errors when signing in or signing up. Call functionality will not be affected. The process is typically completed within one minute.

Call Logs after 4 PM PT not loading

Thu, 20 Nov 2025 02:40:00 -0000

Affected call logs have been successfully restored. We will providing a detailed RCA soon.

Call Logs after 4 PM PT not loading

Thu, 20 Nov 2025 01:06:00 -0000

We have fixed the sync issue and we see new calls are showing up on the dashboard again. We are still working on restoring the calls logs between 4:00 pm and 5:06 pm PT.

Call Logs after 4 PM PT not loading

Thu, 20 Nov 2025 00:00:00 -0000

We have identified an issue in our DB read replicas that is not displaying call logs after 4 PM PT in dashboard. We are working on fixing the issue. This does not impact active calls and we don't believe there is data loss at this time.

API Degradation

Tue, 18 Nov 2025 17:00:00 -0000

Cloudflare has resolved their issues and our services are restored.

API Degradation

Tue, 18 Nov 2025 17:00:00 -0000

Cloudflare has resolved their issues and our services are restored.

API Degradation

Tue, 18 Nov 2025 17:00:00 -0000

Cloudflare has resolved their issues and our services are restored.

API Degradation

Tue, 18 Nov 2025 12:33:00 -0000

We are experiencing increased API failures due to a widespread Cloudflare outage. Our systems remain operational, but requests routed through Cloudflare may fail or time out. This issue originates at the Cloudflare level and is impacting multiple services globally.

API Degradation

Tue, 18 Nov 2025 12:33:00 -0000

API Degradation

Tue, 18 Nov 2025 12:33:00 -0000

Vapi SIP recovered

Tue, 18 Nov 2025 12:02:23 +0000

Vapi SIP recovered

Vapi DB recovered

Tue, 18 Nov 2025 11:52:08 +0000

Vapi DB recovered

Vapi API recovered

Tue, 18 Nov 2025 11:50:43 +0000

Vapi API recovered

Cartesia recovered

Tue, 18 Nov 2025 11:49:15 +0000

Cartesia recovered

Vapi API [Weekly] recovered

Tue, 18 Nov 2025 11:46:55 +0000

Vapi API [Weekly] recovered

Vapi SIP went down

Tue, 18 Nov 2025 11:42:17 +0000

Vapi SIP went down

Vapi API went down

Tue, 18 Nov 2025 11:42:06 +0000

Vapi API went down

Vapi DB went down

Tue, 18 Nov 2025 11:40:10 +0000

Vapi DB went down

Vapi API [Weekly] went down

Tue, 18 Nov 2025 11:35:48 +0000

Vapi API [Weekly] went down

Cartesia went down

Tue, 18 Nov 2025 11:32:58 +0000

Cartesia went down

Gladia concurrency limit affected

Mon, 17 Nov 2025 15:00:00 -0000

We have temporarily increased our concurrency limits with provider and working on a long term solution.

Gladia concurrency limit affected

Mon, 17 Nov 2025 13:25:00 -0000

We've identified an spike in call ended reasons with: call.in-progress.error-vapifault-gladia-transcriber-failed. Caused due to a concurrency limit with the provider. While we work on resolving the issue, we recommend switching to another Transcriber.

Call concurrency limit affected

Sun, 16 Nov 2025 16:57:00 -0000

concurrency limit has been reset and jobs are processing normally. issue has been resolved as of 9:05 PT

Deepgram STT degradation

Thu, 13 Nov 2025 23:26:00 -0000

Our STT provider has made a fix on their end and are reporting improvement. We are continuing to monitor while we push out our own improvement: https://status.deepgram.com/incidents/vgsyqxkc67by.

Deepgram STT degradation

Thu, 13 Nov 2025 23:26:00 -0000

Our STT provider has made a fix on their end and are reporting improvement. We are continuing to monitor while we push out our own improvement: https://status.deepgram.com/incidents/vgsyqxkc67by.

Deepgram STT degradation

Thu, 13 Nov 2025 22:53:00 -0000

Deepgram has confirmed there is an issue on their end resulting in increased latency that may cause calls to drop. We are making a change internally to handle the exception properly.

Deepgram STT degradation

Thu, 13 Nov 2025 22:53:00 -0000

Deepgram has confirmed there is an issue on their end resulting in increased latency that may cause calls to drop. We are making a change internally to handle the exception properly.

Deepgram STT degradation

Thu, 13 Nov 2025 22:24:00 -0000

We have found an issue with increased latency from one of our providers which is resulting in call failures.

Deepgram STT degradation

Thu, 13 Nov 2025 22:24:00 -0000

We have found an issue with increased latency from one of our providers which is resulting in call failures.

Deepgram STT degradation

Thu, 13 Nov 2025 22:12:00 -0000

We are seeing calls being dropped for both daily/weekly channels.

Deepgram STT degradation

Thu, 13 Nov 2025 22:12:00 -0000

We are seeing calls being dropped for both daily/weekly channels.

Calls to openAI provider are affected

Thu, 13 Nov 2025 10:51:00 -0000

Issue has been mitigated as of 2:50 AM PT

Calls to openAI provider are affected

Thu, 13 Nov 2025 10:51:00 -0000

Issue has been mitigated as of 2:50 AM PT

Calls to openAI provider are affected

Thu, 13 Nov 2025 09:46:00 -0000

Calls to openAI provider are affected from 12:15 AM PT, we are actively investigating the issue

Calls to openAI provider are affected

Thu, 13 Nov 2025 09:46:00 -0000

Calls to openAI provider are affected from 12:15 AM PT, we are actively investigating the issue

SIP calls are degraded

Thu, 13 Nov 2025 02:05:00 -0000

Nov 7th 2025 SIP service degradation Summary ---------- On Friday, November 7th, 2025, one of our SIP gateway experienced a failure, causing inbound and outbound Vapi SIP calls to be disrupted between 10:30 AM and 12:15 PM PST Context --------- All Vapi SIP calls go through our SIP infrastructure which handles SIP trunking, authentication, and registration. When an inbound SIP call arrives, the SIP SBC authenticates and validates it, making a webhook call to our API server for call registration. Once calls are registered, SBC establishes a bidirectional websocket connection (via websocket proxy) to call workers for real-time call processing and audio streaming. Root Cause ------------ Our SIP gateway runs on dedicated infrastructure which runs stateful workloads. This part of our infrastructure was missing log archival configuration. Over time, application logs accumulated and filled the available disk space, causing the server to crash and become unresponsive.This issue was compounded by the absence of disk space monitoring and alerting, which delayed our detection and response. Resolution ---------- Once the issue was identified, our engineering team took the following actions: Cleared accumulated logs to restore available disk space - Restarted SIP gateway services and validated recovery - Implemented immediate log rotation on the affected host - Verified all SIP services were operational before resuming normal operations What We’re Doing to Prevent This -------------------------------- Immediate Actions (Completed) - Deployed disk space monitoring with alerts at 75% utilization - Fixed SIP gateway metrics-based alerts to detect node failures and missing metrics - Added volume-based alerts for all stateful SIP instances Expected results: Early detection of issues affecting SIP gateway instances including high disk usage, node failures, or no metrics, so that any disruption to call processing can be identified and resolved before impacting customers. Short-Term Actions (In Progress – 30 Days) --------------------------------------------- - Implement comprehensive per-node health monitoring with automated alerting - Enhance our synthetic phone health checks to test individual SIP nodes for stateful service health - Deploy hot standby SIP instances for immediate failover capability Expected results: Capture all functional issues at the individual SIP instance level, and ensure that in the event of a failure, we can immediately failover manually to a standby SIP gateway instance to remediate quickly. Long-Term Improvements (Next 60 Days) -------------------------------------------- High Availability: - Implement automated SIP failover based on instance health checks - Perform quarterly automated failover tests to verify reliability Expected results: Failed SIP instances are automatically removed and replaced with healthy nodes, ensuring minimal or no manual intervention and uninterrupted service continuity.

SIP Database Maintenance

Tue, 11 Nov 2025 04:00:31 -0000

Maintenance completed

SIP Database Maintenance

Tue, 11 Nov 2025 02:00:31 -0000

We are seeing moderately higher latency on our SIP database (separate from our core application databases) resulting in slightly higher SIP response times (1-1.5 seconds). We will be performing critical maintenance on our database to remediate this issue.

Database maintenance

Sun, 09 Nov 2025 04:19:38 -0000

Maintenance completed

Database maintenance

Sun, 09 Nov 2025 04:00:55 -0000

We'll be adjusting our database configuration during a brief maintenance window of up to 1 hour. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause.

Database maintenance

Sun, 09 Nov 2025 04:00:55 -0000

SIP Maintenance

Sun, 09 Nov 2025 01:44:20 -0000

Maintenance completed

SIP Maintenance

Sun, 09 Nov 2025 01:00:00 -0000

We are making a minor change to our SIP service (hosted at sip.vapi.ai) that may result in some downtime.

SIP Maintenance

Sun, 09 Nov 2025 01:00:00 -0000

We are making a minor change to our SIP service (hosted at sip.vapi.ai) that may result in some downtime.

Degradation in SIP calls

Sat, 08 Nov 2025 08:51:00 -0000

We are working on RCA for SIP degradation, we will share it by November 12th

Database maintenance

Sat, 08 Nov 2025 05:00:33 -0000

Maintenance completed

SIP maintenance

Sat, 08 Nov 2025 04:30:00 -0000

Maintenance completed

Database maintenance

Sat, 08 Nov 2025 04:00:33 -0000

SIP maintenance

Sat, 08 Nov 2025 03:00:00 -0000

We will be performing critical maintenance on our SIP infrastructure. Calls may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause.

Degradation in SIP calls

Fri, 07 Nov 2025 20:17:00 -0000

The SIP issue has been resolved. We will continue to monitor our systems.

Degradation in SIP calls

Fri, 07 Nov 2025 20:04:00 -0000

SIP calls are still degraded. Team is actively working on remediation.

Degradation in SIP calls

Fri, 07 Nov 2025 19:27:00 -0000

We are seeing degradation in SIP calls. The team is currently investigating the issue.

SIP calls are degraded

Fri, 07 Nov 2025 13:49:00 -0000

SIP calls are currently degraded. We're looking into it.

Elevated errors in connecting calls.

Tue, 28 Oct 2025 23:01:00 -0000

We experienced a spike in call connection errors between 3:40 and 3:58. The issue has since been resolved.

OpenAI recovered

Mon, 27 Oct 2025 05:22:10 +0000

OpenAI recovered

OpenAI went down

Mon, 27 Oct 2025 03:48:29 +0000

OpenAI went down

API + DB Degradation

Wed, 22 Oct 2025 19:35:00 -0000

We are seeing increased latency and requests timing out from API and DB degradation. We are working with our DB provider to resolve this, and have made a change. Now monitoring to ensure improvement. There was a DB restart and things are looking normal now. The issues have been resolved.

API + DB Degradation

Wed, 22 Oct 2025 19:35:00 -0000

API + DB Degradation

Wed, 22 Oct 2025 19:35:00 -0000

Elevated errors in api

Tue, 21 Oct 2025 16:54:00 -0000

we experienced elevated errors in our api (for phone calls create) at 9:35 AM PT for few minutes. This has been resolved.

Elevated errors in api

Tue, 21 Oct 2025 16:54:00 -0000

we experienced elevated errors in our api (for phone calls create) at 9:35 AM PT for few minutes. This has been resolved.

OpenAI recovered

Mon, 20 Oct 2025 08:18:55 +0000

OpenAI recovered

OpenAI went down

Mon, 20 Oct 2025 08:08:16 +0000

OpenAI went down

Google Gemini recovered

Thu, 16 Oct 2025 17:22:30 +0000

Google Gemini recovered

Google Gemini went down

Thu, 16 Oct 2025 15:47:03 +0000

Google Gemini went down

Increased 5XXs for a minute

Wed, 15 Oct 2025 18:31:00 -0000

We had a restart on our database endpoint leading to a small blip in 500s for api endpoints.

Increased 5XXs for a minute

Wed, 15 Oct 2025 18:31:00 -0000

We had a restart on our database endpoint leading to a small blip in 500s for api endpoints.

Worker died errors on calls.

Tue, 14 Oct 2025 09:00:00 -0000

We had a small blip on daily channel with call.in-progress.error-vapifault-worker-died errors due to a new daily deployment. We have rolled it back.

Degradation for inbound twilio calls

Mon, 13 Oct 2025 20:11:00 -0000

The issue with twilio inbound calls failing on daily has been resolved. The root cause was connection timeouts on a new egress proxy service.

Degradation for inbound twilio calls

Mon, 13 Oct 2025 20:01:00 -0000

We are seeing degradation in twilio inbound calls. Only daily channel is affected.

Call logs not visible on dashboard

Mon, 13 Oct 2025 07:45:00 -0000

We detected calls logs not reflecting in the dashboard for some time. This was due to an error while attaching a partition and has been resolved now. Call logs will be populated soon if they were missing.

Vapi API [Weekly] recovered

Sun, 12 Oct 2025 22:47:43 +0000

Vapi API [Weekly] recovered

Vapi API [Weekly] went down

Sun, 12 Oct 2025 22:39:05 +0000

Vapi API [Weekly] went down

Increased Latency in Vapi Web Call

Sat, 04 Oct 2025 23:22:00 -0000

After further investigation with our WebRTC provider, this does not seem to be platform issue. We will follow up with impacted users directly.

Increased Latency in Vapi Web Call

Sat, 04 Oct 2025 23:22:00 -0000

After further investigation with our WebRTC provider, this does not seem to be platform issue. We will follow up with impacted users directly.

Increased Latency in Vapi Web Call

Fri, 03 Oct 2025 23:00:00 -0000

We have detected increased latency in Vapi Web Calls which may prevent certain users from joining the call and ultimately ending the call with ended reason: `call.in-progress.error-assistant-did-not-receive-customer-audio`. We are actively working with our WebRTC provider to resolve the issue. To mitigate this, you can try increasing the `customerJoinTimeoutSeconds` property of your assistant. ```bash curl -X PATCH https://api.vapi.ai/assistant/ \ -H "Authorization: Bearer " \ -H "Content-Type: application/json" \ -d '{ "customerJoinTimeoutSeconds": 60 }' ```

Increased Latency in Vapi Web Call

Fri, 03 Oct 2025 23:00:00 -0000

Elevated API errors between 9:30-11:30am PT on 30th Sept.

Tue, 30 Sep 2025 19:11:00 -0000

We experienced intermittent spikes in 5xx errors on our APIs in the weekly cluster. The root cause was identified, and a fix has already been implemented. During this period, both inbound and outbound calls may have been affected, as they rely on the APIs for data, resulting in potential service degradation.

Play.ht recovered

Mon, 29 Sep 2025 06:45:37 +0000

Play.ht recovered

Play.ht went down

Mon, 29 Sep 2025 05:53:28 +0000

Play.ht went down

Play.ht recovered

Sun, 28 Sep 2025 09:47:27 +0000

Play.ht recovered

Play.ht went down

Sun, 28 Sep 2025 08:54:18 +0000

Play.ht went down

Degradation in connecting SIP calls.

Fri, 12 Sep 2025 22:47:00 -0000

Services are restored.

Degradation in connecting SIP calls.

Fri, 12 Sep 2025 22:19:00 -0000

We're noticing slight increase in failures to connect SIP calls. Our team is investigating on priority.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:27:00 -0000

Problem has been resolved now. All services are healthy again.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:27:00 -0000

Problem has been resolved now. All services are healthy again.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:27:00 -0000

Problem has been resolved now. All services are healthy again.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:17:00 -0000

We are investigating the problem.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:17:00 -0000

We are investigating the problem.

High errors with connecting to our assistant Db. Calls and API are affected.

Wed, 10 Sep 2025 19:17:00 -0000

We are investigating the problem.

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 01:43:00 -0000

We have fixed the issue. Call Logs are returned correctly by API

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 01:43:00 -0000

We have fixed the issue. Call Logs are returned correctly by API

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 01:43:00 -0000

We have fixed the issue. Call Logs are returned correctly by API

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 00:00:00 -0000

We’ve identified an issue in our API that is preventing call logs from loading after 00:00 UTC September 5, 2025. Our team is actively working on a fix. Calls are not affected, and there is no data loss. We’ll provide updates here as soon as the issue is resolved.

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 00:00:00 -0000

Call Logs Since 09/05 00:00 UTC not loading

Fri, 05 Sep 2025 00:00:00 -0000

Cartesia Voices Degraded

Thu, 04 Sep 2025 20:55:00 -0000

Cartesia has resolved the issue, and is full operational.

Cartesia Voices Degraded

Thu, 04 Sep 2025 20:13:00 -0000

Cartesia voices are experiencing a service degradation and returning 500s which might cause calls to end with call.in-progress.error-vapifault-cartesia-voice-failed. We are closely monitoring the issue, and recommend setting a voice fallback or moving to Vapi voices while this is resolved. You can also track the status at https://status.cartesia.ai/

Elevated cases of assistant not responding in calls on daily cluster

Wed, 03 Sep 2025 10:06:00 -0000

We have identified the root cause. The problem has been fixed by rolling back a recent deployment on daily.

Elevated cases of assistant not responding in calls on daily cluster

Wed, 03 Sep 2025 09:56:00 -0000

We are seeing cases of calls going silent. Either assistant is not responding, causing silence timeouts.

Call Transfers Degradation on Vapi Phone Number

Tue, 02 Sep 2025 19:24:00 -0000

We have scaled up our telephony infrastructure resources and bumped our rate limits. We haven't seen any more issues in the last 20 minutes, and call transfers are working as expected now. We are closely monitoring.

Call Transfers Degradation on Vapi Phone Number

Tue, 02 Sep 2025 14:00:00 -0000

We have identified a high error rate on call transfers when using Vapi Phone Numbers. To the end-user this may cause call drops when assistants initiate a call transfer. The team is actively working in our SIP infrastructure to resolve this.

Deepgram Aura-2 TTS Performance Degradation

Thu, 28 Aug 2025 19:04:00 -0000

Deepgram is investigating an issue where a subset of requests may return elevated rates of 5XX errors or experience significantly higher time to first byte

Aura-2 TTS Performance Degradation

Wed, 27 Aug 2025 21:00:00 -0000

This incident has been resolved.

Aura-2 TTS Performance Degradation

Wed, 27 Aug 2025 17:00:00 -0000

Deepgram reported high rate of 500 errors when using their Aura-2 Voices, which may impact Vapi calls if using this provider. (e.g. ended reason call.in-progress.error-vapifault-deepgram-voice-failed) Follow deepgram's incident report here: https://status.deepgram.com/incidents/sl3zxvhddf1w Recommendations 1. Temporarily switch to another voice, like Vapi or Elevenlabs. 2. Configure a voice fallback. Your calls will still go to Deepgram first but if it fails, it will switch voice to another provider. We won't drop any calls but the user will hear another voice.

Intermittent issues with the dashboard loading

Tue, 26 Aug 2025 00:04:00 -0000

The issue was pinpointed and reverted quickly

Intermittent issues with the dashboard loading

Mon, 25 Aug 2025 23:57:00 -0000

The team has determined the code change which caused the issues and rolled it back. We are continuing to monitor.

Intermittent issues with the dashboard loading

Mon, 25 Aug 2025 21:50:00 -0000

We are seeing issues with the dashboard sidebar loading for some customers. We are looking into it and will update here as we know more. For the time being, users can workaround this issue by clearing local cache and cookies.

Elevated error rates on ElevenLabs Voice requests

Wed, 06 Aug 2025 04:14:00 -0000

Elevenlabs released a fix and is fully operational now. We are also seeing normal levels, but will continue to monitor. For impacted users, we recommend implementing Vapi Fallback Plan to automatically failover in the future https://docs.vapi.ai/voice-fallback-plan

Elevated error rates on ElevenLabs Voice requests

Wed, 06 Aug 2025 02:34:00 -0000

ElevenLabs is currently dropping requests due to elevated loads. We're closely monitoring the situation. Some calls using Vapi or Elevenlabs Voices might be degraded. We recommend switching to Cartesia TTS while this is being resolved. We recommend leveraging Vapi Fallback Plan to automatically fallback in the future: https://docs.vapi.ai/voice-fallback-plan

API and calls were momentarily disrupted

Tue, 05 Aug 2025 19:12:00 -0000

# IR August 4th: Call Degradation due to Pod Evictions ## TL;DR On August 4th, an incident occurred due to aggressive pod consolidation by Karpenter, which caused Redis pods to be evicted and restarted. This led to API pod failures, triggering a failover to an outdated networking component, resulting in dropped calls. The incident caused a total of 393 calls to be dropped. ## Timeline (PST) ### August 4th - **11:02-11:27 AM** - Core team identifies Karpenter pods in CrashLoopBackOff (OOMKilled due to high call volume), leading to aggressive pod consolidation. - **11:27 AM** - Redis pods evicted with message: `"Evicted pod: Drifted."` Redis pods restart on new nodes, causing dependent API pods to fail. - **11:28 AM** - Cloudflare load balancer detects failing API pods and initiates failover to a secondary networking component. - **11:28-11:29 AM** - The secondary networking component, outdated and improperly scaled, misroutes traffic, resulting in additional call failures. - **11:29 AM** - Worker unavailability due to misrouting causes a total of 393 call drops. - **11:30 AM** - Corrective rollout completes, restoring worker availability. - **11:31 AM** - Stability restored. ## Root Cause The incident was triggered by aggressive node consolidation from Karpenter following initial resource constraints. Critical Redis pods were evicted without adhering to their PodDisruptionBudgets (PDBs), causing API pod failures. This failure initiated a Cloudflare load balancer failover to an outdated networking component, resulting in dropped calls. ## Impact - **393 total calls dropped** due to worker unavailability. - Temporary service disruption impacting Redis and API services. ## What Went Well? - Quick response by the incident response team. ## What Went Poorly? - Networking components were not maintained in parity, worsening the impact during failover. - PodDisruptionBudgets (PDBs) for Redis pods were improperly configured, allowing unintended evictions. - Lack of monitoring for Karpenter restarts delayed detection by several hours. ## Remediation steps taken - Increase memory limits for Karpenter in configuration management. - Add protective annotations (`karpenter.sh/do-not-disrupt: "true"`) to critical Redis pods. - Integrate Karpenter logs with centralized logging for improved visibility. - Implement monitoring to detect Karpenter pod restarts. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API and calls were momentarily disrupted

Mon, 04 Aug 2025 18:14:00 -0000

Around 11:00 AM, a sudden surge in call volume that caused connection failures. The same spike also disrupted the API, likely resulting in multiple 5xx errors.

SIP Maintenance

Thu, 31 Jul 2025 04:30:00 -0000

Maintenance completed

SIP Maintenance

Thu, 31 Jul 2025 04:15:00 -0000

We need to perform an upgrade on our SIP service. Therefore, the SIP service needs a restart. It should be a quick restart but there might be some disruption with ongoing calls and incoming calls, during the restart.

Deepgram is experiencing connection issues

Tue, 29 Jul 2025 05:19:00 -0000

We are no longer seeing signs of connection issues. The issue should be resolved now, but we will continue to monitor. We apologize for any inconvenience caused.

Deepgram is experiencing connection issues

Mon, 28 Jul 2025 18:50:00 -0000

Deepgram is experiencing intermittent issues with their WebSocket connections for both transcription and voice services. This may impact the experience in your assistants. Recommended Action: Temporarily switch to another provider, or configure a fallback transcriber / voice

Incident Report: Increased call failures on July 25 (PST)

Fri, 25 Jul 2025 21:46:00 -0000

**Incident Report:** Increased call failures on July 25 (PST) **Summary (TL;DR)** On July 25 between 7:00–7:15am PST, a spike in call volume caused some calls to fail with a worker-not-available error. The fallback service for short calls (our serverless workers) could not start because its image architecture didn’t match the configured runtime (image built for ARM64, runtime set to x86). We stabilized the platform by scaling primary workers and then corrected the configuration. Service is operating normally. **Impact** Total failed calls: 3,028 between 7:00–7:15am PST with error call.in-progress.error-vapifault-worker-not-available. 1,122 of these calls were eligible to be handled by our serverless workers but still failed. **Current status**: Resolved. No action is required from customers. If you experienced failures during this window, please retry the affected calls. **Timeline (PST, July 25)** 7:00am: Sudden spike in incoming calls. 7:00–7:15am: Elevated failures with worker-not-available. ~11:51am: Incident triage began; we confirmed our autoscaling attempted to invoke serverless workers. ~12:56pm: Root cause identified: serverless worker image was built for ARM64 while the runtime was still configured for x86, preventing startup. **After identification**: We increased capacity on primary backends to minimize reliance on the fallback path and then redeployed the serverless workers with the correct architecture. **Root Cause** A configuration mismatch between the container image architecture (ARM64) and the serverless runtime setting (x86) prevented our fallback workers from starting during a sudden traffic surge. **Remediation & Prevention** *Completed* Aligned serverless runtime architecture with the container image (ARM64 ↔ ARM64). Temporarily scaled primary worker capacity to handle surges while deploying the fix. *In Progress / Planned* **Automated canary tests**: Periodically invoke serverless workers to ensure readiness and catch regressions early. **Alerting**: Add targeted alerts when the fallback path is degraded or invocation rates drop unexpectedly. Build-time and deploy-time guards: Enforce architecture checks so image and runtime must match before deployment. **Dependency review**: Audit and, where needed, adjust dependencies to ensure reliable ARM operation in serverless environments.

Elevenlabs is experiencing increased Text-to-Speech latency

Thu, 24 Jul 2025 18:30:00 -0000

Elevenlabs released a hotfix and is fully operational now.

Elevenlabs is experiencing increased Text-to-Speech latency

Thu, 24 Jul 2025 13:40:00 -0000

Elevenlabs reported increased latency in text-to-speech requests (voice). This may impact the experience in your assistants. More details in: https://status.elevenlabs.io/incidents/01K0YAV1N4W7EZW1BMQCQJ4YJR Impact - Elevenlabs Voices - Vapi Voices Recommended Action: temporarily switch to another text to speech provider.

Regular DB maintenance

Wed, 23 Jul 2025 03:30:52 -0000

Maintenance completed

Regular DB maintenance

Wed, 23 Jul 2025 02:00:52 -0000

We are rolling out an important update to our database. Call logs and analytics may be impacted during this period. We appreciate your patience and apologize for any inconvenience this may cause.

Regular DB maintenance

Sat, 19 Jul 2025 08:30:00 -0000

Maintenance completed

Regular DB maintenance

Sat, 19 Jul 2025 07:30:00 -0000

We are rolling out security patches and important updates to our database. This would need a restart of all our database servers. Restarts are quick but there could be a few seconds of intermittent unavailability.

Web Calls from Assistant Page are unavailable

Thu, 17 Jul 2025 17:50:00 -0000

Dashboard calls should be back to normal.

Deepgram transcription is degraded

Thu, 17 Jul 2025 17:34:00 -0000

This was resolved

Web Calls from Assistant Page are unavailable

Thu, 17 Jul 2025 16:45:00 -0000

We are investigating an issue that blocks users from talking to their assistant from the Assistant page on Daily channel. Weekly channel and all other calls don't seem to be affected.

Deepgram transcription is degraded

Tue, 15 Jul 2025 19:43:00 -0000

Deepgram transcription is currently degraded due to rate limiting. This is causing an increase in transcriber failures and silence timeouts. We are working with the Deepgram team to resolve the issue.

SIP Infrastructure Changes

Sun, 13 Jul 2025 05:00:11 -0000

Maintenance completed

SIP Infrastructure Changes

Sun, 13 Jul 2025 03:00:00 -0000

We will be making changes to our SIP infrastructure which may result in some service degradation, especially for SIP REFER's and outbound calls.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:24:00 -0000

We have identified and resolved the issue. Apologies for the disruption.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:24:00 -0000

We have identified and resolved the issue. Apologies for the disruption.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:24:00 -0000

We have identified and resolved the issue. Apologies for the disruption.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:08:00 -0000

The call logs view is not showing up to date call history in the Vapi dashboard or API. The team is looking into it and will update here.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:08:00 -0000

The call logs view is not showing up to date call history in the Vapi dashboard or API. The team is looking into it and will update here.

Up to date calls not returned in API or dashboard

Thu, 10 Jul 2025 04:08:00 -0000

The call logs view is not showing up to date call history in the Vapi dashboard or API. The team is looking into it and will update here.

Playht voices are degraded

Tue, 08 Jul 2025 06:18:00 -0000

Resolved now.

Playht voices are degraded

Mon, 07 Jul 2025 00:12:00 -0000

Playht voices are not working reliably right now. We are working with PlayHT team to resolve it urgently. For now please switch to a different provider or add a fallback voice.

Database maintenance

Tue, 01 Jul 2025 05:30:32 -0000

Maintenance completed

Database maintenance

Tue, 01 Jul 2025 05:00:32 -0000

We'll be adjusting our database configuration during a brief maintenance window of up to 30 minutes. During this period, API requests may experience intermittent delays or errors, but we do not anticipate any significant disruption. We appreciate your patience and apologize for any inconvenience this may cause.

SIP is degraded

Tue, 01 Jul 2025 04:43:00 -0000

TLDR: A temporary slowdown caused by saturation in our API gateway layer increased response times until they exceeded the edge-network timeout, causing a 524 HTTP response for some API requests. Timeline in PST 01:00 AM First elevated 524 error responses detected 06:35 AM Rolled back recent backend release (no improvement) 07:19 AM Rolled back related network changes (no improvement). 08:22 AM Scaled up API gateway 09:36 AM Scaled up API gateway further 10:00 AM Reverted the previous night's SIP gateway update; error rate returned to normal. Impact - Based on our telemetry, a total of 58,769 requests were affected. - Distribution, grouped by request path: - /phone-number/status - 33,855 - /phone-number/hook - 8,277 - /phone-number/sip - 6,719 - /phone-number/inbound - 3,836 - 6082 across 25 other endpoints What went poorly? - Delayed root-cause isolation. Initial rollbacks focused on application and network layers, but the underlying issue originated elsewhere, leading to a longer mitigation window. - Saturation metrics for the API gateway layer were not being tracked, which slowed down error diagnosis. - Reverting changes to our SIP gateway is not a swift process, unlike rolling back our clusters. - On call should have escalated issue quicker. What went well? - Only SIP calls saw degradation, other customer traffic remained largely unaffected. Remediations - [x] Increase observability in the API gateway, specifically metrics - [x] Blue green deployments for our SIP gateway for quicker change reversion - [ ] Collaborate with our SIP gateway provider to investigate potential issues on the SIP gateway end If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Version upgrade for dedicated SIP instances

Sun, 29 Jun 2025 00:00:18 -0000

Maintenance completed

Version upgrade for dedicated SIP instances

Sat, 28 Jun 2025 20:00:18 -0000

We are scheduling a 4 hour maintenance window 2025-06-28 1-5pm PT to upgrade the version of our dedicated SIP infrastructure. There may be some disruption to calls during this time.

DB High Latency

Wed, 25 Jun 2025 18:54:00 -0000

The issue has been resolved: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM4F8Y8MZARWJWRFKV01AP5.

DB High Latency

Wed, 25 Jun 2025 18:54:00 -0000

The issue has been resolved: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM4F8Y8MZARWJWRFKV01AP5.

DB High Latency

Wed, 25 Jun 2025 17:38:00 -0000

Our database provider has reported high latency in our region, this will cause increased latency and possible timeouts in our service as well. We are monitoring here: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM23FB7QF0YZRZAQMZGHMEA.

DB High Latency

Wed, 25 Jun 2025 17:38:00 -0000

SIP is degraded

Wed, 25 Jun 2025 17:28:00 -0000

We rolled back a version change to our SIP infrastructure earlier today around 10am PT and since then have seen stability. We will update here with a more complete timeline and RCA tomorrow.

SIP is degraded

Wed, 25 Jun 2025 17:17:00 -0000

The issue has come up again. We are working with our SIP infrastructure provider to resolve.

SIP is degraded

Wed, 25 Jun 2025 14:35:00 -0000

We cutover to a previous deployment and are seeing improvement. We are continuing to monitor and will provide an RCA later today.

SIP is degraded

Wed, 25 Jun 2025 13:04:00 -0000

We are investigating an issue with our SIP gateway. We will update this thread with more information.

Weekly cluster call export maintenance

Wed, 25 Jun 2025 01:00:02 -0000

Maintenance completed

Weekly cluster call export maintenance

Tue, 24 Jun 2025 15:00:02 -0000

We’re currently performing maintenance on our analytics database. As a result, call exports from the weekly cluster may return blank CSV files until maintenance is complete. If you run into this issue and need to export data, please temporarily switch your organization’s export setting to daily, then revert back to weekly after exporting. Maintenance will finish by 6 PM PST today. Thank you for your patience.

Increased database latency causing requests to fail

Fri, 20 Jun 2025 20:43:00 -0000

Our database provider has reported this issue as resolved from their end

Increased database latency causing requests to fail

Fri, 20 Jun 2025 20:43:00 -0000

Our database provider has reported this issue as resolved from their end

Increased database latency causing requests to fail

Fri, 20 Jun 2025 18:04:00 -0000

We are seeing issues with API requests being timed out or aborted. This is because of an increase in latency from our database provider. We are monitoring the issue: https://neonstatus.com/aws-us-west-oregon.

Increased database latency causing requests to fail

Fri, 20 Jun 2025 18:04:00 -0000

Breaking changes to Success Evaluation API Response

Thu, 19 Jun 2025 02:42:00 -0000

## TL;DR In response to hallucinations reports in the Success Evaluation feature, we updated our integration with Gemini LLM to use Structured Output. This inadvertently changed the type of the call.analysis.successEvaluation field from string | null to string | number | boolean | null, introducing a breaking change for customers with strict type validation and those using Vapi Server SDKs. ## Timeline (all in PT) - June 12, 11:32pm: Enterprise and Startup users report hallucinations in Success Evaluation field. Engineer acknowledges reports and begins work in a solution by migrating to Gemini Structured Output. - June 16, 11:35pm: Migration to Structured Output is completed. Update passes automated code tests and is merged into main branch. - June 17, 1:24pm: Update is released, inadvertently introducing changes in the type of call.analysis.successEvaluation property. - June 18, 11:15am: Enterprise users reports breaking change in webhook message; investigation begins. - June 18, 1:51pm: Vapi team decides to retain the new type change and communicates to affected users, requesting updates to their servers to accept string | number | boolean | null. - June 18, 3:43pm: Enterprise users reports Go SDK-specific issue; investigation begins. - June 18, 4:08pm: Team identifies broader SDK impact and start work on a patch to revert API to string-only output while keeping Structured Output. - June 18, 7:42pm: Patch reverting API output to string-only is released. ## Impact Between June 17th 1:24 pm and June 18th 7:42 pm, organizations in daily channel, using strict type validation on their servers or using Vapi Server SDKs experienced issues when processing post call analysis events. ## What went wrong? - Automated tests failed to catch the breaking change in API response. - Poor communication of internal changes to core platform features. - Underestimated the impact, leading to a late rollback (+24hrs) ## What went well? - Organizations in weekly channel were not affected. - Calls were not affected on any of the channels. - Hallucination issue appears resolved. ## Action Items - Testing: Build comprehensive integration tests to catch response type changes. - Communication: Design better notifications and public changelog protocols for potential breaking changes. - Support: Support affected customers and requested server updates. Follow ups to confirm no further issues and assist with any remaining fixes.

Breaking changes to Success Evaluation API Response

Tue, 17 Jun 2025 20:24:00 -0000

Organizations in daily channel report breaking change end of call report. Property `call.analysis.successEvaluation` was migrated from `string | null` to `string | number | boolean | null`. Organizations in weekly channel are not affected.

Sign-ups/Sign-ins are not working

Fri, 13 Jun 2025 08:12:00 -0000

It is resolved.

Sign-ups/Sign-ins are not working

Thu, 12 Jun 2025 19:47:00 -0000

Supabase and its upstream provider Cloudflare are reporting that services are recovering. Similarly, we are seeing sign-ups and sign-ins working again, though there may be intermittent disruption to the service. We are continuing to monitor and observe our upstream providers status pages for change. https://status.supabase.com/ https://www.cloudflarestatus.com/

Sign-ups/Sign-ins are not working

Thu, 12 Jun 2025 18:19:00 -0000

We use Supabase for authentication which is having an issue due to a Cloudflare outage. Our authentication endpoint is down impacting auth flows for sign-ups and sign-ins. We are investigating. Phone calls are still working and our API is accessible. WebRTC (daily.co) calls will fail.

Elevenlabs voice provider not working with custom API Key

Thu, 12 Jun 2025 10:20:00 -0000

Summary: We experienced an issue related to API key validation within our WebSockets implementation when sending the API key more than once. Details: The issue arose during API key validation within our WebSockets implementation. Our system validates that the API key provided during the initial message is the same as proceeding messages. A recent change introduced during a release caused a mismatch in how API keys were compared. Specifically, the system was comparing hashed API keys against non-hashed API keys. This comparison would always fail, as hashed and non-hashed keys are inherently different. The impacted API keys were legacy API keys, which were not being hashed. Timeline (GMT +2): Release Ready: 9:23 AM Full Deployment: 9:52 AM Reported by Vapi: 12:28 PM Rollback Initiated: 12:53 PM Impact: This issue impacted a small number of clients using non-legacy API keys who also provided the API key multiple times during the WebSocket connection. Specifically, if the API key was provided during the initial connection and then again in subsequent messages, our system performs a validation check. Due to a flawed comparison between hashed and non-hashed API keys, this validation check failed for those clients sending API keys multiple times, resulting in the error you saw. Resolution: - The engineering team has implemented a fix to ensure API keys are compared correctly, regardless of whether they are hashed or non-hashed. The fix has been deployed. Preventative Measures: - To prevent similar issues in the future, the following steps are being taken: We already had tests for this, but unfortunately, we found issues with the tests that clearly didn't catch this because of a race condition. That race condition has since been solved. - We’ve also made sure the tests now block merges.

Elevenlabs voice provider not working with custom API Key

Mon, 09 Jun 2025 11:02:00 -0000

Services are back up now. Elevenlabs rolled back a change, errors have come down now, resolving it. We will keep monitoring the situation further for some time.

Elevenlabs voice provider not working with custom API Key

Mon, 09 Jun 2025 10:39:00 -0000

We are working with 11labs team to resolve an issue wherein 11labs are not working when users bring their own key on Vapi.

Weekly cluster maintenance

Tue, 03 Jun 2025 17:00:28 -0000

Maintenance completed

Weekly cluster maintenance

Tue, 03 Jun 2025 17:00:28 -0000

Weekly cluster is undergoing additional maintenance

Weekly cluster maintenance

Tue, 03 Jun 2025 17:00:28 -0000

Weekly cluster is undergoing additional maintenance

Weekly cluster maintenance

Tue, 03 Jun 2025 08:00:25 -0000

Maintenance completed

Weekly cluster maintenance

Mon, 02 Jun 2025 18:00:25 -0000

Weekly cluster is under additional monitoring and maintenance after update. We should have things resolved by tonight

API was down

Mon, 02 Jun 2025 06:00:00 -0000

API was down due to user error in routine maintenance. Service has since been restored

API was down

Mon, 02 Jun 2025 06:00:00 -0000

API was down due to user error in routine maintenance. Service has since been restored

API was down

Mon, 02 Jun 2025 05:45:00 -0000

API was down due to user error in routine maintenance. Service has since been restored

API was down

Mon, 02 Jun 2025 05:45:00 -0000

API was down due to user error in routine maintenance. Service has since been restored

Users Unable to Sign In to Dashboard

Tue, 27 May 2025 01:37:00 -0000

Summary Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment. Timeline (in PST): * 3:17 PM: Internal engineers identified issues affecting developer workflows. * 4:19 PM: Breaking change is introduced and unintentionally deployed to production * 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally. * 4:43 PM: Public status page updated to notify customers. * 4:54 PM: Corrective actions deployed. * 5:08 PM: Additional steps taken to accelerate resolution for users. * 5:17 PM: Issue fully resolved and status page updated accordingly. Impact: * Users were temporarily unable to log into the dashboard. * The issue was promptly reported and escalated by affected users. Root Cause: A configuration change intended to streamline internal development processes unintentionally led to the deployment of a staging version of our dashboard to the production environment. This occurred because the system did not adequately distinguish between environments in the deployment workflows, resulting in incorrect settings being applied in production. What Went Well: * Internal escalation was rapid, and the status page effectively informed users quickly. What Went Poorly: * Limited tooling for rapid rollbacks led to extended resolution time. * Insufficient clarity around deployment workflows contributed to the incident. Corrective Actions Taken: * Immediately reverted the unintended deployment and restored the correct production configuration. * Purged caches to expedite the resolution. Future Preventative Measures: * Enhance deployment configuration to clearly separate staging and production environments. * Improve tools and processes for more rapid rollback capabilities in future deployments.

Users Unable to Sign In to Dashboard

Tue, 27 May 2025 00:08:00 -0000

The sign-in issue has been resolved, and a fix has been successfully deployed. Users should now be able to access the dashboard as expected. We are currently preparing a RCA and will share it soon.

Users Unable to Sign In to Dashboard

Mon, 26 May 2025 23:40:00 -0000

We are currently investigating an issue preventing some users from signing in to the dashboard. The team is actively working on a fix. We will provide updates as progress is made. Thank you for your patience.

Cartesia voices are degraded

Sun, 18 May 2025 20:56:00 -0000

Everything is functional. We're still working with Cartesia to get to bottom. We'll change back to degraded if the issue raises again during investigation.

Cartesia voices are degraded

Sun, 18 May 2025 20:17:00 -0000

It's all working now as Cartesia team has bumped our limits. We're still investigating the issue

Cartesia voices are degraded

Sun, 18 May 2025 20:11:00 -0000

We're investigating an internal bug causing 429s on Cartesia.

Vapifault Worker Timeouts

Tue, 13 May 2025 17:31:00 -0000

# RCA: Vapifault Worker Timeouts ## TL;DR On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue reoccurred due to the prescaling limit being inadvertently reset to the lower default value during a routine deployment. ## Timeline (PT) - **May 12, 1:30 pm:** Customer reports issues related to worker timeouts. - **May 12, 4:39 pm:** Another customer reports the same issue with worker timeouts. - **May 12, 5:19 pm:** Workers scaled manually from 250 to 350; service restored. - **May 12, 11:48 pm:** Routine deployment resets worker prescale count back to 250. - **May 13, 10:47 am:** Customer reports recurrence of worker timeout issue. - Concurrent increase in overall call volume further exacerbates worker availability. - **May 13, 11:29 am:** Workers scaled again to 350 on weekly and increased to 750 on daily; service fully restored. ## Impact - Approximately **2,461 calls** dropped due to worker connection timeouts. ## What Went Wrong? - **Insufficient Monitoring:** Worker timeout events were not correctly captured by monitoring because of how `callEndedReason` is logged. - Customers identified and reported the issue before internal monitoring did. - **Configuration Drift:** Prescale worker count change was not committed to the main configuration branch, causing resets during routine deployments. - **Alert Handling:** Lambda invocation alerts fired but were deprioritized as "requires investigation but not urgent." ## What Went Well? - Rapid remediation once the problem was identified.

Vapifault Worker Timeouts

Tue, 13 May 2025 17:31:00 -0000

providerfault-transport errors

Tue, 13 May 2025 17:29:00 -0000

# RCA: Providerfault-transport-never-connected ## Summary During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume. ## Timeline of Events (PT) - **10:26 AM:** Significant spike in inbound call volume. - **10:26 – 10:40 AM:** Intermittent HTTP 520 errors returned by CDN for inbound call endpoints (46 calls impacted). - **11:00 AM – 12:00 PM:** Infrastructure intermittently failed to establish transport connections despite successfully picking up calls (172 calls impacted). - **12:00 PM:** Call volume returns to normal; errors cease. ## Root Cause Analysis ### 1. HTTP 520 Errors at CDN - High load triggered intermittent HTTP 520 errors for critical endpoints. - Internal tracing confirmed successful API responses not properly relayed back, indicating issues in network layers external to core services. - Active investigation ongoing with network provider to identify the underlying cause. ### 2. Resource Exhaustion on Proxy Service - During peak load, the proxy service responsible for handling call connections exhausted available CPU and memory resources (observed usage ~1.27 CPU cores and 1.2 GB RAM). - Insufficient resource allocation led to failed transport connections. - Logs showed degraded pod performance, including failures in auxiliary tasks like recording uploads. ## What Went Wrong? - **Misclassification of Errors:** Internally treated as external provider faults rather than recognizing infrastructure capacity issues. - **Insufficient Monitoring:** Lack of alerts and monitoring for proxy resource saturation conditions. - **Load-Testing Gap:** Prior load tests did not replicate proxy resource constraints encountered in production scenarios.

providerfault-transport errors

Tue, 13 May 2025 17:29:00 -0000

SIP calls abruptly closing after 30 seconds

Tue, 13 May 2025 17:27:00 -0000

# RCA: SIP Calls Ending Abruptly ## TL;DR A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts. ## Timeline (PT) - **May 12, ~9:00 pm:** SIP node rotated and Elastic IP reassigned, but SIP service was not restarted. - Calls appeared to succeed initially because they were routed through a healthy SIP node. - **May 13, 12:44 pm:** Customer reports SIP calls consistently failing after approximately 30-31 seconds. - **May 13, 12:49 pm:** SIP service restarted; customer confirms issue resolved. ## Impact - 35 calls experienced "ACK timeout" failures, corresponding directly to failed customer calls. ## What Went Wrong? - Lack of monitoring and alerting for SIP-related failures. - Issue persisted unnoticed for approximately 3 hours. - Customer reported issue first, not internal systems. - Absence of documented runbooks for SIP node rotation process. - No load test conducted following node rotation to verify successful SIP routing. ## What Went Well? - Rapid issue remediation following customer escalation.

Stale data for Weekly users

Tue, 13 May 2025 17:22:00 -0000

# RCA: Phone Number Caching Error in Weekly Environment ## TL;DR Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers. ## Timeline (PT) - **May 10, 1:26 am:** Caching re-enabled for users in daily environment using the feature flag. - **May 13, 10:42 am:** Customer reports phone calls referencing outdated server URLs after updates. - **May 13, 11:18 am:** Caching disabled globally; service fully restored. - **May 13, ~10:00 pm:** Fix deployed to weekly environment; caching globally re-enabled. ## Impact - Customers experienced degraded service; updates to server URLs or assistant configurations for phone numbers did not immediately reflect during calls. - Issue previously identified and resolved in daily environment resurfaced in weekly due to incomplete implementation of the feature flag. ## What Went Wrong? - Inadequate testing of the feature flag allowed unintended caching on some paths. - Lack of proper failure handling when organization ID was missing. - Issue surfaced through customer reporting, not internal monitoring. - Fix deployed to daily environment was not applied to weekly environment in time. ## What Went Well? - Feature flag system allowed rapid disabling of caching globally once identified.

Voice issues due to 11labs quota

Sat, 10 May 2025 17:34:00 -0000

# RCA: 11Labs Voice Issue ## TL;DR Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (`vapifault-eleven-labs-quota-exceeded`). ## Timeline of Events (PT) - **12:04 PM:** Calls begin failing due to 11Labs quota being exceeded. - **12:16 PM:** Customer reports the issue as a production outage. - **12:24 PM:** Contacted 11Labs support regarding quota exhaustion. - **12:25 PM:** 11Labs support recommends enabling usage-based billing. - **12:26 PM:** Usage-based billing activated; issue resolved immediately. ## Root Cause Analysis - The incident occurred because the monthly quota limit for 11Labs voice services was reached. - Example error log: ``` { "message": "This request exceeds your quota of 2000000000. You have 4 credits remaining, while 23 credits are required for this request.", "error": "quota_exceeded", "code": 1008 } ``` ## What Went Wrong? - Lack of proactive alerting: No paging occurred because logs were being sampled and adequate monitors were not in place in the new logging system. - Initial difficulty diagnosing the issue quickly due to limited familiarity with the new logging tool (Axiom). ## What Went Well? - Rapid response and effective support provided by the external vendor (11Labs). - Swift resolution once the problem was clearly identified.

Voice issues due to 11labs quota

Sat, 10 May 2025 17:34:00 -0000

Upgrading Weekly Cluster

Sun, 04 May 2025 04:20:35 -0000

Maintenance completed

Upgrading Weekly Cluster

Sun, 04 May 2025 03:20:35 -0000

Regular upgrades to cluster

API degradation

Sat, 03 May 2025 01:43:00 -0000

# RCA for May 2nd User error in manual rollout ## Root cause: * User error in kicking off a manual rollout, driven by unblocking a release * Due to this, load balancer was pointed at an invalid backend cluster ## Timeline * 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout * 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release * 5:43pm PT: On-call was paged with issue in rollout manager, engineering team internally escalated downtime * 5:45pm PT: Infra engineer fixed misconfigured rollout and confirmed load balancer was correctly pointed * 5:50pm PT: Engineering team manually tested API and calls were working again ## Impact * Calls, API and dashboard were down or degraded for up to 15 minutes * User experience was disrupted temporarily; Issue reported internally and by self-serve users ## What went wrong? * We rushed through a manual rollout, which is gated to Infra team * Manual rollout tools did not catch user error ## What went well? * Our pagers flagged this issue * Team responded quickly and was able to mitigate * Status page was put up proactively ## Action Items: * Update manual deployment tools to avoid such user error [Done] * Expand rollout auto-blocking mechanism to incorporate other pages [Done] * Better documentation for rollout/rollback steps * Further lock down manual deployment, gate behind approval by 1 more infra eng

API degradation

Sat, 03 May 2025 01:43:00 -0000

API degradation

Sat, 03 May 2025 00:54:00 -0000

We identified the root cause of the issue in a bad deployment. The team rolled out a fix. API is fully operational again.

API degradation

Sat, 03 May 2025 00:54:00 -0000

We identified the root cause of the issue in a bad deployment. The team rolled out a fix. API is fully operational again.

API degradation

Sat, 03 May 2025 00:44:00 -0000

Some API endpoints may be unavailable. Team is working on implementing a fix.

API degradation

Sat, 03 May 2025 00:44:00 -0000

Some API endpoints may be unavailable. Team is working on implementing a fix.

Call Recordings May Fail For Some Users

Wed, 30 Apr 2025 06:59:00 -0000

We have resolved the issue. Will upload RCA 04/30 noon PST. TL;DR: Recordings weren't uploaded to object storage due to some invalid credentials. We generated and applied new keys.

Call Recordings May Fail For Some Users

Wed, 30 Apr 2025 06:59:00 -0000

We have resolved the issue. Will upload RCA 04/30 noon PST. TL;DR: Recordings weren't uploaded to object storage due to some invalid credentials. We generated and applied new keys.

Call Recordings May Fail For Some Users

Wed, 30 Apr 2025 05:30:00 -0000

Some users may not receive call recordings due to an issue with our Cloudflare R2 Storage, the team is deploying a fix now

Call Recordings May Fail For Some Users

Wed, 30 Apr 2025 05:30:00 -0000

Some users may not receive call recordings due to an issue with our Cloudflare R2 Storage, the team is deploying a fix now

Auth DB restart

Fri, 25 Apr 2025 05:05:00 -0000

Maintenance completed

Auth DB restart

Fri, 25 Apr 2025 05:00:26 -0000

We will be performing a brief restart of our authentication database to accommodate increased scale. This maintenance is expected to complete within one minute. We appreciate your patience and apologize for any inconvenience. Should only impact the signin & signup on dashboard. Calls and other APIs will not be impacted by it.

Increased 404 Errors Related to Phone Numbers Found

Tue, 22 Apr 2025 11:39:00 -0000

We have determined the issue and resolved. We will update by noon PST with an RCA. TL;DR: Adding a new CIDR range to our SIP cluster caused issues where the servers were unable to discover each other.

Increased 404 Errors Related to Phone Numbers Found

Tue, 22 Apr 2025 09:58:00 -0000

We are seeing an increase in 404 responses for SIP outbound calls.

Upgrading Weekly API

Tue, 15 Apr 2025 19:12:26 -0000

Maintenance completed

Upgrading Weekly API

Tue, 15 Apr 2025 18:21:20 -0000

Applying performance optimizations

Upgrading Weekly API

Tue, 15 Apr 2025 18:21:20 -0000

Applying performance optimizations

Increased 480 Temporarily Unavailable cases for SIP inbound

Tue, 08 Apr 2025 05:00:00 -0000

For RCA please checkout https://status.vapi.ai/incident/528384?mp=true

SIP calls failing intermittently

Tue, 08 Apr 2025 05:00:00 -0000

For RCA please check https://status.vapi.ai/incident/528384?mp=true

SIP call failures to connect

Tue, 08 Apr 2025 04:56:00 -0000

#RCA for SIP Degradation for sip.vapi.ai **TLDR;** Vapi sip service (sip.vapi.ai) was intermittently throwing errors and not able to connect to calls. We had some major flaws in our SIP infrastructure which was resolved by rearchitecting the whole thing from scratch. **Impact** - Call to Vapi SIP uri or Vapi phone numbers were failing to connect with 480/487/503 errors - Inbound calls to Vapi getting connected but audio not coming out, eventually causing silence timeouts or customer-did-not-answer - Outbound calls from Vapi numbers or custom SIP trunks were mostly unimpacted due to whole migration but we did add some rate limiting recently which could have caused 429's failing Vapi call creation. - Around 1% calls were failing intermittently with failure rate going up to 10% at times briefly. **Root Cause** - In order to scale out our SIP infrastructure, Vapi moved to a Kubernetes based SIP deployment back in mid January. - SIP networking in kubernetes was complex to get right and we released multiple fixes throughout February and mid March and operated the service on a satisfactory level but with intermittent failures. - Periods of degraded experience during this time were specifically due to networking errors between different components of our SIP infrastructure. Most of the time we were able to resolve issues as they occur by restarting services, releasing patches, blocking malicious traffic, scaling out more, etc. - By mid march we realised that the kubernetes deployment is not going to be stable and started devising a new infrastructure for SIP. We started migration for SIP to a more stable autoscaling group based deployment on 31st March, and continued doing so over the next day or two. - The team monitored the new deployment very closely, and kept releasing patches for every small failure that we saw. - The new deployment has been looking great so far **What went poorly?** - We took a lot of time in deciding to pull the plug on our kubernetes deployment. - Users were impacted intermittently and the SIP reliability was not what we aspire for **Remediations** - SIP infrastructure was revamped to an autoscaling group based deployment which is more stable. - Audit of each error case and apply immediate fixes where needed - Add better monitoring and telemetry across the SIP infrastructure to make sure we catch issues and act on them preemptively.

SIP call failures to connect

Mon, 07 Apr 2025 22:48:00 -0000

SIP infrastructure has been upgraded on our side. So far seeing good performance for it.

Degradation in phone calls stuck in queued state.

Fri, 04 Apr 2025 19:00:00 -0000

Resolved the issue, blocked offending user and reviewed rate limits

Degradation in phone calls stuck in queued state.

Fri, 04 Apr 2025 18:19:00 -0000

We're actively investigating the issue that popped up in the last 15 minutes

Degradation in API

Fri, 04 Apr 2025 16:00:00 -0000

API rollback completed and errors subsided

Degradation in API

Fri, 04 Apr 2025 15:30:00 -0000

API was degraded Friday morning, the team was proactively notified via monitors and started a rollback

Intermittent 503s in api

Thu, 03 Apr 2025 18:00:00 -0000

Improvements shipped reliably fixed the issue. Team has commenced medium-term, and is investigating long-term scalability improvements

Intermittent 503s in api

Thu, 03 Apr 2025 06:11:00 -0000

We have identified the issue, pushed a fix, and are monitoring for improvements.

Intermittent 503s in api

Wed, 02 Apr 2025 21:14:00 -0000

We are investigating increased cases of 503s in our APIs.

Experiencing Anthropic rate limits on model calls

Wed, 02 Apr 2025 03:04:00 -0000

Anthropic rate limiting is resolved after raising quota

Experiencing Anthropic rate limits on model calls

Wed, 02 Apr 2025 02:04:00 -0000

Assistants using Anthropic models with Vapi-provided API keys are intermittently experiencing rate limits. Those using bring-your-own API keys are unaffected

Increased 480 Temporarily Unavailable cases for SIP inbound

Mon, 31 Mar 2025 16:00:00 -0000

Issue should be resolved now, we will be publishing a RCA for it later today.Sorry for the disruption.

Increased 480 Temporarily Unavailable cases for SIP inbound

Mon, 31 Mar 2025 14:37:00 -0000

We have identified the problem and working on a fix.

Increased 480 Temporarily Unavailable cases for SIP inbound

Mon, 31 Mar 2025 13:47:00 -0000

We are seeing increased cases of 480 Temporarily Unavailable cases for SIP inbound and are investigating on priority.

SIP calls failing intermittently

Sun, 30 Mar 2025 15:50:00 -0000

This should be resolved. We will be posting an RCA soon.

SIP calls failing intermittently

Fri, 28 Mar 2025 22:18:00 -0000

We are seeing a degradation in our SIP service and are working towards resolving it on priority.

Some SIP calls have longer reported call duration than reality

Fri, 28 Mar 2025 22:10:00 -0000

Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively.

Upgrade SIP infrastructure

Thu, 27 Mar 2025 04:00:00 -0000

Maintenance completed

Upgrade SIP infrastructure

Thu, 27 Mar 2025 02:00:30 -0000

We are rolling out some major infra changes to our SIP infrastructure that should make it more stable. There should not be any downtime but could be some cases of call drops that rely on SIP during the infrastructure rollout.

API degradation

Tue, 25 Mar 2025 04:33:00 -0000

# TL;DR After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (`/health/ping_sentinel.sh`). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forward. # Timeline 1. March 30th: New IP range and subnets added. 2. March 24th, 3:55 PM: Deployment to backend-production1 initiated. 3. March 24th, 4:14 PM: Deployment completed. - Immediate increase in Redis errors observed in API pods. - API pods scaled dramatically and restarted frequently. - API service degraded with significant timeouts. 4. March 24th, 4:19 PM: Rollback initiated. 5. March 24th, 4:27 PM: Rollback completed; API service fully restored. # Resolution A rollback to the previous stable configuration resolved the immediate API timeout issues. The complete cluster re-creation is underway to permanently resolve underlying DNS resolution failures related to the new IP range before future deployments. # Impact - Approximately 2.67k API requests failed (5xx responses) or timed out. - Impacted areas included logs and database write operations. - Errors included Redis AudioCache failures, API database connection issues, and aborted API requests due to timeouts. # Root Cause The rollout caused a rotation of all cluster nodes due to subnet changes tied to the new IP range. DNS resolution failures associated with this new IP range caused Redis I/O operations to block on TCP connections, resulting in prolonged hanging TCP connections. These hanging connections intermittently caused Redis pods to fail liveness checks, resulting in continuous restarts. API pods, maintaining open connections to Redis, experienced similar blockages, leading to extensive API request timeouts and service degradation. The permanent resolution involves recreating the cluster entirely to address these DNS resolution issues comprehensively. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API degradation

Tue, 25 Mar 2025 04:14:00 -0000

API in degraded state, as identified by our monitors. We're rolling back to previous cluster

Call worker degradation

Mon, 24 Mar 2025 23:45:00 -0000

Issue was mitigated via rollback. We're investigating and will update with an RCA

Call worker degradation

Mon, 24 Mar 2025 23:39:00 -0000

After most recent deploy, we noticed degradation in call initiation API. Changes were immediately rolled back, we are investigating the issue

Cloudflare R2 storage is degraded, causing call recording upload failures

Fri, 21 Mar 2025 22:55:00 -0000

Recording upload errors are recovered. We are continuing to monitor

Cloudflare R2 storage is degraded, causing call recording upload failures

Fri, 21 Mar 2025 22:54:00 -0000

Root issue has been fixed by Cloudflare. We are now monitoring

Cloudflare R2 storage is degraded, causing call recording upload failures

Fri, 21 Mar 2025 22:16:00 -0000

Call recording uploads are failing, due to degradations in Cloudflare R2 (our default storage provider). See https://www.cloudflarestatus.com/

Google Gemini Voicemail Detection is intermittently failing

Wed, 19 Mar 2025 23:05:00 -0000

# TL;DR It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identified itself as a bot. # Timeline in PST - **16th March 2025**: the offending PR is merged. - **18th March 2025, 3:08 PM**: the offending PR is released to production. - **19th March 2025, 8:52 AM**: Vapi Eng bot reports an incident: [https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239](https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239) - **19th March 2025, 9:18 AM**: It is determined that the issue is likely caused by Gemini VMD. - **19th March 2025, 10:04 AM**: Production is rolled back, immediately resolving the issue. - **19th March 2025, 11:00 AM**: Hotfix is committed to production. # Root Cause Several issues were identified: - Google VMD should not have been set as the default option. Any non-essential feature should be disabled by default. - From a dashboard perspective, `"undefined"` should always imply `"off"`. Additionally: - Google VMD produced false positives whenever the bot revealed itself as an AI or otherwise implied it was non-human. Examples: - *"Thank you for calling Jim Adler and Associates! I’m Kendall, an AI assistant. This call may be recorded for quality and training purposes as well as to help direct your information to the right person. I’m here to answer questions or book appointments—how may I assist you?"* - *"Thank you for calling Max Electric! This call is being recorded for quality and training purposes. You are calling outside of our business hours. This is Matthew. Please let me know how I can help!"* This appears to be an edge case identifiable primarily through actual usage. # What went poorly? - A non-essential feature was set as a default option. # What went well? - The issue was taken seriously as soon as it was identified. - The root cause was quickly discovered. # Remediation - Production was rolled back promptly. - A hotfix was implemented to stabilize production (ensuring Google VMD is no longer the default). - A longer-term fix has been developed to mitigate false positives. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Google Gemini Voicemail Detection is intermittently failing

Wed, 19 Mar 2025 19:30:00 -0000

We have released a fix for this issue

Google Gemini Voicemail Detection is intermittently failing

Wed, 19 Mar 2025 18:30:00 -0000

We have identified the root cause and rolled back. We are working on fix.

Google Gemini Voicemail Detection is intermittently failing

Wed, 19 Mar 2025 16:55:00 -0000

Google VMD is intermittently flagging on-going calls as "voicemail" and causing them to end with customer-did-not-answer. We are investigating and will have an update by 12pm PST latest. Users can resolve this by using an alternate VMD provider (Twilio or OpenAI).

Intermittent errors during end calls.

Tue, 18 Mar 2025 23:36:00 -0000

Resolved now. **RCA:** **Timeline (in PT)** 4:10pm New release went out for a small percentage of users. 4:15pm Our monitoring picked up increased errors in ending calls. 4:34pm Release was auto rolled back due to increased errors and incident was resolved. **Impact** Calls to end with unknown-error End of call report was missing **Root cause:** A missing DB migration caused issues in fetching data during end of call. **Remediation:** Add CI check to make sure we don't release code when the dependent DB migration hasn't been run yet.

Intermittent errors during end calls.

Tue, 18 Mar 2025 23:29:00 -0000

We are investigating a increased cases of call drops. Will post updates soon.

sip.vapi.ai degradation

Tue, 18 Mar 2025 04:00:00 -0000

**RCA: SIP 480 Failures (March 13-14)** **Summary** Between March 13-14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment. **Incident Timeline** (All times in PT) **March 13, 2025** 07:00 AM – SIP SBC pod starts showing symptoms of failure to connect to the SIP inbound pod, resulting in intermittent 480 errors. 01:19 PM – A customer reported an increase in 480 SIP errors, prompting escalation to the infrastructure team. 01:30 PM – The infrastructure team took corrective action, and service was restored. **March 14, 2025** 07:30 AM – Similar issue recurred, triggering monitoring alerts. 08:30 AM – The infrastructure team was engaged for remediation as failures persisted. 08:43 AM – The affected SIP SBC pod was deleted, restoring service. 09:43 AM – The issue reappeared, requiring repeated manual intervention. Additional occurrences throughout the day: 11:10 AM – 11:17 AM 12:03 PM – 12:09 PM 01:04 PM – 01:22 PM 02:08 PM – 02:37 PM **Challenges Identified** The failures appear due to broken connection between services, there were no health checks to keep the connections intact. Increased frequency – The number of occurrences was higher than usual, impacting a lot customers. Delayed response on Day 1 – The application remained in a somewhat degraded state for six hours before customer escalation prompted action. **Positive Takeaways** *Effective monitoring* – Alerts triggered as expected, enabling swift identification of the issue. *Improved response time on Day 2* – The team responded more promptly to subsequent incidents. **Remediation Actions Taken** *Enhance alerting mechanisms* – Modified alerts to periodically refire when in an alarm state, ensuring timely on-call responses. *Transition to ASG-based deployment* – Move SIP workloads from Kubernetes to an ASG-based infrastructure for improved stability. *Health check* - Add health check between the 2 services so that the system is able to auto heal incase issue reoccurs.

Vapi workers not connecting due to lack of workers

Tue, 18 Mar 2025 03:56:00 -0000

# TL;DR Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand # Timeline in PST * 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack * 7:30am - The issue is resolved as call workers scaled to meet demand # Root Cause - Call workers did not scale fast enough on the weekly cluster # Impact There were 34 instances of vapifault-transport-never-connected errors, meaning there were 34 calls that failed due to the issue. # What went poorly? - We were unable to detect the issue before customers did # What went well? - The solution was straightforward → Pre-scaling workers on the Weekly Cluster # Remediation - Pre scaling workers on all clusters to prevent vapifault errors - Increase size of worker nodes to aid in scaling, by allowing more call workers to fit per node - Increase sensitivity of pipeline error monitors / Dedicated monitor for vapifault errors If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

SIP call failures to connect

Mon, 17 Mar 2025 21:30:00 -0000

Degarading sip.vapi.ai instead of api.vapi.ai as only sip part is currently impacted.

Increased error in calls

Sat, 15 Mar 2025 19:37:00 -0000

The issue has subsided, we experienced a brief spike in call initiations and didn't scale up fast enough. Immediate term, we're vertically scaling our call worker instances. Near term, we're rolling out our new call worker architecture for rapid scaling

Increased error in calls

Sat, 15 Mar 2025 19:17:00 -0000

Users are experiencing `vapifault-transport-never-connected` errors

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 16:00:00 -0000

Maintenance completed

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 16:00:00 -0000

Maintenance completed

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 16:00:00 -0000

Maintenance completed

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 12:00:00 -0000

Neon is doing scheduled maintenance in our region `us-west-2`: https://neonstatus.com/aws-us-west-oregon/incidents/01JP2WGPKFV2GDV4QSKV8F8NGP. This will require a restart of our endpoint that will result in seconds of downtime. We have marked off the block of time in which this restart will likely happen.

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 12:00:00 -0000

NeonDB Scheduled Maintenance: DB endpoing restart

Sat, 15 Mar 2025 12:00:00 -0000

SIP call failures to connect

Sat, 15 Mar 2025 01:23:00 -0000

SIP service has faced partial degradation multiple times in the last day. Things are looking stable now, but we are keeping the incident open until we rollout a major infra level change which is going to solve it for good. We apologise for this inconvenience and are working with urgency to solve the issue permanently. Here's the timeline of the issue for today (in Pacific Time): 7:30am SBC pod not able to connect to sbc inbound pod resulting in 480. Our monitoring picks it up. 8:30am Infra team is pulled in for remediation as the failures dont stop for a while. 8:43am The faulty SIP sbc pod was deleted and the service was restored. 9:43am The same issue pops up again and a manual action is taken to restore the service everytime. More instances for the same issue pop up multiple time throughout the day. 11:10 - 11:17am 12:03pm - 12:09pm 1:04pm - 1:22pm 2:08pm - 2:37pm

Investigating GET /call/:id timeouts

Sat, 15 Mar 2025 00:00:00 -0000

We are working with impacted customers to investigate but have not seen this issue occurring regularly.

SIP call failures to connect

Fri, 14 Mar 2025 23:36:00 -0000

We have released a temporary fix to the problem and the issue hasn't been reported again in the last 2 hours. We are still working on a more permanent fix for it.

Calls are intermittently ending abruptly

Fri, 14 Mar 2025 23:01:00 -0000

# TL;DR Calls ended abruptly due to call-workers restarting themselves caused by high memory usage (OOMKilled). # Timeline in PST - March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason. - 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit. - 3:29pm: Confirmation received that another customer experienced the same issue. - 4:30pm: Changes implemented to increase memory request and limit on call-workers. - March 14th 12:27pm: Changes deployed. # Root Cause Call-workers exceeded Kubernetes-set memory limits, causing containers to restart unexpectedly and terminate ongoing calls. Since call-workers maintain call state internally, calls could not be recovered, leading to abrupt terminations. # Impact 1705 call-workers exceeded the 2GB memory threshold, causing 1705 abrupt call terminations. # What went poorly? - Issue identified only after user notification. - The fix required a code change rather than immediate manual intervention, delaying remediation. - Release complications delayed quick deployment. - Investigation took 10 hours, and remediation required an additional 3 hours. # What went well? - Effective communication allowed identification and planning of the fix once the issue was understood. # Remediation - Increase memory requests and limits on call-workers. - Implement monitoring for call-worker memory usage exceeding limits. - Implement monitoring for call-worker container restarts. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Calls are intermittently ending abruptly

Fri, 14 Mar 2025 23:01:00 -0000

SIP call failures to connect

Fri, 14 Mar 2025 21:30:00 -0000

sip.vapi.ai is not responding intermittently. We are investigating the failures and will be coming up with a fix soon.

Vapi workers not connecting due to lack of workers

Fri, 14 Mar 2025 20:00:00 -0000

We have investigated and resolved this issue by prescaling the impacted cluster to handle a higher volume of traffic. We will update with an RCA.

Calls are intermittently ending abruptly

Fri, 14 Mar 2025 19:11:00 -0000

We are currently experiencing higher memory usage in our call workers which may be causing calls to end abruptly. Our team is actively investigating and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided by 2pm PST.

Calls are intermittently ending abruptly

Fri, 14 Mar 2025 19:11:00 -0000

Investigating GET /call/:id timeouts

Fri, 14 Mar 2025 18:54:00 -0000

Some users are experiencing timeouts in `GET /call/:id` API endpoint. Our team is actively investigating this and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided shortly.

Vapi workers not connecting due to lack of workers

Fri, 14 Mar 2025 14:30:00 -0000

This issue resolved itself as more workers were created. We are investigating further to provide a more long-term remediation and will update.

Vapi workers not connecting due to lack of workers

Fri, 14 Mar 2025 14:00:00 -0000

Workers did not scale to meet an increase in demand resulting in vapifault-transport-never-connected errors.

sip.vapi.ai degradation

Thu, 13 Mar 2025 23:29:00 -0000

Incident was resolved at 1:30pm PT One of the 2 ips behind sip.vapi.ai was failing to connect to an internal service resulting in 480 error.

sip.vapi.ai degradation

Thu, 13 Mar 2025 23:18:00 -0000

Intermittent "480 temporarily unavailable" errors while connecting calls to sip.vapi.ai. Started happening at 7am PT.

We are seeing degraded service from Deepgram

Tue, 11 Mar 2025 07:59:00 -0000

# TL;DR An application-level bug was leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls. # Timeline in PST * 12:03am - Rollout to prod1 containing the offending change is started * 12:13am - Rollout to prod1 is complete * 12:25am - A huddle in #eng-scale is started * 12:43am - Rollback to prod3 is started * 12:55am - Rollback to prod3 is complete # Root Cause * An application-level bug related to the Deepgram Numerals setting caused WebSocket connections to return a non-101 status code. This was masked as a pipeline-error-deepgram-returning-502-network-error error, initially leading us to believe it was a Deepgram issue. # Impact There were 1.48K pipeline-error-deepgram-returning-502-network-error errors, meaning there were 1.48K calls that failed due to this issue. # What went poorly? * The monitor did not fire early enough to trigger the Canary Manager’s rollback * We did not roll back immediately upon noticing the correlation between the error-count increase and the start of the canary rollout * We were misled by the error name # What went well? * The monitor caught the issue and alerted us shortly after rollout completion * Multiple team members responded promptly, initiating a huddle in #eng-scale # Remediation * Increase sensitivity of pipeline error monitor * Investigate and resolve the application bug * Refactor Deepgram error categorization to clearly indicate non-Deepgram related issues * Refactor Canary Manager to use direct DD metrics instead of relying on monitor alerts If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

We are seeing degraded service from Deepgram

Tue, 11 Mar 2025 07:30:00 -0000

Assistants which use Deepgram for transcription are unresponsive, consider using another transcription model.

Increased call start errors due to Vapi fault transport errors + Twilio timeouts

Tue, 11 Mar 2025 02:18:00 -0000

RCA: vapifault-transport-never-connected errors caused call failures Date: 03/10/2025 Summary: A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts. Timeline: 1. 5:50am A few calls start facing issues in starting due to vapifault-transport-never-connected. 2. 6:40am Call failures start to increase. Partial outage of call starts. Our monitoring picked it up and paged oncall. Some discord users and customers on slack start reporting errors. 3. 6:55am - 7:20am Investigated causes for failures. Shifted the calls to a previous cluster, but calls were still failing. 4. 7:35am We reached a RCA on why the failures were occurring and a fix was scoped out. 5. 7:58am The hotfix was completely deployed and the failures stopped. The incident was resolved at this point. Root Cause: A recent production update increased the memory requirements of our call-processing service. As a result, an internal safeguard—designed to restart processes exceeding a set memory threshold—was activated more frequently than anticipated. Mediation: 1. Threshold Adjustment: We have increased the memory threshold that triggers a process restart to better handle higher usage. 2. Enhanced Monitoring: We are implementing additional alerts to detect similar issues earlier. 3. Process Review: We are further examining our restart protocols to reduce unnecessary service interruptions during periods of high demand.

Increased call start errors due to Vapi fault transport errors + Twilio timeouts

Mon, 10 Mar 2025 15:12:00 -0000

Issue has been patched and we are monitoring the fix. We will be following up with a detailed RCA soon.

Increased call start errors due to Vapi fault transport errors + Twilio timeouts

Mon, 10 Mar 2025 14:09:00 -0000

We are noticing increased occurrences of 31920 error in Twilio calls. Team in investigating and mitigating the issue.

Kubernetes cluster upgrades

Sat, 08 Mar 2025 20:30:38 -0000

Maintenance completed

Kubernetes cluster upgrades

Sat, 08 Mar 2025 19:00:38 -0000

We're rolling out Kubernetes cluster upgrades for security and reliability.

Increased Twilio errors causing 31902 & 31920 websocket connection issues. Increase in customer-did-not-answer for twilio calls

Fri, 07 Mar 2025 22:00:00 -0000

We have rolled back the faulty release which caused this issue. We are monitoring the situation now.

Increased Twilio errors causing 31902 & 31920 websocket connection issues. Increase in customer-did-not-answer for twilio calls

Fri, 07 Mar 2025 21:57:00 -0000

We are investigating the problem.

Vonage inbound calling is degraded

Thu, 06 Mar 2025 22:39:00 -0000

The issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate for the schema.

Signups temporarily unavailable

Thu, 06 Mar 2025 06:00:00 -0000

The API bug was reverted and we confirmed service restoration

Weekly cluster at capacity limits

Wed, 05 Mar 2025 20:04:00 -0000

We are seeing calls go through fine now, and are still keeping an eye out

Weekly cluster at capacity limits

Wed, 05 Mar 2025 19:42:00 -0000

Resolution: we've scaled up and are monitoring

Assembly AI transcriber calls are facing degradation.

Sat, 22 Feb 2025 14:17:00 -0000

It is resolved now. It was due to a account related problem which has been fixed now. We will be taking steps to make sure it doesn't happen again.

Assembly AI transcriber calls are facing degradation.

Sat, 22 Feb 2025 13:41:00 -0000

We're coordinating with assembly AI team to fix the issue on priority. Try switching transcriber meanwhile.

API returning 413 (payload too large) due to networking misconfiguration

Fri, 21 Feb 2025 19:24:00 -0000

# TL;DR A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors. API requests to POST /call, /assistant, and /file were impacted. # Timeline 1. **February 20th 9:54pm PST:** A change to the cluster-router is released and traffic is cut over to prod1. 2. **10:19pm PST:** 413 responses from Cloudflare begin appearing in increased Datadog logs. 3. **February 21st ~8:50am:** Users in Discord flag requests failing with 413 errors. 4. **9:58am PST:** The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside. # Impact - During the time of impact, POST requests to /call, /assistant, and /file failed with a 413 error code. # Root Cause - A change in the cluster-router filter added buffering of POST requests for all endpoints (previously only applied to /status, /inbound, and /inbound_call). - The envoy filter was configured with a stream window size of approximately 65Kb, so request bodies larger than that received a 413 response. # Changes we've made - Monitor to catch 4xx and 5xx errors from Cloudflare. # Changes we will make - Improve change testing for the networking cluster. - Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch. # What went well - The cause was identified quickly by investigating changes in Cloudflare responses. # What went poorly - There was a 12-hour delay between identifying the cause and remediation due to the lack of alerts for this error. - The issue was initially flagged by the Discord community rather than through internal monitoring. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Deepgram is failing to send transcription intermittently

Fri, 21 Feb 2025 08:57:00 -0000

Deepgram has resolved the incident on their side. Back to normal. https://status.deepgram.com/incidents/wr5whbzk45mg

Deepgram is failing to send transcription intermittently

Fri, 21 Feb 2025 07:26:00 -0000

Deepgram has ackowledged the problem and are working to resolve it. More information on https://status.deepgram.com/incidents/wr5whbzk45mg

Deepgram is failing to send transcription intermittently

Fri, 21 Feb 2025 06:28:00 -0000

Transcriptions are failing to generate which cause calls to hang and end earlier than expected.

Elevenlabs rate limiting and high latency

Thu, 20 Feb 2025 17:11:00 -0000

11labs has confirmed that the problem has been fixed. No failures in last 10mins. Resolving incident. Here is the elevenlabs report on the incident https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4

Elevenlabs rate limiting and high latency

Thu, 20 Feb 2025 16:55:00 -0000

11labs is having issues with a latest deployment. We're seeing high latency and rate limits. We have reached out to them and they are fixing it ASAP.

ElevenLabs Rate Limiting

Wed, 19 Feb 2025 19:43:00 -0000

ElevenLabs is imposing rate limits which will have impact on Vapi users who have it configured as their voice model. We are working to resolve this issue, but users can restore service by switching to Cartesia or using their own API key.

API is degraded

Thu, 30 Jan 2025 11:44:00 -0000

## TL;DR The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service. ## Timeline 2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes. 2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing. 3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified. 3:42AM: The database was restarted and traffic was rerouted, leading to improved service health. 3:50AM: The database’s capacity was increased and the service stabilized fully. ## Impact The API experienced multiple intermittent outages. Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state. ## Root Cause Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously. The database ran out of memory—not because of sheer volume alone, but due to a misconfiguration (insufficient max_locks_per_transaction), which was exacerbated by a thundering herd of requests. ## Changes we've made Increase Capacity: Boost the database’s capacity. Adjust Configuration: Raise the max_locks_per_transaction setting. Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment. Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API is degraded

Thu, 30 Jan 2025 11:30:00 -0000

We're suspecting another Supabase DB issue, remediating ASAP.

SIP cluster scaling

Thu, 30 Jan 2025 07:00:00 -0000

Maintenance completed

SIP cluster scaling

Thu, 30 Jan 2025 04:00:00 -0000

We will be retrying our deployment of SIP cluster to make sure we are ready for upcoming scale. There might be some minor disruptions wrt connecting SIP calls, but we will be closely monitoring the situation and complete the migration swiftly.

API is down

Wed, 29 Jan 2025 17:24:00 -0000

## TL;DR A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service. ## Timeline 8:08am PST, Jan 29: Monitoring detects Postgres errors. 8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscription issues, the team wasn’t immediately notified.) 8:18am: The API goes down. 8:22am: Temporary API recovery occurs as some non-pooler-dependent requests succeed. 8:25am: The API fails again; the incident response team assembles. 8:28am: Investigation reveals API pods are repeatedly restarting. 8:30am: It’s determined that database call failures are triggering the pod restarts. 8:36am: Support confirms that a connection pooler outage in the region is affecting service. 8:38am: A call with support leads to the decision to use direct database connections. 8:44am: A change is deployed to bypass the pooler. 9:12am: The API begins to recover as calls start succeeding. 9:19am: Full service is restored. ## Impact The API was down for 54 minutes, with all calls failing due to reliance on the provider’s system for tracking and organization data. While some API requests not dependent on the pooler continued working, new API pods entered crash loops because their health checks (which made database requests) failed. Database operation failures led to call processing hanging, causing errors that prevented proper job closure. ## Root Cause A failed connection pooler deployment disrupted all database connections. This affected API operations that depended on those connections, leading to cascading failures and hanging processes. ## Changes we've made Reduce Deployment Time: Shorten backend update runtimes to under five minutes. Switch to Direct Connections: Use direct database connections exclusively to avoid pooler issues. Increase Connection Capacity: Boost the number of direct connections available to handle higher loads. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API is down

Wed, 29 Jan 2025 17:05:00 -0000

We've rolled out direct connection to database for now. Calls are going through. We're waiting on Supabase to confirm fix to resolve the outage.

API is down

Wed, 29 Jan 2025 16:35:00 -0000

We are impacted by supabase outage. https://status.supabase.com Working with their team to get it working ASAP.

API is down

Wed, 29 Jan 2025 16:28:00 -0000

API is down. We're investigating. Updates to follow.

Updates to DB are failing

Tue, 21 Jan 2025 13:23:00 -0000

## TL;DR A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service. ## Timeline 5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only. 5:05am: Write operations began failing. 5:18am: The API went down due to accumulated errors. ~5:23am: The team initiated a database restart. 5:25am: The database restarted. 5:33am: Service was fully restored. ## Impact Write operations were blocked for 30 minutes. The API experienced a 15-minute outage. ## Root Cause A direct connection from a SQL client, configured in read-only mode, propagated this setting across all sessions through the connection pooler. This disabled updates, inserts, and deletes, eventually leading to API failure. ## Changes we've made Disable Replication Jobs: Halt the replication jobs suspected of triggering the issue. Escalate Support: The support case is escalated to the relevant team with a 24-hour follow-up. Enhance Auditing: Enable and configure detailed audit logging (DDL and role operations) to help trace future incidents. Restrict Direct Access: Eliminate direct production database connections by updating the access credentials. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Updates to DB are failing

Tue, 21 Jan 2025 13:20:00 -0000

We are investigating.

Calls not connecting for `weekly` channel

Mon, 13 Jan 2025 16:49:00 -0000

TL;DR: Scaler failed and we didn't have enough workers ## Root Cause During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at fixed number workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment. ## Timeline Jan 11, 5:12 PM: Deploy started Jan 13, 6:00 AM: Calls started failing due to scaling issues Jan 13, 8:45 AM: Resolved by moving traffic to daily Jan 13, 11:00 AM: Full service restored ## Changes We've Implemented - Load testing on every deploy - Added better monitoring for scaling errors If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Calls not connecting for `weekly` channel

Mon, 13 Jan 2025 16:31:00 -0000

We're investigating. We'll update ASAP.

DB resizing, 5m of downtime expected.

Sat, 23 Nov 2024 20:15:00 -0000

Maintenance completed

DB resizing, 5m of downtime expected.

Sat, 23 Nov 2024 20:00:00 -0000

We need to resize the DB to handle increased load. 5m of downtime is expected.

ElevenLabs is degraded

Thu, 14 Nov 2024 21:08:00 -0000

Should be back to normal now as per 11labs. https://status.elevenlabs.io/

ElevenLabs is degraded

Thu, 14 Nov 2024 21:01:00 -0000

11labs is suffering degradation for high latency on API. We have contacted them and they are looking into it with urgency. You can also directly track the progress at https://status.elevenlabs.io

API is degraded

Tue, 12 Nov 2024 22:15:00 -0000

TL;DR: API pods were choked. Our probes missed it. ## Root Cause Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods. ## Timeline - November 12th 2:00pm PT - Customer reports of API failures - November 12th 2:05pm PT - Oncall team determined cause and scaled and restarted pods - November 12th 2:10pm PT - Full functionality restored. ## Changes we've implemented 1. Restored higher sensitivity thresholds for our monitoring systems 2. Currently investigating underlying database connection management If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API is degraded

Tue, 12 Nov 2024 22:12:00 -0000

Seeing long connection times. Investigating.

Phone calls are degraded

Tue, 12 Nov 2024 01:03:00 -0000

TL;DR: API gateway rejected Websocket requests ## Summary On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters. ## Impact - Duration: 43 minutes - Affected services: WebSocket-based phone calls - System returned 404 errors for affected connections - Service was fully restored by routing traffic to our backup cluster ## Root Cause The incident occurred due to a control plane issue in our API gateway that attempted to reload plugin configurations. Due to an expired authentication token, this reload failed, causing the WebSocket routing system to enter a degraded state. ## Timeline 4:22 PM PST - Initial service degradation began 4:53 PM PST - Issue identified through customer reports 5:05 PM PST - Full service restored by failing over to backup cluster ## Changes we've implemented 1. Fixed the underlying control plane issue that triggered unnecessary plugin reloads 2. Implemented authentication token rotation to prevent credential expiration issues 3. Enhanced monitoring systems to improve detection of WebSocket routing failures If you enjoy realtime distributed systems, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Phone calls are degraded

Tue, 12 Nov 2024 00:58:00 -0000

We're investigating.

API is down

Fri, 08 Nov 2024 02:11:00 -0000

Misconfiguration on networking cluster. Resolved now. Here's what happened: ## Summary On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected. ## Impact - Duration: 11 minutes - Service returned 521 errors for new inbound API calls - Existing API calls remained stable - Service was fully restored at 6:10 PM PT ## Root Cause The incident occurred when a configuration intended for our staging environment was accidentally applied to production during a routine debugging session. This resulted in the deletion of a critical API gateway configuration. ## Timeline - 5:59 PM PT - Accidental deletion of production configuration during staging environment debugging - 6:00 PM PT - Monitoring systems detected service degradation - 6:08 PM PT - Engineering team identified root cause - 6:09 PM PT - Fix deployed (configuration restored) - 6:10 PM PT - Full service recovery confirmed ## Changes we've implemented 1. Changing namespace to include cluster name. `networking` > `networking-staging` and `networking-production`. This forces you to specify the environment while running kubectl commands. 2. Preventing deletion of resources that would never be expected to be deleted using Kubernetes deletion webhook. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

API is down

Fri, 08 Nov 2024 02:09:00 -0000

API is down. We're investigating. Updates to follow.

Cartesia is down, please use another Voice Provider in the meanwhile

Wed, 23 Oct 2024 18:08:00 -0000

Back to normal. You can follow the updates here: https://status.cartesia.ai.

Cartesia is down, please use another Voice Provider in the meanwhile

Wed, 23 Oct 2024 17:35:00 -0000

*We're working on automated fallbacks for this scenario but currently, please switch manually your assistants.* Latest update from the Cartesia team: > We're currently experiencing an outage in our API due to our infrastructure provider Together being down. We'll update you as soon as possible when it's back up. Please check out and subscribe to our status page for future updates: https://status.cartesia.ai/. Latest update from the Together.ai team: > https://status.together.ai

Web calls creation is degraded

Tue, 22 Oct 2024 20:04:00 -0000

We haven't seen an error in last 15 minutes, resolving for now. This will be updated if anything changes.

Web calls creation is degraded

Tue, 22 Oct 2024 20:02:00 -0000

Web call creation is mostly restored. From Daily team: > API error levels have decreased considerably, but we're still working on full remediation. More updates to come.

Web calls creation is degraded

Tue, 22 Oct 2024 19:46:00 -0000

Daily.co team is continuing to investigate. The issue has been tracked down to AWS Aurora DB and they're working with the AWS team.

Web calls creation is degraded

Tue, 22 Oct 2024 18:50:00 -0000

Daily.co is experiencing degradation (status.daily.co). Latest update: > One of our databases is being unexpectedly slow. We started getting alarms about it right about the same time you started seeing problems. We're in the process of posting about it on the status site. We'll share more shortly! We'll share more updates as we have it. For a workaround, it is recommend to create a Phone Number in dashboard.vapi.ai and direct users to call that to reach the Assistants instead.

Deepgram is degraded, please switch to Gladia or Talkscriber

Fri, 18 Oct 2024 15:32:00 -0000

Deepgram was fully restored at 8:32am, ending close to a 2h degradation. Summary: **Deepgram was degraded from ~6:12am PT to ~8:32am PT** (status.deepgram.com). Their main datacenter fell over, they routed traffic to their AWS fallback, but the latencies on their streaming endpoint were still incredibly high (>10s). Ideally, this degradation shouldn't have happened because it's our job to ensure we have fallbacks to mitigate 3rd party risks in real-time. As an **immediate action item**, we're bringing back standby onprem deepgram into our clusters which would have let us cut this degradation to a couple minutes. ------------- **To give more detail**: We could have run Deepgram on-prem before, giving us control over any changes to the transcription model. Unfortunately, we had phased that out couple months ago because we saw better performance from their SaaS service: 1. They run on better GPUs including H100s (and soon H200s). AWS limits the GPUs we can get and scaling is unpredictable. 2. They are continually upgrading their Nvidia inference stack, including proprietary optimizations. 3. They ship continual updates and bug fixes to their SaaS offering compared to monthly updates to onprem. This degradation alongside another from ElevenLabs earlier in the week (status.elevenlabs.io) has made it clear we need to prioritize redundancy further. 1. We need to have a tiered approach to falling back every piece of the stack. 2. We do this well with the assistant.model but assistant.voice and assistant.transcriber need it too. 3. This need will only get more acute with speech to speech models being the single point of failure. 4. We've been cautious with automated fallbacks because of how complex it is to get right (picking up exactly where the failure happened, etc.). But, it's now clear given our positioning as an orchestrator and critical infrastructure, we bear final accountability. Reliability is our #1 priority, and this incident only makes us more committed to prioritizing it above all else.

Deepgram is degraded, please switch to Gladia or Talkscriber

Fri, 18 Oct 2024 15:20:00 -0000

We have gotten an update from Deepgram that their main datacenter (S31) is back up. They expect ~20 more minutes of backlog batch work to transcribe and then things should be back to completely normal.

Deepgram is degraded, please switch to Gladia or Talkscriber

Fri, 18 Oct 2024 15:03:00 -0000

Deepgram is still degraded. We're still waiting on Deepgram for more accurate estimates and information. Meanwhile, we're spinning up a new cluster with onprem Deepgram but it will take ~30m to come up.

Deepgram is degraded, please switch to Gladia or Talkscriber

Fri, 18 Oct 2024 13:31:00 -0000

Deepgram is extremely degraded, https://status.deepgram.com Please switch to Gladia or Talkscriber in the meanwhile. We're spinning up remediations on our side, too.

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:05:00 -0000

Maintenance completed

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:05:00 -0000

Maintenance completed

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:00:00 -0000

We're partitioning our biggest table call. We expect this to be zero downtime but want to be communicative.

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:00:00 -0000

We're partitioning our biggest table call. We expect this to be zero downtime but want to be communicative.

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:00:00 -0000

We're partitioning our biggest table call. We expect this to be zero downtime but want to be communicative.

DB partitioning Saturday afternoon.

Sat, 12 Oct 2024 21:00:00 -0000

We're partitioning our biggest table call. We expect this to be zero downtime but want to be communicative.

API is degraded

Wed, 09 Oct 2024 16:24:00 -0000

We're back. RCA: * At 9:15am PT: We were alerted by a big spike in `request aborted`. * By 9:20am: We identified the root cause was head of line blocking on the API pods (some requests were taking too long, blocking other requests) * By 9:25am: We scaled and restarted the api pods. Everything reverted to normal. Action Items: * We'll be setting a hard query timeout and returning timeout on ones that exceed. Eg. GET /assistant?limit=1000. (statement_timeout) * We'll be making API pods aware of the health of their own DB connection, so it can restart gracefully. * We'll be lowering how long each API pod can hold a DB connection so it can't monopolize time (idle_timeout).

API is degraded

Wed, 09 Oct 2024 16:18:00 -0000

We're investigating.

API is degraded

Wed, 09 Oct 2024 09:27:00 -0000

Everything is back up for now. Here's what happened: * At 2:05am PT: We were alerted of the `cannot execute UPDATE in a read-only transaction` errors by Datadog. * By 2:15am: We determined it was unhealthy pooler state and restarted the DB to force reset all the connections. * By 2:25am: We are back up. We have several hypothesis on how the pooler session state got mangled. We're tracking them down right now. UPDATE: We spent several days going back and forth with Supabase on why our DB was put in read-only mode. They didn't have a concrete answer either, our collective best guess is transaction wraparound.

API is degraded

Wed, 09 Oct 2024 09:11:00 -0000

We're investigating and will have more to share soon. For now, write paths seem to be completely down with the error `cannot execute INSERT in a read-only transaction` and `cannot execute UPDATE in a read-only transaction` while read paths are going through.

API is degraded

Wed, 02 Oct 2024 19:00:00 -0000

# Post-mortem ## TL;DR Human error on our end led us to being index-less on our biggest table `call`s, increasing DB CPU usage to 100%, and causing API request timeouts. This was a tactical mistake from us (the engineering team) in planning out the migration. We're sorry, we seek to do better than this. We've now engaged a Postgres scaling expert who's scaled multiple large-scale real-time systems before to ensure this never happens again. ## Background Timeline 1. Our Postgres DB CPU usage has been steadily increasing due to scaling pressure. Until recently, it had worked to scale the PG resources and add simple indexes but that reached its limits causing the Sept 24th outage. To be specific, while scaling resources lets PG handle increased volume of requests, each request is still slow due to the nature of how fast a CPU can work to move data to RAM. This means each request holds the PG connection for a longer period increasing chances of connection starvation and lock contention. 2. We initiated a project to understand our query bottlenecks and find better patterns to scale from here on—sharding, partitioning, compound indexes and OLAP warehousing for analytics. 3. Through this project, we found that our biggest table is `call`s and as expected, list and aggregation queries on that were consuming majority of CPU time. We sought to add a compound index on `org_id` and `created_at` to speed them up since they followed the structure `SELECT ... FROM call WHERE org_id=X ORDER BY created_at DESC`. 4. We issued `CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx ON call USING BTREE (org_id, created_at DESC)` at Oct 1st 10pm PT through the Supabase SQL editor. 5. Noticing successful creation in the Supabase UI of the index, Oct 2nd morning at 9am, we sought to drop the simple index on (org_id) to nudge PG to use our compound index. (check remediations) 6. At 9am PT, our DB CPU usage spiked to 100% full throttle, causing API request timeouts and thundering herd as Kubernetes tried to restart unhealthy pods. ## Incident Response 1. At 9:05am PT, we didn't understand that the above timeline had caused the degradation and proceeded to investigate after being paged of the degradation. (check remediation) 2. By 9:15am PT, per our incident response playbook, we were on our backup cluster but that didn't help and degradation was getting worse as the bottleneck of requests in the API pods deepened. We moved our investigation to the DB and noticed the spike in CPU usage. 3. By 9:30am, in attempt to reduce CPU usage, we released a change out to disable some of our aggregation queries that were causing most of the load. It became clear that didn't help. 4. By 9:45am, we discovered that in fact step #4 from the timeline actually had failed and the underlying index was `INVALID`. We were index-less on our biggest table `call`s. 5. By 10am, we had rebuilt the index and restored the system. As a precautionary measure, we're keeping analytics queries disabled until we sort our DB scaling fully. ## Remediations and Reflections 1. As clear from timeline #5 and incident response #1, fundamentally, this degradation happened we didn't realize our migration could fail and did fail. This was as in our "unknown unknowns". The solution is to seek out a PG expert who's done these scaling migrations multiple times before and can help us bridge our unknown unknowns through their first-hand knowledge of different failure modes. We're on it and already have couple leads. 2. Secondly, it was a big tactical mistake on our part to run the migration at 9am PT, right before peak time. We felt increasing pressure on the DB that created urgency and clouded proper planning. We're sorry. We're implementing better procedures to analyze the potential impacts of a change and ease of rollback before pushing things out; the kind of type 1 and type 2 decision theory that's common in business strategy. This is being helped by finding experts in different aspects of scaling that we as the engineering org can tap into, similar to remediation #1. 3. Lastly, we take infrastructure reliability deathly seriously and are really sorry about this error on our part. If you or someone you know is obsessed with infrastructure reliability, we'd love to chat. You can find our JD here: https://www.ycombinator.com/companies/vapi/jobs/BnVHTaQ-founding-senior-engineer-infrastructure

API is degraded

Wed, 02 Oct 2024 17:00:00 -0000

The system is back up barring analytics. Post-mortem to follow soon.

API is degraded

Wed, 02 Oct 2024 16:59:00 -0000

We have identified the bottleneck. The system is recovering and we're continuing to monitor.

API is degraded

Wed, 02 Oct 2024 16:41:00 -0000

DB expanded but CPU is still maxed out, continuing to investigate.

API is degraded

Wed, 02 Oct 2024 16:41:00 -0000

DB expanded but CPU is still maxed out, continuing to investigate.

API is degraded

Wed, 02 Oct 2024 16:38:00 -0000

We're expanding DB resources to resolve the CPU spike and bottleneck. Complete downtime for next 2 minutes. Port-mortem to follow soon

API is degraded

Wed, 02 Oct 2024 16:38:00 -0000

We're expanding DB resources to resolve the CPU spike and bottleneck. Complete downtime for next 2 minutes. Port-mortem to follow soon

API is degraded

Wed, 02 Oct 2024 16:15:00 -0000

API is experiencing degraded performance, including starting call timeouts

API is degraded

Tue, 24 Sep 2024 20:48:00 -0000

We have identified the root cause of the issue and deployed a fix. Everything is good now. Here's what happened: 1. Most of our API pods' DB pooler's connections' came to be completely deadlocked. 2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see below on remediation). 3. We immediately scaled up our backup cluster and moved the traffic over. 4. The system (`api.vapi.ai`) was back to full capacity in 13m. 5. With production in clear, we got to the root cause analysis on the abandoned cluster. 6. It's unclear what triggered the deadlock simultaneously on multiple pods but our best guess is something on our DB provider side (Supabase). 7. It's also possible that one of them deadlocked and caused additional load on others which triggered the same deadlock mechanism on others. 8. Our last hypothesis was some client-side library bug (Postgres.js) but unclear why simultaneously would trigger. 9. Either way, we had enough data to build up remediations and prevent another incident of this kind. Remediations: 1. Within our Kubernetes health checks for the API pods, we are adding a dummy query `SELECT now()` to actually check the viability of the connection. 2. This does add risks to API pods becoming completely unresponsive in case of a DB outage but that's okay since DB being down would be clear RCA in that case. 3. With this check in place, Kubernetes will take the bad pods have a non-viable connection out of rotation and restart them preventing that a partial or full outage.

API is degraded

Tue, 24 Sep 2024 20:23:00 -0000

Requests to the API are experiencing higher latency including timeouts for 30-40% of the requests resulting in a partial downtime. This includes requests to start calls. We're investigating ASAP.

Call transfers are degraded

Wed, 14 Aug 2024 13:30:00 -0000

We have identified the root cause of the issue and a fix has been deployed. The cause of the issue was an edge case causing infinite loop on tool.messages. We had a secondary issue that caused delay in resolution. Usually, we're able to move to our backup cluster with last known working state ASAP. But, we had unknowingly hit our AWS account limits so the backup cluster couldn't scale to handle full volume. It took some time to get hold of AWS and get more quota. We're auditing and setting up alerts for our AWS service quotas.

Call transfers are degraded

Wed, 14 Aug 2024 12:30:00 -0000

Call transfers causing call failure, we are investigating

Calls are degraded

Tue, 30 Jul 2024 21:00:00 -0000

We have resolved the issue. The cause of the issue was the default core-dns scaler in EKS didn't scale to according to the workload causing DNS queries within our cluster to start failing and causing requests to hang.

Calls are degraded

Tue, 30 Jul 2024 20:00:00 -0000

We are investigating