Incidents | Vapi Incidents reported on status page for Vapi https://status.vapi.ai/ Weekly cluster call export maintenance https://status.vapi.ai/incident/608457 Wed, 25 Jun 2025 01:00:02 +0000 https://status.vapi.ai/incident/608457#1c40b1e026f9e855364f9e869dc3732a24a06a30bb6da459df9f1ba6ef332ec2 Maintenance completed Anthropic recovered https://status.vapi.ai/ Tue, 24 Jun 2025 15:07:32 +0000 https://status.vapi.ai/#72be81303501858051ffe4187d614d1fdbb5f48eaf85b6d0d6efdc42309e7c6d Anthropic recovered Weekly cluster call export maintenance https://status.vapi.ai/incident/608457 Tue, 24 Jun 2025 15:00:02 -0000 https://status.vapi.ai/incident/608457#0155401031b04db193306c40a08d981b98fd6b12ebabb4c653323b10627ccd3e We’re currently performing maintenance on our analytics database. As a result, call exports from the weekly cluster may return blank CSV files until maintenance is complete. If you run into this issue and need to export data, please temporarily switch your organization’s export setting to daily, then revert back to weekly after exporting. Maintenance will finish by 6 PM PST today. Thank you for your patience. Anthropic went down https://status.vapi.ai/ Tue, 24 Jun 2025 12:58:32 +0000 https://status.vapi.ai/#72be81303501858051ffe4187d614d1fdbb5f48eaf85b6d0d6efdc42309e7c6d Anthropic went down Anthropic recovered https://status.vapi.ai/ Mon, 23 Jun 2025 07:09:15 +0000 https://status.vapi.ai/#37416361f154b8f40298697b87ad060c5a87a46dc24ef691fcfb535acd6235a5 Anthropic recovered Anthropic went down https://status.vapi.ai/ Mon, 23 Jun 2025 06:46:00 +0000 https://status.vapi.ai/#37416361f154b8f40298697b87ad060c5a87a46dc24ef691fcfb535acd6235a5 Anthropic went down Anthropic recovered https://status.vapi.ai/ Sun, 22 Jun 2025 11:11:00 +0000 https://status.vapi.ai/#3ef0b782c0bdfd346b58cafa47f9586ecd07d7912157d1494eac0d9799331fee Anthropic recovered Anthropic went down https://status.vapi.ai/ Sun, 22 Jun 2025 10:35:01 +0000 https://status.vapi.ai/#3ef0b782c0bdfd346b58cafa47f9586ecd07d7912157d1494eac0d9799331fee Anthropic went down Increased database latency causing requests to fail https://status.vapi.ai/incident/606496 Fri, 20 Jun 2025 20:43:00 -0000 https://status.vapi.ai/incident/606496#9d3e93fe165fc1b31e46f2acce3a37a0f856a21f458feca19a3f42f2967cdfeb Our database provider has reported this issue as resolved from their end Increased database latency causing requests to fail https://status.vapi.ai/incident/606496 Fri, 20 Jun 2025 18:04:00 -0000 https://status.vapi.ai/incident/606496#ec8a0b2458266571f8e4abea6ca144407897a7156e03ccd056819dc42dab44d5 We are seeing issues with API requests being timed out or aborted. This is because of an increase in latency from our database provider. We are monitoring the issue: https://neonstatus.com/aws-us-west-oregon.
Breaking changes to Success Evaluation API Response https://status.vapi.ai/incident/605961 Thu, 19 Jun 2025 02:42:00 -0000 https://status.vapi.ai/incident/605961#902fd7c280a0879890c28cad74de8325765fd1abf259330d605890188ca00f8e ## TL;DR In response to reports of hallucinations in the Success Evaluation feature, we updated our integration with the Gemini LLM to use Structured Output. This inadvertently changed the type of the call.analysis.successEvaluation field from string | null to string | number | boolean | null, introducing a breaking change for customers with strict type validation and those using Vapi Server SDKs. ## Timeline (all in PT) - June 12, 11:32pm: Enterprise and Startup users report hallucinations in the Success Evaluation field. Engineer acknowledges reports and begins work on a solution by migrating to Gemini Structured Output. - June 16, 11:35pm: Migration to Structured Output is completed. Update passes automated code tests and is merged into the main branch. - June 17, 1:24pm: Update is released, inadvertently changing the type of the call.analysis.successEvaluation property. - June 18, 11:15am: Enterprise users report a breaking change in the webhook message; investigation begins. - June 18, 1:51pm: Vapi team decides to retain the new type change and communicates to affected users, requesting updates to their servers to accept string | number | boolean | null. - June 18, 3:43pm: Enterprise users report a Go SDK-specific issue; investigation begins. - June 18, 4:08pm: Team identifies broader SDK impact and starts work on a patch to revert the API to string-only output while keeping Structured Output. - June 18, 7:42pm: Patch reverting API output to string-only is released. ## Impact Between June 17th 1:24 pm and June 18th 7:42 pm, organizations in the daily channel using strict type validation on their servers or using Vapi Server SDKs experienced issues when processing post-call analysis events. ## What went wrong? - Automated tests failed to catch the breaking change in the API response. - Poor communication of internal changes to core platform features. - Underestimated the impact, leading to a late rollback (+24hrs). ## What went well? - Organizations in the weekly channel were not affected. - Calls were not affected on any of the channels. - The hallucination issue appears resolved. ## Action Items - Testing: Build comprehensive integration tests to catch response type changes. - Communication: Design better notification and public changelog protocols for potential breaking changes. - Support: Assist affected customers with the requested server updates, and follow up to confirm no further issues and help with any remaining fixes.
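For receiving servers with strict type validation, a small normalization shim can absorb this kind of type widening. The sketch below is illustrative only (it is not Vapi SDK code) and assumes the end-of-call report exposes `call.analysis.successEvaluation` as described in the incident above.

```typescript
// Hypothetical normalizer, not Vapi SDK code: coerce the widened
// string | number | boolean | null field back to the string | null shape
// that older integrations expect.
type SuccessEvaluation = string | number | boolean | null;

function normalizeSuccessEvaluation(value: SuccessEvaluation | undefined): string | null {
  if (value === null || value === undefined) return null;
  // Numbers and booleans are stringified so downstream validation keeps passing.
  return typeof value === "string" ? value : String(value);
}

// Example with a hypothetical end-of-call-report payload:
const report = { analysis: { successEvaluation: true as SuccessEvaluation } };
console.log(normalizeSuccessEvaluation(report.analysis.successEvaluation)); // "true"
```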
OpenAI recovered https://status.vapi.ai/ Wed, 18 Jun 2025 21:32:26 +0000 https://status.vapi.ai/#47d080943f76151cbbf67a491ec1874a0b9054c0cc383f506147c8f0631ed995 OpenAI recovered OpenAI went down https://status.vapi.ai/ Wed, 18 Jun 2025 21:13:29 +0000 https://status.vapi.ai/#47d080943f76151cbbf67a491ec1874a0b9054c0cc383f506147c8f0631ed995 OpenAI went down OpenAI recovered https://status.vapi.ai/ Wed, 18 Jun 2025 21:13:01 +0000 https://status.vapi.ai/#8dce3bc79458eac980a1089fbe32bf51c248a7236f6045094511f559db2a3b01 OpenAI recovered OpenAI went down https://status.vapi.ai/ Wed, 18 Jun 2025 20:04:26 +0000 https://status.vapi.ai/#8dce3bc79458eac980a1089fbe32bf51c248a7236f6045094511f559db2a3b01 OpenAI went down Anthropic recovered https://status.vapi.ai/ Wed, 18 Jun 2025 16:43:26 +0000 https://status.vapi.ai/#3f1d46a203fce6f851b4f896a4e275485fd5a7e5eefc571b5bea0676980d58cd Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 18 Jun 2025 15:47:25 +0000 https://status.vapi.ai/#3f1d46a203fce6f851b4f896a4e275485fd5a7e5eefc571b5bea0676980d58cd Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 18 Jun 2025 15:41:25 +0000 https://status.vapi.ai/#248bc559b7892987937e81ddcc7d8913b2be1e7b3233c18e82aa9f1ea7a5c0e4 Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 18 Jun 2025 14:32:26 +0000 https://status.vapi.ai/#248bc559b7892987937e81ddcc7d8913b2be1e7b3233c18e82aa9f1ea7a5c0e4 Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 18 Jun 2025 10:18:12 +0000 https://status.vapi.ai/#54b4ac02a5baeae0343b20272ac0ad934ce6192927aba81bb693ca4e299032ad Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 18 Jun 2025 08:36:08 +0000 https://status.vapi.ai/#54b4ac02a5baeae0343b20272ac0ad934ce6192927aba81bb693ca4e299032ad Anthropic went down Breaking changes to Success Evaluation API Response https://status.vapi.ai/incident/605961 Tue, 17 Jun 2025 20:24:00 -0000 https://status.vapi.ai/incident/605961#0cf8f4bd41ab722d768ca947f46b81a978283e47596ce1f8532cb8dd1d608118 Organizations in the daily channel report a breaking change in the end-of-call report. The `call.analysis.successEvaluation` property was migrated from `string | null` to `string | number | boolean | null`. Organizations in the weekly channel are not affected. Anthropic recovered https://status.vapi.ai/ Tue, 17 Jun 2025 16:15:49 +0000 https://status.vapi.ai/#fc66dffcc715b138caa8e7e480c2798f849802574efb014f601ad59116f4c672 Anthropic recovered Anthropic went down https://status.vapi.ai/ Tue, 17 Jun 2025 13:56:39 +0000 https://status.vapi.ai/#fc66dffcc715b138caa8e7e480c2798f849802574efb014f601ad59116f4c672 Anthropic went down Anthropic recovered https://status.vapi.ai/ Tue, 17 Jun 2025 08:56:04 +0000 https://status.vapi.ai/#9987af7504c97fc062ac25c3d663d52a7ec408c0b9e3ccf6b13af6cdb8368990 Anthropic recovered Anthropic went down https://status.vapi.ai/ Tue, 17 Jun 2025 08:13:59 +0000 https://status.vapi.ai/#9987af7504c97fc062ac25c3d663d52a7ec408c0b9e3ccf6b13af6cdb8368990 Anthropic went down Sign-ups/Sign-ins are not working https://status.vapi.ai/incident/601786 Fri, 13 Jun 2025 08:12:00 -0000 https://status.vapi.ai/incident/601786#3fdeab6f07ec18dcf7895376149f38548377c01d1c9dcd38235b06ea2829b565 It is resolved.
Anthropic recovered https://status.vapi.ai/ Thu, 12 Jun 2025 20:44:59 +0000 https://status.vapi.ai/#bcff332bf0b760c3b2e868f737138ecfd522a403341e876bba02855b9960a017 Anthropic recovered Vapi DB recovered https://status.vapi.ai/ Thu, 12 Jun 2025 20:03:39 +0000 https://status.vapi.ai/#70efece8a90b28456c6d12d14db0212b31af41c5149f2ae4e647f0575d6093d3 Vapi DB recovered Vapi DB went down https://status.vapi.ai/ Thu, 12 Jun 2025 19:54:10 +0000 https://status.vapi.ai/#70efece8a90b28456c6d12d14db0212b31af41c5149f2ae4e647f0575d6093d3 Vapi DB went down Vapi DB recovered https://status.vapi.ai/ Thu, 12 Jun 2025 19:52:45 +0000 https://status.vapi.ai/#0e2ca76a127d186496a6cfd75b1a4ead111473847511bb3aac6fb711841132fb Vapi DB recovered Sign-ups/Sign-ins are not working https://status.vapi.ai/incident/601786 Thu, 12 Jun 2025 19:47:00 -0000 https://status.vapi.ai/incident/601786#1d558e16ae8b8da9fe82dbe5bd39bc5f73dca972711643687cb39bed9e8bc615 Supabase and its upstream provider Cloudflare are reporting that services are recovering. Similarly, we are seeing sign-ups and sign-ins working again, though there may be intermittent disruption to the service. We are continuing to monitor and observe our upstream providers' status pages for changes. https://status.supabase.com/ https://www.cloudflarestatus.com/ Anthropic went down https://status.vapi.ai/ Thu, 12 Jun 2025 18:23:49 +0000 https://status.vapi.ai/#bcff332bf0b760c3b2e868f737138ecfd522a403341e876bba02855b9960a017 Anthropic went down Sign-ups/Sign-ins are not working https://status.vapi.ai/incident/601786 Thu, 12 Jun 2025 18:19:00 -0000 https://status.vapi.ai/incident/601786#6b5f1c0c536c59152d21e251f36f74c53825372adc789cdac04068a422a3c1f4 We use Supabase for authentication, which is having an issue due to a Cloudflare outage. Our authentication endpoint is down, impacting auth flows for sign-ups and sign-ins. We are investigating. Phone calls are still working and our API is accessible. WebRTC (daily.co) calls will fail. Vapi DB went down https://status.vapi.ai/ Thu, 12 Jun 2025 18:13:32 +0000 https://status.vapi.ai/#0e2ca76a127d186496a6cfd75b1a4ead111473847511bb3aac6fb711841132fb Vapi DB went down Elevenlabs voice provider not working with custom API Key https://status.vapi.ai/incident/599433 Thu, 12 Jun 2025 10:20:00 -0000 https://status.vapi.ai/incident/599433#a8d078e54f95a6ba18743e75334b87d01d23673a6f12ec02ec2dcb465ecc75f7 Summary: We experienced an issue related to API key validation within our WebSockets implementation when sending the API key more than once. Details: The issue arose during API key validation within our WebSockets implementation. Our system validates that the API key provided during the initial message is the same as in subsequent messages. A recent change introduced during a release caused a mismatch in how API keys were compared. Specifically, the system was comparing hashed API keys against non-hashed API keys. This comparison would always fail, as hashed and non-hashed keys are inherently different. The impacted API keys were legacy API keys, which were not being hashed. Timeline (GMT +2): Release Ready: 9:23 AM Full Deployment: 9:52 AM Reported by Vapi: 12:28 PM Rollback Initiated: 12:53 PM Impact: This issue impacted a small number of clients using non-legacy API keys who also provided the API key multiple times during the WebSocket connection. Specifically, if the API key was provided during the initial connection and then again in subsequent messages, our system performs a validation check.
Due to a flawed comparison between hashed and non-hashed API keys, this validation check failed for those clients sending API keys multiple times, resulting in the error you saw. Resolution: - The engineering team has implemented a fix to ensure API keys are compared correctly, regardless of whether they are hashed or non-hashed. The fix has been deployed. Preventative Measures: - To prevent similar issues in the future, the following steps are being taken: We already had tests for this scenario, but they failed to catch the issue because of a race condition in the tests themselves. That race condition has since been fixed. - We’ve also made sure the tests now block merges. Anthropic recovered https://status.vapi.ai/ Mon, 09 Jun 2025 18:46:54 +0000 https://status.vapi.ai/#6580b3f095050be43ae8fb6c8ee11b5133d64c781781464297dfb1ee5b24b0bb Anthropic recovered Anthropic went down https://status.vapi.ai/ Mon, 09 Jun 2025 17:24:56 +0000 https://status.vapi.ai/#6580b3f095050be43ae8fb6c8ee11b5133d64c781781464297dfb1ee5b24b0bb Anthropic went down Elevenlabs voice provider not working with custom API Key https://status.vapi.ai/incident/599433 Mon, 09 Jun 2025 11:02:00 -0000 https://status.vapi.ai/incident/599433#16e20764077b3e05dbb58f575296ad311bd6c4456b2e383fe41cf16d0bef2ea0 Services are back up now. Elevenlabs rolled back a change and errors have subsided, so we are resolving this incident. We will keep monitoring the situation for some time. Elevenlabs voice provider not working with custom API Key https://status.vapi.ai/incident/599433 Mon, 09 Jun 2025 10:39:00 -0000 https://status.vapi.ai/incident/599433#6d8f4265ddf901b107aa673159060b31dfbe1896a44a1172027c862802b71f9f We are working with the 11labs team to resolve an issue where 11labs voices are not working when users bring their own key on Vapi.
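As an illustration of the comparison bug described in the WebSocket API key incident above, here is a minimal sketch (not Vapi's actual implementation) of the broken and corrected checks: the fix is to hash the presented key before comparing, so both sides are in the same representation.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash an API key the same way the stored copy is hashed (SHA-256 is assumed here).
const hashKey = (key: string): Buffer => createHash("sha256").update(key).digest();

// Broken variant: compares a stored hash against the raw key, so it always fails.
function keysMatchBroken(storedHash: Buffer, presentedKey: string): boolean {
  return storedHash.equals(Buffer.from(presentedKey));
}

// Fixed variant: hash the presented key first, then compare in constant time.
function keysMatch(storedHash: Buffer, presentedKey: string): boolean {
  const presentedHash = hashKey(presentedKey);
  return storedHash.length === presentedHash.length && timingSafeEqual(storedHash, presentedHash);
}
```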
Anthropic recovered https://status.vapi.ai/ Sun, 08 Jun 2025 23:51:39 +0000 https://status.vapi.ai/#e48a774c9084740fa0b390ef295d3e3b1e8e70fe096b22d24f0b614e23f4b64c Anthropic recovered Anthropic went down https://status.vapi.ai/ Sun, 08 Jun 2025 22:57:37 +0000 https://status.vapi.ai/#e48a774c9084740fa0b390ef295d3e3b1e8e70fe096b22d24f0b614e23f4b64c Anthropic went down Anthropic recovered https://status.vapi.ai/ Sat, 07 Jun 2025 18:59:39 +0000 https://status.vapi.ai/#5aaf81a67bb0a896905d2058467fd1879778bcce91a7e0e6f2e318568288d6a7 Anthropic recovered Anthropic went down https://status.vapi.ai/ Sat, 07 Jun 2025 18:50:38 +0000 https://status.vapi.ai/#5aaf81a67bb0a896905d2058467fd1879778bcce91a7e0e6f2e318568288d6a7 Anthropic went down Anthropic recovered https://status.vapi.ai/ Sat, 07 Jun 2025 00:18:29 +0000 https://status.vapi.ai/#728a91a05e8612e593013ce830f2a421461f3578a8d4d849e5aaeb05bb518d3a Anthropic recovered Anthropic went down https://status.vapi.ai/ Fri, 06 Jun 2025 22:36:28 +0000 https://status.vapi.ai/#728a91a05e8612e593013ce830f2a421461f3578a8d4d849e5aaeb05bb518d3a Anthropic went down Anthropic recovered https://status.vapi.ai/ Fri, 06 Jun 2025 22:36:10 +0000 https://status.vapi.ai/#b4c3a29be88ead4f83c0fcf83e8dd0ea52b5f1156f8f66d0b0ec566b62a282f1 Anthropic recovered Anthropic went down https://status.vapi.ai/ Fri, 06 Jun 2025 20:52:28 +0000 https://status.vapi.ai/#b4c3a29be88ead4f83c0fcf83e8dd0ea52b5f1156f8f66d0b0ec566b62a282f1 Anthropic went down Anthropic recovered https://status.vapi.ai/ Fri, 06 Jun 2025 20:52:13 +0000 https://status.vapi.ai/#80c275d04d3b42e86ed2372cf4593b75eb815d5ea7c005a5351461888be888fc Anthropic recovered Anthropic went down https://status.vapi.ai/ Fri, 06 Jun 2025 01:47:17 +0000 https://status.vapi.ai/#80c275d04d3b42e86ed2372cf4593b75eb815d5ea7c005a5351461888be888fc Anthropic went down Anthropic recovered https://status.vapi.ai/ Fri, 06 Jun 2025 01:46:42 +0000 https://status.vapi.ai/#e193091154199631e37e5ee91cdb40342faf217d45b8d7bfa2fd5605f9e6dee5 Anthropic recovered Anthropic went down https://status.vapi.ai/ Thu, 05 Jun 2025 20:54:18 +0000 https://status.vapi.ai/#e193091154199631e37e5ee91cdb40342faf217d45b8d7bfa2fd5605f9e6dee5 Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 04 Jun 2025 18:52:03 +0000 https://status.vapi.ai/#81a1a63692abb0a5fb150844d5354d662f77cfdbce050b872df7eeaee0810d1f Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 04 Jun 2025 18:23:01 +0000 https://status.vapi.ai/#81a1a63692abb0a5fb150844d5354d662f77cfdbce050b872df7eeaee0810d1f Anthropic went down Weekly cluster maintenance https://status.vapi.ai/incident/596392 Wed, 04 Jun 2025 02:28:33 -0000 https://status.vapi.ai/incident/596392#0c66983c18defde7bd44b75d04e02ceabec1ec38246a87f2eb8596085f530960 Weekly cluster is undergoing additional maintenance Weekly cluster maintenance https://status.vapi.ai/incident/596392 Tue, 03 Jun 2025 17:00:28 +0000 https://status.vapi.ai/incident/596392#11bd25bfc40ddebff0c5c2e2754f02dfa9f4360de62f87d72ad510a91761977b Maintenance completed Weekly cluster maintenance https://status.vapi.ai/incident/596392 Tue, 03 Jun 2025 17:00:28 -0000 https://status.vapi.ai/incident/596392#0c66983c18defde7bd44b75d04e02ceabec1ec38246a87f2eb8596085f530960 Weekly cluster is undergoing additional maintenance Weekly cluster maintenance https://status.vapi.ai/incident/595722 Tue, 03 Jun 2025 08:00:25 +0000 https://status.vapi.ai/incident/595722#3527b5520d80abb8e9a1b5a6f7fa41a3edb31a7c1f07a4505ffbc263f3c08306 Maintenance completed 
Anthropic recovered https://status.vapi.ai/ Mon, 02 Jun 2025 19:30:59 +0000 https://status.vapi.ai/#1e45904fb697e8a200c4f7a2d1980e8e635c666ec530c0724c816805d3145fa3 Anthropic recovered Anthropic went down https://status.vapi.ai/ Mon, 02 Jun 2025 19:12:59 +0000 https://status.vapi.ai/#1e45904fb697e8a200c4f7a2d1980e8e635c666ec530c0724c816805d3145fa3 Anthropic went down Weekly cluster maintenance https://status.vapi.ai/incident/595722 Mon, 02 Jun 2025 18:00:25 -0000 https://status.vapi.ai/incident/595722#6520f342d0eba9ca95157cc8585b66e698d1f702a42d61ec428eebe6a148925e Weekly cluster is under additional monitoring and maintenance after update. We should have things resolved by tonight Vapi API [Weekly] recovered https://status.vapi.ai/ Mon, 02 Jun 2025 08:10:16 +0000 https://status.vapi.ai/#d9f8245acfd340b49e794580a46fb8faef4f742b3c077b290bdfcbc874f2359c Vapi API [Weekly] recovered API was down https://status.vapi.ai/incident/595644 Mon, 02 Jun 2025 06:00:00 -0000 https://status.vapi.ai/incident/595644#cce7580f1b27e8911f01139b86aa8de213bb5d7bbe018a87df6836f1aff8c543 API was down due to user error in routine maintenance. Service has since been restored API was down https://status.vapi.ai/incident/595644 Mon, 02 Jun 2025 05:45:00 -0000 https://status.vapi.ai/incident/595644#9b7c759b9efcbd15621371f683b6282cae5980ce58c57159ef2b84a52cf1ec2c API was down due to user error in routine maintenance. Service has since been restored
Vapi SIP recovered https://status.vapi.ai/ Mon, 02 Jun 2025 05:38:02 +0000 https://status.vapi.ai/#cb53031a83951d6953c5960c1d06cbcc0cdb9d87346dea3a2c7c71bdf273009f Vapi SIP recovered Vapi API recovered https://status.vapi.ai/ Mon, 02 Jun 2025 05:37:47 +0000 https://status.vapi.ai/#dda03839d959a7c2f21f5daaf95d4db364497fdc8346fec9b42798b73b8e65a1 Vapi API recovered Vapi SIP went down https://status.vapi.ai/ Mon, 02 Jun 2025 05:27:59 +0000 https://status.vapi.ai/#cb53031a83951d6953c5960c1d06cbcc0cdb9d87346dea3a2c7c71bdf273009f Vapi SIP went down Vapi API went down https://status.vapi.ai/ Mon, 02 Jun 2025 05:27:38 +0000 https://status.vapi.ai/#dda03839d959a7c2f21f5daaf95d4db364497fdc8346fec9b42798b73b8e65a1 Vapi API went down Vapi API [Weekly] went down https://status.vapi.ai/ Mon, 02 Jun 2025 05:09:38 +0000 https://status.vapi.ai/#d9f8245acfd340b49e794580a46fb8faef4f742b3c077b290bdfcbc874f2359c Vapi API [Weekly] went down Anthropic recovered https://status.vapi.ai/ Thu, 29 May 2025 14:18:54 +0000 https://status.vapi.ai/#d4e1d4bf79c655813fe214d6b2a791f3faf19cd7a6f9ff0a65db67e975424b11 Anthropic recovered Anthropic went down https://status.vapi.ai/ Thu, 29 May 2025 14:05:55 +0000 https://status.vapi.ai/#d4e1d4bf79c655813fe214d6b2a791f3faf19cd7a6f9ff0a65db67e975424b11 Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 28 May 2025 00:29:51 +0000 https://status.vapi.ai/#cbad7892da18432e8acd67c1409b04afe31947edc3e76cd7bae7692783ebeb7a Anthropic recovered Anthropic went down https://status.vapi.ai/ Tue, 27 May 2025 22:25:50 +0000 https://status.vapi.ai/#cbad7892da18432e8acd67c1409b04afe31947edc3e76cd7bae7692783ebeb7a Anthropic went down Users Unable to Sign In to Dashboard https://status.vapi.ai/incident/580899 Tue, 27 May 2025 01:37:00 -0000 https://status.vapi.ai/incident/580899#c680ad111a4ef05524bc8aa1d804a543c23e0ea73e1e9826b192335b2cc5e725 Summary Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment. Timeline (in PST): * 3:17 PM: Internal engineers identified issues affecting developer workflows. * 4:19 PM: Breaking change is introduced and unintentionally deployed to production * 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally. * 4:43 PM: Public status page updated to notify customers. * 4:54 PM: Corrective actions deployed. * 5:08 PM: Additional steps taken to accelerate resolution for users. * 5:17 PM: Issue fully resolved and status page updated accordingly. Impact: * Users were temporarily unable to log into the dashboard. * The issue was promptly reported and escalated by affected users. Root Cause: A configuration change intended to streamline internal development processes unintentionally led to the deployment of a staging version of our dashboard to the production environment. This occurred because the system did not adequately distinguish between environments in the deployment workflows, resulting in incorrect settings being applied in production. What Went Well: * Internal escalation was rapid, and the status page effectively informed users quickly. What Went Poorly: * Limited tooling for rapid rollbacks led to extended resolution time. * Insufficient clarity around deployment workflows contributed to the incident. Corrective Actions Taken: * Immediately reverted the unintended deployment and restored the correct production configuration. * Purged caches to expedite the resolution.
Future Preventative Measures: * Enhance deployment configuration to clearly separate staging and production environments. * Improve tools and processes for more rapid rollback capabilities in future deployments. Anthropic recovered https://status.vapi.ai/ Tue, 27 May 2025 00:37:49 +0000 https://status.vapi.ai/#632f0c6272811134b2864c1a15a14bd7aafdd706639e6e232733c56bdc1632f1 Anthropic recovered Users Unable to Sign In to Dashboard https://status.vapi.ai/incident/580899 Tue, 27 May 2025 00:08:00 -0000 https://status.vapi.ai/incident/580899#ce7553fabc85f311093b2c8a74f3cd6e87bfe37c6c8641cf1bf5090268b80e14 The sign-in issue has been resolved, and a fix has been successfully deployed. Users should now be able to access the dashboard as expected. We are currently preparing an RCA and will share it soon. Users Unable to Sign In to Dashboard https://status.vapi.ai/incident/580899 Mon, 26 May 2025 23:40:00 -0000 https://status.vapi.ai/incident/580899#a3a9921c769f12d11d2be11b6bd74c4a4107ecb8b183f8352c531107f53ebcd8 We are currently investigating an issue preventing some users from signing in to the dashboard. The team is actively working on a fix. We will provide updates as progress is made. Thank you for your patience. Anthropic went down https://status.vapi.ai/ Mon, 26 May 2025 18:28:34 +0000 https://status.vapi.ai/#632f0c6272811134b2864c1a15a14bd7aafdd706639e6e232733c56bdc1632f1 Anthropic went down Anthropic recovered https://status.vapi.ai/ Mon, 26 May 2025 15:15:23 +0000 https://status.vapi.ai/#8d929b2063af78a1bbba697324a99d6cf3a1616fba38e8e7ab4504ec75cdfc4e Anthropic recovered Anthropic went down https://status.vapi.ai/ Mon, 26 May 2025 14:45:21 +0000 https://status.vapi.ai/#8d929b2063af78a1bbba697324a99d6cf3a1616fba38e8e7ab4504ec75cdfc4e Anthropic went down Vapi API [Weekly] recovered https://status.vapi.ai/ Sun, 25 May 2025 04:08:41 +0000 https://status.vapi.ai/#40c0880e03f22e48596e375a0c64c7cda41f7b9027bab667a7fcbdc6c41d5986 Vapi API [Weekly] recovered Vapi API [Weekly] went down https://status.vapi.ai/ Sun, 25 May 2025 03:56:40 +0000 https://status.vapi.ai/#40c0880e03f22e48596e375a0c64c7cda41f7b9027bab667a7fcbdc6c41d5986 Vapi API [Weekly] went down Anthropic recovered https://status.vapi.ai/ Fri, 23 May 2025 10:32:23 +0000 https://status.vapi.ai/#e946f056c73bc360625a035fffdfaf78bd982f2333a3e23133d78aee11d59ae3 Anthropic recovered Anthropic went down https://status.vapi.ai/ Fri, 23 May 2025 08:27:17 +0000 https://status.vapi.ai/#e946f056c73bc360625a035fffdfaf78bd982f2333a3e23133d78aee11d59ae3 Anthropic went down Anthropic recovered https://status.vapi.ai/ Thu, 22 May 2025 21:58:05 +0000 https://status.vapi.ai/#8e7a668c9a45070307311a3fe3dbc1dae047755972a23a06361f65fb9474cc0d Anthropic recovered Anthropic went down https://status.vapi.ai/ Thu, 22 May 2025 19:43:05 +0000 https://status.vapi.ai/#8e7a668c9a45070307311a3fe3dbc1dae047755972a23a06361f65fb9474cc0d Anthropic went down Vapi API [Weekly] recovered https://status.vapi.ai/ Thu, 22 May 2025 05:27:49 +0000 https://status.vapi.ai/#cefaad09c9ca5691cba7afe42b375b8e98c5c5b9a8edc2abf7c425ddeae4b0a5 Vapi API [Weekly] recovered Vapi API [Weekly] went down https://status.vapi.ai/ Thu, 22 May 2025 05:20:20 +0000 https://status.vapi.ai/#cefaad09c9ca5691cba7afe42b375b8e98c5c5b9a8edc2abf7c425ddeae4b0a5 Vapi API [Weekly] went down Vapi API [Weekly] recovered https://status.vapi.ai/ Thu, 22 May 2025 04:50:22 +0000 https://status.vapi.ai/#ace2fa63e3a686baf75aba86e9611e41ed1c162449fb1cba0940ef8712aec486 Vapi API [Weekly] recovered Vapi API [Weekly] went
down https://status.vapi.ai/ Thu, 22 May 2025 04:43:20 +0000 https://status.vapi.ai/#ace2fa63e3a686baf75aba86e9611e41ed1c162449fb1cba0940ef8712aec486 Vapi API [Weekly] went down Anthropic recovered https://status.vapi.ai/ Wed, 21 May 2025 23:41:39 +0000 https://status.vapi.ai/#49cc485dadf3e7d4ceaa55cb8c7ae8a9816b6d7a8ebaec2b13b655e56364e401 Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 21 May 2025 23:31:41 +0000 https://status.vapi.ai/#49cc485dadf3e7d4ceaa55cb8c7ae8a9816b6d7a8ebaec2b13b655e56364e401 Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 21 May 2025 23:25:40 +0000 https://status.vapi.ai/#31f97fd9843d8e6e37ff8f6ac8cea5f4d9f51b947f6dee728fec8ac64af70627 Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 21 May 2025 21:21:38 +0000 https://status.vapi.ai/#31f97fd9843d8e6e37ff8f6ac8cea5f4d9f51b947f6dee728fec8ac64af70627 Anthropic went down Anthropic recovered https://status.vapi.ai/ Wed, 21 May 2025 11:10:54 +0000 https://status.vapi.ai/#6f51d97e2e5814f68f31a3418620cbfcafdd6117d280783e4aaf4d78b16587ef Anthropic recovered Anthropic went down https://status.vapi.ai/ Wed, 21 May 2025 10:24:41 +0000 https://status.vapi.ai/#6f51d97e2e5814f68f31a3418620cbfcafdd6117d280783e4aaf4d78b16587ef Anthropic went down Cartesia voices are degraded https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:56:00 -0000 https://status.vapi.ai/incident/570316#05fce5af4df67beb04bf689e367290403a8142a8c4da21b2ab21d49120852ef5 Everything is functional. We're still working with Cartesia to get to the bottom of it. We'll change back to degraded if the issue arises again during the investigation. Cartesia voices are degraded https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:17:00 -0000 https://status.vapi.ai/incident/570316#5748e530714a6652fc52fa18c718a4495f51ffa47e12a61f5429197ac74f403c It's all working now, as the Cartesia team has bumped our limits. We're still investigating the issue. Cartesia voices are degraded https://status.vapi.ai/incident/570316 Sun, 18 May 2025 20:11:00 -0000 https://status.vapi.ai/incident/570316#01a094d7cbdab264fed44790a3ba06d4b88f31988015f27dd3812fd041b93118 We're investigating an internal bug causing 429s on Cartesia. Vapifault Worker Timeouts https://status.vapi.ai/incident/564575 Tue, 13 May 2025 17:31:00 -0000 https://status.vapi.ai/incident/564575#d434e578a52a15ba9babd8fb6675778aebc13fd11d8536d32de75c3dc074b0be # RCA: Vapifault Worker Timeouts ## TL;DR On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue reoccurred due to the prescaling limit being inadvertently reset to the lower default value during a routine deployment. ## Timeline (PT) - **May 12, 1:30 pm:** Customer reports issues related to worker timeouts. - **May 12, 4:39 pm:** Another customer reports the same issue with worker timeouts. - **May 12, 5:19 pm:** Workers scaled manually from 250 to 350; service restored. - **May 12, 11:48 pm:** Routine deployment resets worker prescale count back to 250. - **May 13, 10:47 am:** Customer reports recurrence of worker timeout issue. - A concurrent increase in overall call volume further strains worker availability.
- **May 13, 11:29 am:** Workers scaled again to 350 on weekly and increased to 750 on daily; service fully restored. ## Impact - Approximately **2,461 calls** dropped due to worker connection timeouts. ## What Went Wrong? - **Insufficient Monitoring:** Worker timeout events were not correctly captured by monitoring because of how `callEndedReason` is logged. - Customers identified and reported the issue before internal monitoring did. - **Configuration Drift:** Prescale worker count change was not committed to the main configuration branch, causing resets during routine deployments. - **Alert Handling:** Lambda invocation alerts fired but were deprioritized as "requires investigation but not urgent." ## What Went Well? - Rapid remediation once the problem was identified. providerfault-transport errors https://status.vapi.ai/incident/564574 Tue, 13 May 2025 17:29:00 -0000 https://status.vapi.ai/incident/564574#0c3ef35708717e3d3ea3c164bfce5ff757c227deb35d509c9db86e520fa36ccb # RCA: Providerfault-transport-never-connected ## Summary During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume. ## Timeline of Events (PT) - **10:26 AM:** Significant spike in inbound call volume.
- **10:26 – 10:40 AM:** Intermittent HTTP 520 errors returned by CDN for inbound call endpoints (46 calls impacted). - **11:00 AM – 12:00 PM:** Infrastructure intermittently failed to establish transport connections despite successfully picking up calls (172 calls impacted). - **12:00 PM:** Call volume returns to normal; errors cease. ## Root Cause Analysis ### 1. HTTP 520 Errors at CDN - High load triggered intermittent HTTP 520 errors for critical endpoints. - Internal tracing confirmed successful API responses not properly relayed back, indicating issues in network layers external to core services. - Active investigation ongoing with network provider to identify the underlying cause. ### 2. Resource Exhaustion on Proxy Service - During peak load, the proxy service responsible for handling call connections exhausted available CPU and memory resources (observed usage ~1.27 CPU cores and 1.2 GB RAM). - Insufficient resource allocation led to failed transport connections. - Logs showed degraded pod performance, including failures in auxiliary tasks like recording uploads. ## What Went Wrong? - **Misclassification of Errors:** Internally treated as external provider faults rather than recognizing infrastructure capacity issues. - **Insufficient Monitoring:** Lack of alerts and monitoring for proxy resource saturation conditions. - **Load-Testing Gap:** Prior load tests did not replicate proxy resource constraints encountered in production scenarios.
SIP calls abruptly closing after 30 seconds https://status.vapi.ai/incident/564570 Tue, 13 May 2025 17:27:00 -0000 https://status.vapi.ai/incident/564570#e72777b6e3107e381cea216b653c43b3b616381eec36f1b651c574d2c2f14dc3 # RCA: SIP Calls Ending Abruptly ## TL;DR A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts. ## Timeline (PT) - **May 12, ~9:00 pm:** SIP node rotated and Elastic IP reassigned, but SIP service was not restarted. - Calls appeared to succeed initially because they were routed through a healthy SIP node. - **May 13, 12:44 pm:** Customer reports SIP calls consistently failing after approximately 30-31 seconds. - **May 13, 12:49 pm:** SIP service restarted; customer confirms issue resolved. ## Impact - 35 calls experienced "ACK timeout" failures, corresponding directly to failed customer calls. ## What Went Wrong? - Lack of monitoring and alerting for SIP-related failures. - Issue persisted unnoticed for approximately 3 hours. - Customer reported issue first, not internal systems. - Absence of documented runbooks for SIP node rotation process. - No load test conducted following node rotation to verify successful SIP routing. ## What Went Well? - Rapid issue remediation following customer escalation. Stale data for Weekly users https://status.vapi.ai/incident/564566 Tue, 13 May 2025 17:22:00 -0000 https://status.vapi.ai/incident/564566#3d0b5fba07db1ededde19ffe44c56fed593a87eeb648c94f51a0e3bf1c303c80 # RCA: Phone Number Caching Error in Weekly Environment ## TL;DR Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers. ## Timeline (PT) - **May 10, 1:26 am:** Caching re-enabled for users in daily environment using the feature flag. - **May 13, 10:42 am:** Customer reports phone calls referencing outdated server URLs after updates. - **May 13, 11:18 am:** Caching disabled globally; service fully restored. - **May 13, ~10:00 pm:** Fix deployed to weekly environment; caching globally re-enabled. ## Impact - Customers experienced degraded service; updates to server URLs or assistant configurations for phone numbers did not immediately reflect during calls. - Issue previously identified and resolved in daily environment resurfaced in weekly due to incomplete implementation of the feature flag. ## What Went Wrong? - Inadequate testing of the feature flag allowed unintended caching on some paths. - Lack of proper failure handling when organization ID was missing. - Issue surfaced through customer reporting, not internal monitoring. - Fix deployed to daily environment was not applied to weekly environment in time. ## What Went Well? - Feature flag system allowed rapid disabling of caching globally once identified.
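The caching RCA above comes down to two guards: never cache when the organization ID is missing, and scope cache keys by organization. A minimal sketch under those assumptions, with hypothetical names (`CacheClient`, `lookupPhoneNumber`) rather than Vapi internals:

```typescript
// Hypothetical cache interface and lookup function, for illustration only.
interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getPhoneNumberConfig(
  cache: CacheClient,
  orgId: string | undefined,
  phoneNumberId: string,
  lookupPhoneNumber: (id: string) => Promise<string>,
): Promise<string> {
  // Fail closed: without an org ID we cannot resolve the org's channel,
  // so skip the cache entirely rather than serving a stale or cross-org entry.
  if (!orgId) return lookupPhoneNumber(phoneNumberId);

  const key = `org:${orgId}:phone:${phoneNumberId}`;
  const cached = await cache.get(key);
  if (cached !== null) return cached;

  const fresh = await lookupPhoneNumber(phoneNumberId);
  await cache.set(key, fresh, 60); // short TTL so phone-number updates propagate quickly
  return fresh;
}
```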
Voice issues due to 11labs quota https://status.vapi.ai/incident/564580 Sat, 10 May 2025 17:34:00 -0000 https://status.vapi.ai/incident/564580#261e78c84237f682a6bed6058927d62f3e9c35962e9c599cc1c04a94ce3185ef # RCA: 11Labs Voice Issue ## TL;DR Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (`vapifault-eleven-labs-quota-exceeded`). ## Timeline of Events (PT) - **12:04 PM:** Calls begin failing due to 11Labs quota being exceeded. - **12:16 PM:** Customer reports the issue as a production outage. - **12:24 PM:** Contacted 11Labs support regarding quota exhaustion. - **12:25 PM:** 11Labs support recommends enabling usage-based billing. - **12:26 PM:** Usage-based billing activated; issue resolved immediately. ## Root Cause Analysis - The incident occurred because the monthly quota limit for 11Labs voice services was reached. - Example error log: ``` { "message": "This request exceeds your quota of 2000000000. You have 4 credits remaining, while 23 credits are required for this request.", "error": "quota_exceeded", "code": 1008 } ``` ## What Went Wrong? - Lack of proactive alerting: No paging occurred because logs were being sampled and adequate monitors were not in place in the new logging system. - Initial difficulty diagnosing the issue quickly due to limited familiarity with the new logging tool (Axiom). ## What Went Well? - Rapid response and effective support provided by the external vendor (11Labs). - Swift resolution once the problem was clearly identified.
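Given the `quota_exceeded` error shape shown in the RCA above, a direct check on the error payload can page immediately instead of relying on sampled logs. A hedged sketch, with a placeholder `page` function that is not part of any real alerting API:

```typescript
// Shape taken from the example error log in the RCA above.
interface ProviderError {
  message: string;
  error: string;
  code: number;
}

// Type guard: true when the error matches the quota_exceeded payload (code 1008).
function isQuotaExceeded(err: unknown): err is ProviderError {
  const e = err as Partial<ProviderError> | null;
  return !!e && e.error === "quota_exceeded" && e.code === 1008;
}

// Placeholder handler: page on every occurrence, then rethrow so the caller still sees the failure.
async function handleVoiceProviderError(err: unknown, page: (msg: string) => Promise<void>): Promise<never> {
  if (isQuotaExceeded(err)) {
    await page(`11Labs quota exceeded: ${err.message}`);
  }
  throw err;
}
```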
Vapi Docs recovered https://status.vapi.ai/ Tue, 06 May 2025 15:32:18 +0000 https://status.vapi.ai/#5fcdf80403905f0d2e55a6e0f4a86ea69e3aeb197e039a716816c33bc9b5e808 Vapi Docs recovered Vapi Docs went down https://status.vapi.ai/ Tue, 06 May 2025 15:25:30 +0000 https://status.vapi.ai/#5fcdf80403905f0d2e55a6e0f4a86ea69e3aeb197e039a716816c33bc9b5e808 Vapi Docs went down Upgrading Weekly Cluster https://status.vapi.ai/incident/556160 Sun, 04 May 2025 04:20:35 +0000 https://status.vapi.ai/incident/556160#12a7f2a1c6ad75fe56ee8c1cc3f8ec353ced88f88977ec7389add79d2764ed1d Maintenance completed Upgrading Weekly Cluster https://status.vapi.ai/incident/556160 Sun, 04 May 2025 03:20:35 -0000 https://status.vapi.ai/incident/556160#176bdbb88e6591794fa4861760d76610dcdb2b2b3adb38b61804bb7294ef3408 Regular upgrades to cluster API degradation https://status.vapi.ai/incident/555711 Sat, 03 May 2025 01:43:00 -0000 https://status.vapi.ai/incident/555711#b50995150e422613d2e9649e412f60b6d2c2e213de21c95657ca0bee4cd85a62 # RCA for May 2nd: User error in manual rollout ## Root cause: * User error in kicking off a manual rollout, driven by unblocking a release * Due to this, load balancer was pointed at an invalid backend cluster ## Timeline * 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout * 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release * 5:43pm PT: On-call was paged with issue in rollout manager, engineering team internally escalated downtime * 5:45pm PT: Infra engineer fixed misconfigured rollout and confirmed load balancer was correctly pointed * 5:50pm PT: Engineering team manually tested API and calls were working again ## Impact * Calls, API and dashboard were down or degraded for up to 15 minutes * User experience was disrupted temporarily; Issue reported internally and by self-serve users ## What went wrong? * We rushed through a manual rollout, which is gated to Infra team * Manual rollout tools did not catch user error ## What went well?
* Our pagers flagged this issue * Team responded quickly and was able to mitigate * Status page was put up proactively ## Action Items: * Update manual deployment tools to avoid such user error [Done] * Expand rollout auto-blocking mechanism to incorporate other pages [Done] * Better documentation for rollout/rollback steps * Further lock down manual deployment, gate behind approval by 1 more infra eng API degradation https://status.vapi.ai/incident/555711 Sat, 03 May 2025 00:54:00 -0000 https://status.vapi.ai/incident/555711#941db56004b882c6868abf8d318191a9e40aa3ab688e2c3371efb3b3e14e30cb We identified the root cause of the issue in a bad deployment. The team rolled out a fix. API is fully operational again. API degradation https://status.vapi.ai/incident/555711 Sat, 03 May 2025 00:44:00 -0000 https://status.vapi.ai/incident/555711#1d4eec1b4b76d241b314a7b5fbf853dab0a6e279e471b6e70eb9a44ad1794bb8 Some API endpoints may be unavailable. Team is working on implementing a fix.
OpenAI recovered https://status.vapi.ai/ Wed, 30 Apr 2025 07:18:42 +0000 https://status.vapi.ai/#7ae0d628e4ced84be564752224888ce0edd42ce9ecec024e03c8b14f666edd88 OpenAI recovered OpenAI went down https://status.vapi.ai/ Wed, 30 Apr 2025 07:08:44 +0000 https://status.vapi.ai/#7ae0d628e4ced84be564752224888ce0edd42ce9ecec024e03c8b14f666edd88 OpenAI went down Call Recordings May Fail For Some Users https://status.vapi.ai/incident/554190 Wed, 30 Apr 2025 06:59:00 -0000 https://status.vapi.ai/incident/554190#1b2dcbdcfa02f1a954270b09d42cbb49e4d26471f65a9e1d507be04a7c4ee003 We have resolved the issue. Will upload RCA 04/30 noon PST. TL;DR: Recordings weren't uploaded to object storage due to some invalid credentials. We generated and applied new keys. Call Recordings May Fail For Some Users https://status.vapi.ai/incident/554190 Wed, 30 Apr 2025 05:30:00 -0000 https://status.vapi.ai/incident/554190#b229b2b7fb742823a50c60693da240022665625c5e5eb668353dae18e951f0c4 Some users may not receive call recordings due to an issue with our Cloudflare R2 storage; the team is deploying a fix now. Auth DB restart https://status.vapi.ai/incident/551227 Fri, 25 Apr 2025 05:05:00 +0000 https://status.vapi.ai/incident/551227#49065f870fb9a83bfa462886de46b0c268d873ca4521b51955009738e2354497 Maintenance completed Auth DB restart https://status.vapi.ai/incident/551227 Fri, 25 Apr 2025 05:00:26 -0000 https://status.vapi.ai/incident/551227#c526351a07064d68239f280afc8ee5accf115b082c606cd096653802db39fb5c We will be performing a brief restart of our authentication database to accommodate increased scale. This maintenance is expected to complete within one minute. We appreciate your patience and apologize for any inconvenience. It should only impact sign-in and sign-up on the dashboard. Calls and other APIs will not be impacted by it. Increased 404 Errors Related to Phone Numbers Found https://status.vapi.ai/incident/548968 Tue, 22 Apr 2025 11:39:00 -0000 https://status.vapi.ai/incident/548968#6ee37d23de0acc507bd851bf4b287a15d40291ab680b473a9e078dd55eb955ff We have identified the issue and resolved it. We will update by noon PST with an RCA. TL;DR: Adding a new CIDR range to our SIP cluster caused issues where the servers were unable to discover each other. Increased 404 Errors Related to Phone Numbers Found https://status.vapi.ai/incident/548968 Tue, 22 Apr 2025 09:58:00 -0000 https://status.vapi.ai/incident/548968#8e63a9c5ea0b4848812e7e2e48e050fad01b893db1a2276519f77b5a1c082478 We are seeing an increase in 404 responses for SIP outbound calls.
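For reference, the call-recording incident above was resolved by rotating object storage credentials. Cloudflare R2 exposes an S3-compatible API, so an upload with fresh keys can look like the sketch below; the bucket name, account ID, and environment variable names are placeholders, not Vapi's configuration.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 is S3-compatible: point the client at the account's R2 endpoint and
// supply the newly issued (rotated) access keys.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,        // rotated key (placeholder env var)
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!, // rotated secret (placeholder env var)
  },
});

export async function uploadRecording(key: string, body: Buffer): Promise<void> {
  await r2.send(
    new PutObjectCommand({
      Bucket: "call-recordings", // placeholder bucket name
      Key: key,
      Body: body,
      ContentType: "audio/wav",
    }),
  );
}
```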
Vapi DB recovered https://status.vapi.ai/ Sun, 20 Apr 2025 01:46:36 +0000 https://status.vapi.ai/#ee6a0c4fff658805034aef7465c635f6951205dc84566e2fac2a09ad705b9402 Vapi DB recovered Vapi DB went down https://status.vapi.ai/ Sun, 20 Apr 2025 01:34:35 +0000 https://status.vapi.ai/#ee6a0c4fff658805034aef7465c635f6951205dc84566e2fac2a09ad705b9402 Vapi DB went down Upgrading Weekly API https://status.vapi.ai/incident/545796 Tue, 15 Apr 2025 19:12:26 +0000 https://status.vapi.ai/incident/545796#59b41a2460ef9a82529c52eccff1d232da58444a0efdbc71c5a18a5dd6ce04f4 Maintenance completed Upgrading Weekly API https://status.vapi.ai/incident/545796 Tue, 15 Apr 2025 19:12:26 -0000 https://status.vapi.ai/incident/545796#cbdf444c8b6a535dc24ab371e6773fbd6a3fa000638912c819eb257d4594eed5 Applying performance optimizations Upgrading Weekly API https://status.vapi.ai/incident/545796 Tue, 15 Apr 2025 18:21:20 -0000 https://status.vapi.ai/incident/545796#cbdf444c8b6a535dc24ab371e6773fbd6a3fa000638912c819eb257d4594eed5 Applying performance optimizations Increased 480 Temporarily Unavailable cases for SIP inbound https://status.vapi.ai/incident/537355 Tue, 08 Apr 2025 05:00:00 -0000 https://status.vapi.ai/incident/537355#66648dafa6613540b5807d7461c079b40e66e08ac81f196e80b86e9bcca9b0b9 For the RCA please check out https://status.vapi.ai/incident/528384?mp=true SIP calls failing intermittently https://status.vapi.ai/incident/536229 Tue, 08 Apr 2025 05:00:00 -0000 https://status.vapi.ai/incident/536229#2d065523dbd765e411722438354438b137f58c1e2647772257b547d50909a2a0 For the RCA please check https://status.vapi.ai/incident/528384?mp=true SIP call failures to connect https://status.vapi.ai/incident/528384 Tue, 08 Apr 2025 04:56:00 -0000 https://status.vapi.ai/incident/528384#753181a59c4b65690dfae03b748282cb4abddd437fd10b225ba7d19ec33062a4 # RCA for SIP Degradation for sip.vapi.ai **TLDR;** The Vapi SIP service (sip.vapi.ai) was intermittently throwing errors and failing to connect calls. We had some major flaws in our SIP infrastructure, which were resolved by rearchitecting it from scratch. **Impact** - Calls to Vapi SIP URIs or Vapi phone numbers were failing to connect with 480/487/503 errors - Inbound calls to Vapi were getting connected but no audio came through, eventually causing silence timeouts or customer-did-not-answer - Outbound calls from Vapi numbers or custom SIP trunks were mostly unimpacted throughout the migration, but we recently added rate limiting that could have caused 429s on Vapi call creation. - Around 1% of calls were failing intermittently, with the failure rate briefly going up to 10% at times. **Root Cause** - In order to scale out our SIP infrastructure, Vapi moved to a Kubernetes-based SIP deployment back in mid-January. - SIP networking in Kubernetes was complex to get right; we released multiple fixes throughout February and mid-March and operated the service at a satisfactory level, but with intermittent failures. - Periods of degraded experience during this time were specifically due to networking errors between different components of our SIP infrastructure. Most of the time we were able to resolve issues as they occurred by restarting services, releasing patches, blocking malicious traffic, scaling out more, etc. - By mid-March we realized that the Kubernetes deployment was not going to be stable and started devising a new infrastructure for SIP. We started migrating SIP to a more stable autoscaling-group-based deployment on March 31st, and continued doing so over the next day or two.
- The team monitored the new deployment very closely, and kept releasing patches for every small failure that we saw. - The new deployment has been looking great so far. **What went poorly?** - We took too long to decide to pull the plug on our Kubernetes deployment. - Users were impacted intermittently, and SIP reliability was not at the level we aspire to. **Remediations** - The SIP infrastructure was revamped to an autoscaling-group-based deployment, which is more stable. - Audit each error case and apply immediate fixes where needed. - Add better monitoring and telemetry across the SIP infrastructure to make sure we catch issues and act on them preemptively. SIP call failures to connect https://status.vapi.ai/incident/528384 Mon, 07 Apr 2025 22:48:00 -0000 https://status.vapi.ai/incident/528384#e6d5a248a1a10c032fda3b6a63c1f8bd0298a760b4f6f6e0cebab46ca2aaeefe SIP infrastructure has been upgraded on our side. So far we are seeing good performance from it. Degradation in phone calls stuck in queued state. https://status.vapi.ai/incident/540048 Fri, 04 Apr 2025 19:00:00 -0000 https://status.vapi.ai/incident/540048#2ea1a7e4e8cd896b6a24b52b37215da340fc6db4cf79b9b80edbd5deccd45a87 Resolved the issue, blocked the offending user, and reviewed rate limits Degradation in phone calls stuck in queued state. https://status.vapi.ai/incident/540048 Fri, 04 Apr 2025 18:19:00 -0000 https://status.vapi.ai/incident/540048#aecb92f6c56c7c9bce7cbbe0e565ff985bcab3316646013f1660a376bfe60c33 We're actively investigating the issue that popped up in the last 15 minutes Degradation in API https://status.vapi.ai/incident/540074 Fri, 04 Apr 2025 16:00:00 -0000 https://status.vapi.ai/incident/540074#7ee85a7a1b2d3c3480f8d9dc901a2d8b9e8232c70c2c23bda25a7e90e8ae72b9 API rollback completed and errors subsided Degradation in API https://status.vapi.ai/incident/540074 Fri, 04 Apr 2025 15:30:00 -0000 https://status.vapi.ai/incident/540074#ded364dce40721ff9dd517f2ae80073fb832a509441bf14691158fec269fc45c The API was degraded Friday morning; the team was proactively notified via monitors and started a rollback Intermittent 503s in api https://status.vapi.ai/incident/538915 Thu, 03 Apr 2025 18:00:00 -0000 https://status.vapi.ai/incident/538915#b03abd8479566b3208a6f44f7f9b15ef97e239ac6e2c6926a18005ce835c2784 The improvements we shipped reliably fixed the issue. The team has commenced medium-term improvements and is investigating long-term scalability improvements. Intermittent 503s in api https://status.vapi.ai/incident/538915 Thu, 03 Apr 2025 06:11:00 -0000 https://status.vapi.ai/incident/538915#7079f58c892acac25bf58e2c6298fe578bc3d7634a64639269b97386eee4b172 We have identified the issue, pushed a fix, and are monitoring for improvements. Intermittent 503s in api https://status.vapi.ai/incident/538915 Wed, 02 Apr 2025 21:14:00 -0000 https://status.vapi.ai/incident/538915#cf8653db24891f4f17b2eb9e37d9cff900cde73758ce40e9eae71ddc09261123 We are investigating increased cases of 503s in our APIs.
Experiencing Anthropic rate limits on model calls https://status.vapi.ai/incident/538378 Wed, 02 Apr 2025 03:04:00 -0000 https://status.vapi.ai/incident/538378#3acad9ddb368e539ac5f693fa468542217dab886f764aa61cf028b0eb6292f3d Anthropic rate limiting is resolved after raising our quota Experiencing Anthropic rate limits on model calls https://status.vapi.ai/incident/538378 Wed, 02 Apr 2025 02:04:00 -0000 https://status.vapi.ai/incident/538378#259b33d5048fca9cc337efee3a521e568c7b6f808aa36c8d76530a302af7747b Assistants using Anthropic models with Vapi-provided API keys are intermittently experiencing rate limits. Those using bring-your-own API keys are unaffected Increased 480 Temporarily Unavailable cases for SIP inbound https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 16:00:00 -0000 https://status.vapi.ai/incident/537355#f1009d6ce8136f1b6f0f47bc5732d6ce1d4be26236e6c8c293e1c30981ac6835 The issue should be resolved now; we will be publishing an RCA for it later today. Sorry for the disruption. Increased 480 Temporarily Unavailable cases for SIP inbound https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 14:37:00 -0000 https://status.vapi.ai/incident/537355#e49835aa26d3eb29f49d8bf4400a8988fda8c1803313c420a952141400483b9c We have identified the problem and are working on a fix. Increased 480 Temporarily Unavailable cases for SIP inbound https://status.vapi.ai/incident/537355 Mon, 31 Mar 2025 13:47:00 -0000 https://status.vapi.ai/incident/537355#d2ca637258d662cb539ef70433938595feb389e08e5540531ad5d7a4ea70e80b We are seeing increased cases of 480 Temporarily Unavailable for SIP inbound and are investigating on priority. SIP calls failing intermittently https://status.vapi.ai/incident/536229 Sun, 30 Mar 2025 15:50:00 -0000 https://status.vapi.ai/incident/536229#551cc08e8c74b40bf6836dba336f9e1e50223318e8fe705d70faf66c39a006b6 This should be resolved. We will be posting an RCA soon. SIP calls failing intermittently https://status.vapi.ai/incident/536229 Fri, 28 Mar 2025 22:18:00 -0000 https://status.vapi.ai/incident/536229#84c0483500f654dbdb73f64e9a79a4333e139b147577c1ae0b2cf96098f0fc30 We are seeing a degradation in our SIP service and are working towards resolving it on priority. Some SIP calls have longer reported call duration than reality https://status.vapi.ai/incident/536225 Fri, 28 Mar 2025 22:10:00 -0000 https://status.vapi.ai/incident/536225#fe8faed0eb08a1339edfea76baa65d8b82e194c0aaff43e1bb7ec21de1265861 Between 2025/03/27 8:40 PST and 9:35 PST, a small portion of SIP calls had their call durations initially inflated due to an internal system hang. The call duration information has been fixed retroactively. Upgrade SIP infrastructure https://status.vapi.ai/incident/534587 Thu, 27 Mar 2025 04:00:00 +0000 https://status.vapi.ai/incident/534587#79f931c17a5809e6b7b660fbfa1ee7dc72aa3d12ba2b59c240ec234e4da1f44a Maintenance completed Upgrade SIP infrastructure https://status.vapi.ai/incident/534587 Thu, 27 Mar 2025 02:00:30 -0000 https://status.vapi.ai/incident/534587#c50935ef7ac59cbe45242e553e39517657dfe860a347fe46bced63ac3a10d633 We are rolling out some major infra changes to our SIP infrastructure that should make it more stable. There should not be any downtime, but there could be some call drops for calls that rely on SIP during the infrastructure rollout.
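Relating to the Anthropic rate-limit incident above: callers hitting provider rate limits through a shared key can often ride out the degradation by retrying 429 responses with backoff (or by switching to a bring-your-own API key, as the update notes). A rough, hedged sketch; the endpoint, header, and payload below are placeholders, not Vapi's or Anthropic's actual API shape.

```python
import time

import requests

# Placeholder endpoint and payload; substitute the real provider call.
URL = "https://api.example.com/v1/messages"
MAX_ATTEMPTS = 5


def post_with_backoff(payload: dict, api_key: str) -> requests.Response:
    """Retry on HTTP 429, honoring Retry-After when present, exponential backoff otherwise."""
    delay = 1.0
    resp = None
    for _ in range(MAX_ATTEMPTS):
        resp = requests.post(URL, json=payload, headers={"x-api-key": api_key}, timeout=30)
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay = min(delay * 2, 30)  # cap the backoff
    return resp


if __name__ == "__main__":
    print(post_with_backoff({"prompt": "hello"}, api_key="sk-...").status_code)
```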
API degradation https://status.vapi.ai/incident/533963 Tue, 25 Mar 2025 04:33:00 -0000 https://status.vapi.ai/incident/533963#5b9421adfa44baa947ef15f07c9dc2e817eb3967d0bf31940741a43f3d17111d # TL;DR After deploying recent infrastructure changes to backend-production1, Redis Sentinel pods began restarting due to failing liveness checks (`/health/ping_sentinel.sh`). These infra changes included adding a new IP range, causing all cluster nodes to cycle. When Redis pods restarted, they continually failed health checks, resulting in repeated restarts. A rollback restored API functionality. The entire cluster is being re-created to address DNS resolution failures before rolling forward. # Timeline 1. March 30th: New IP range and subnets added. 2. March 24th, 3:55 PM: Deployment to backend-production1 initiated. 3. March 24th, 4:14 PM: Deployment completed. - Immediate increase in Redis errors observed in API pods. - API pods scaled dramatically and restarted frequently. - API service degraded with significant timeouts. 4. March 24th, 4:19 PM: Rollback initiated. 5. March 24th, 4:27 PM: Rollback completed; API service fully restored. # Resolution A rollback to the previous stable configuration resolved the immediate API timeout issues. The complete cluster re-creation is underway to permanently resolve underlying DNS resolution failures related to the new IP range before future deployments. # Impact - Approximately 2.67k API requests failed (5xx responses) or timed out. - Impacted areas included logs and database write operations. - Errors included Redis AudioCache failures, API database connection issues, and aborted API requests due to timeouts. # Root Cause The rollout caused a rotation of all cluster nodes due to subnet changes tied to the new IP range. DNS resolution failures associated with this new IP range caused Redis I/O operations to block on TCP connections, resulting in prolonged hanging TCP connections. These hanging connections intermittently caused Redis pods to fail liveness checks, resulting in continuous restarts. API pods, maintaining open connections to Redis, experienced similar blockages, leading to extensive API request timeouts and service degradation. The permanent resolution involves recreating the cluster entirely to address these DNS resolution issues comprehensively. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 API degradation https://status.vapi.ai/incident/533963 Tue, 25 Mar 2025 04:14:00 -0000 https://status.vapi.ai/incident/533963#d2116c92fbee55847c13856703ca8232453f5db0acec0df0f2186c2e192d4652 The API is in a degraded state, as identified by our monitors. We're rolling back to the previous cluster Call worker degradation https://status.vapi.ai/incident/533837 Mon, 24 Mar 2025 23:45:00 -0000 https://status.vapi.ai/incident/533837#12858108fe1361baf3438fe8039eb9fe87953933b3bb4879049a2b320e2ed736 Issue was mitigated via rollback. We're investigating and will update with an RCA Call worker degradation https://status.vapi.ai/incident/533837 Mon, 24 Mar 2025 23:39:00 -0000 https://status.vapi.ai/incident/533837#6bb002ea5b46b98adea661eada6c099cf89c0efaf0b36bb975c1af2c5a9bd48a After the most recent deploy, we noticed degradation in the call initiation API.
Changes were immediately rolled back; we are investigating the issue Cloudflare R2 storage is degraded, causing call recording upload failures https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:55:00 -0000 https://status.vapi.ai/incident/532433#442762097ac96476ba0ffa69f14ee9c855d2a382c01ac84db39cb67e7bc970df Recording upload errors have recovered. We are continuing to monitor Cloudflare R2 storage is degraded, causing call recording upload failures https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:54:00 -0000 https://status.vapi.ai/incident/532433#69fe1ea7738194600d901d54d1ae6e5831ca413d9cf63eec3b7e34268c00bbff Root issue has been fixed by Cloudflare. We are now monitoring Cloudflare R2 storage is degraded, causing call recording upload failures https://status.vapi.ai/incident/532433 Fri, 21 Mar 2025 22:16:00 -0000 https://status.vapi.ai/incident/532433#9f2e22a7b17e08620a8b867fe72a17a0f749ad7d3b6787ea3f4a85b81ffe3d6a Call recording uploads are failing due to degradation in Cloudflare R2 (our default storage provider). See https://www.cloudflarestatus.com/ Google Gemini Voicemail Detection is intermittently failing https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 23:05:00 -0000 https://status.vapi.ai/incident/530911#66ebc701f007d9f872487591f8b5b0a84e4ec0d2aab214a41bb6f5345e26aeb5 # TL;DR It was decided that we should make Google Voicemail Detection the default option. On 16th March 2025, a PR was merged which implemented this change. This PR was released into production on 18th March 2025. On the morning of 19th March 2025, it was discovered that customers were experiencing call failures due to this change. Specifically: Google VMD was turned on by default, with no obvious way to disable it via the dashboard. Google VMD generated false positives when the bot identified itself as a bot. # Timeline in PST - **16th March 2025**: the offending PR is merged. - **18th March 2025, 3:08 PM**: the offending PR is released to production. - **19th March 2025, 8:52 AM**: Vapi Eng bot reports an incident: [https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239](https://vapi-ai.slack.com/archives/C06GT64R399/p1742399522864239) - **19th March 2025, 9:18 AM**: It is determined that the issue is likely caused by Gemini VMD. - **19th March 2025, 10:04 AM**: Production is rolled back, immediately resolving the issue. - **19th March 2025, 11:00 AM**: Hotfix is committed to production. # Root Cause Several issues were identified: - Google VMD should not have been set as the default option. Any non-essential feature should be disabled by default. - From a dashboard perspective, `"undefined"` should always imply `"off"`. Additionally: - Google VMD produced false positives whenever the bot revealed itself as an AI or otherwise implied it was non-human. Examples: - *"Thank you for calling Jim Adler and Associates! I’m Kendall, an AI assistant. This call may be recorded for quality and training purposes as well as to help direct your information to the right person. I’m here to answer questions or book appointments—how may I assist you?"* - *"Thank you for calling Max Electric! This call is being recorded for quality and training purposes. You are calling outside of our business hours. This is Matthew. Please let me know how I can help!"* This appears to be an edge case identifiable primarily through actual usage. # What went poorly? - A non-essential feature was set as a default option. # What went well? - The issue was taken seriously as soon as it was identified.
- The root cause was quickly discovered. # Remediation - Production was rolled back promptly. - A hotfix was implemented to stabilize production (ensuring Google VMD is no longer the default). - A longer-term fix has been developed to mitigate false positives. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Google Gemini Voicemail Detection is intermittently failing https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 19:30:00 -0000 https://status.vapi.ai/incident/530911#ec8e1c5412921cd596f60c2f9840bb4bd435277f9c4f452ed9450435213c0c86 We have released a fix for this issue Google Gemini Voicemail Detection is intermittently failing https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 18:30:00 -0000 https://status.vapi.ai/incident/530911#ee78bf40ab0ecfe808b7482abf9a5124388aa9c4ad4f18bf78bafced9c7805bf We have identified the root cause and rolled back. We are working on a fix. Google Gemini Voicemail Detection is intermittently failing https://status.vapi.ai/incident/530911 Wed, 19 Mar 2025 16:55:00 -0000 https://status.vapi.ai/incident/530911#676c1ca5859b5dc86943b10903b1db87ed264c8ab60a7ab281ede3a4db229708 Google VMD is intermittently flagging ongoing calls as "voicemail" and causing them to end with customer-did-not-answer. We are investigating and will have an update by 12pm PST latest. Users can resolve this by using an alternate VMD provider (Twilio or OpenAI). Intermittent errors during end calls. https://status.vapi.ai/incident/530440 Tue, 18 Mar 2025 23:36:00 -0000 https://status.vapi.ai/incident/530440#7b2472782c79b949c0029488eabc7eadbb2f56462478ffad4347bcea133a4db8 Resolved now. **RCA:** **Timeline (in PT)** 4:10pm New release went out for a small percentage of users. 4:15pm Our monitoring picked up increased errors in ending calls. 4:34pm Release was auto rolled back due to increased errors and the incident was resolved. **Impact** Calls ended with unknown-error. The end-of-call report was missing. **Root cause:** A missing DB migration caused issues in fetching data during end of call. **Remediation:** Add a CI check to make sure we don't release code when the dependent DB migration hasn't been run yet. Intermittent errors during end calls. https://status.vapi.ai/incident/530440 Tue, 18 Mar 2025 23:29:00 -0000 https://status.vapi.ai/incident/530440#79fefdc9ff10b470d0cfd40ce5b628bd6db5f77c944baa1d396f13e97f1fcac1 We are investigating increased cases of call drops. We will post updates soon. sip.vapi.ai degradation https://status.vapi.ai/incident/527911 Tue, 18 Mar 2025 04:00:00 -0000 https://status.vapi.ai/incident/527911#ef07f800adf393fcc98a64802be7687b7f319b6f4bb9c061e26153b7bb9adb48 **RCA: SIP 480 Failures (March 13-14)** **Summary** Between March 13 and 14, SIP calls intermittently failed due to recurring 480 errors. This issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. As a temporary mitigation, restarting the SBC service resolved the issue. However, a long-term fix is planned, involving a transition to a more stable Auto Scaling Group (ASG) deployment. **Incident Timeline** (All times in PT) **March 13, 2025** 07:00 AM – SIP SBC pod starts showing symptoms of failure to connect to the SIP inbound pod, resulting in intermittent 480 errors. 01:19 PM – A customer reported an increase in 480 SIP errors, prompting escalation to the infrastructure team. 01:30 PM – The infrastructure team took corrective action, and service was restored.
**March 14, 2025** 07:30 AM – Similar issue recurred, triggering monitoring alerts. 08:30 AM – The infrastructure team was engaged for remediation as failures persisted. 08:43 AM – The affected SIP SBC pod was deleted, restoring service. 09:43 AM – The issue reappeared, requiring repeated manual intervention. Additional occurrences throughout the day: 11:10 AM – 11:17 AM 12:03 PM – 12:09 PM 01:04 PM – 01:22 PM 02:08 PM – 02:37 PM **Challenges Identified** The failures appear to be due to broken connections between services; there were no health checks to keep the connections intact. Increased frequency – The number of occurrences was higher than usual, impacting many customers. Delayed response on Day 1 – The application remained in a somewhat degraded state for six hours before customer escalation prompted action. **Positive Takeaways** *Effective monitoring* – Alerts triggered as expected, enabling swift identification of the issue. *Improved response time on Day 2* – The team responded more promptly to subsequent incidents. **Remediation Actions Taken** *Enhance alerting mechanisms* – Modified alerts to periodically refire when in an alarm state, ensuring timely on-call responses. *Transition to ASG-based deployment* – Move SIP workloads from Kubernetes to an ASG-based infrastructure for improved stability. *Health check* - Add a health check between the two services so that the system is able to auto-heal in case the issue recurs. Vapi workers not connecting due to lack of workers https://status.vapi.ai/incident/528459 Tue, 18 Mar 2025 03:56:00 -0000 https://status.vapi.ai/incident/528459#ccfc74f291896ec45c5bcfb460057233fe498e7e76df44ec14428e5a8912899b # TL;DR Weekly Cluster customers saw vapifault-transport-never-connected errors due to workers not scaling fast enough to meet demand # Timeline in PST * 7:00am - Customers report an increased number of vapifault-transport-never-connected errors. A degradation incident is posted on BetterStack * 7:30am - The issue is resolved as call workers scaled to meet demand # Root Cause - Call workers did not scale fast enough on the weekly cluster # Impact There were 34 instances of vapifault-transport-never-connected errors, meaning there were 34 calls that failed due to the issue. # What went poorly? - We were unable to detect the issue before customers did # What went well? - The solution was straightforward → Pre-scaling workers on the Weekly Cluster # Remediation - Pre-scaling workers on all clusters to prevent vapifault errors - Increase size of worker nodes to aid in scaling, by allowing more call workers to fit per node - Increase sensitivity of pipeline error monitors / Dedicated monitor for vapifault errors If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 SIP call failures to connect https://status.vapi.ai/incident/528384 Mon, 17 Mar 2025 21:30:00 -0000 https://status.vapi.ai/incident/528384#c8a19878b8f61e220568cdde52bad5097d7978cbb45782a204f18af41c0a44b3 Degrading sip.vapi.ai instead of api.vapi.ai, as only the SIP part is currently impacted. Increased error in calls https://status.vapi.ai/incident/528764 Sat, 15 Mar 2025 19:37:00 -0000 https://status.vapi.ai/incident/528764#6216324b5366963ed4acf93085cf03de464b1cfd4d0c2ca4fc07b9f8e71bb6d7 The issue has subsided; we experienced a brief spike in call initiations and didn't scale up fast enough. In the immediate term, we're vertically scaling our call worker instances.
In the near term, we're rolling out our new call worker architecture for rapid scaling Increased error in calls https://status.vapi.ai/incident/528764 Sat, 15 Mar 2025 19:17:00 -0000 https://status.vapi.ai/incident/528764#916ccc30be9d84f8a3231312ad662bf7264510f69add0e6a3a9404b3052f96d0 Users are experiencing `vapifault-transport-never-connected` errors NeonDB Scheduled Maintenance: DB endpoint restart https://status.vapi.ai/incident/526599 Sat, 15 Mar 2025 16:00:00 +0000 https://status.vapi.ai/incident/526599#d38e9dba97b7b5fd398c8463f859734330a37f9410150974742cc9efbbf0a6ba Maintenance completed NeonDB Scheduled Maintenance: DB endpoint restart https://status.vapi.ai/incident/526599 Sat, 15 Mar 2025 12:00:00 -0000 https://status.vapi.ai/incident/526599#c946fab547dc33ed64fc8868cb81d6bf91fe1f61a670ccfb70e5d29e4d4a3e81 Neon is doing scheduled maintenance in our region `us-west-2`: https://neonstatus.com/aws-us-west-oregon/incidents/01JP2WGPKFV2GDV4QSKV8F8NGP. This will require a restart of our endpoint that will result in seconds of downtime. We have marked off the block of time in which this restart will likely happen. SIP call failures to connect https://status.vapi.ai/incident/528384 Sat, 15 Mar 2025 01:23:00 -0000 https://status.vapi.ai/incident/528384#f8744b07ba4dbc391f26b4c8250a4d5a3f9ac0454fb53e7833d24662a9de0904 SIP service has faced partial degradation multiple times in the last day. Things are looking stable now, but we are keeping the incident open until we roll out a major infra-level change that is going to solve it for good. We apologize for this inconvenience and are working with urgency to solve the issue permanently. Here's the timeline of the issue for today (in Pacific Time): 7:30am SBC pod not able to connect to the SBC inbound pod, resulting in 480 errors. Our monitoring picks it up. 8:30am Infra team is pulled in for remediation as the failures don't stop for a while. 8:43am The faulty SIP SBC pod was deleted and the service was restored.
9:43am The same issue pops up again and a manual action is taken to restore the service every time. More instances of the same issue pop up multiple times throughout the day. 11:10 - 11:17am 12:03pm - 12:09pm 1:04pm - 1:22pm 2:08pm - 2:37pm Investigating GET /call/:id timeouts https://status.vapi.ai/incident/528345 Sat, 15 Mar 2025 00:00:00 -0000 https://status.vapi.ai/incident/528345#d6abbda6c82b290abd438b92f2b3b8823911eb0cc22058d1980c8b5243c2f648 We are working with impacted customers to investigate but have not seen this issue occurring regularly. SIP call failures to connect https://status.vapi.ai/incident/528384 Fri, 14 Mar 2025 23:36:00 -0000 https://status.vapi.ai/incident/528384#ebcf7a45adbe91bb166ee13fed0f3bcc29307afcf44364c77a8e87a1ac8e0f67 We have released a temporary fix to the problem and the issue hasn't been reported again in the last 2 hours. We are still working on a more permanent fix for it. Calls are intermittently ending abruptly https://status.vapi.ai/incident/528344 Fri, 14 Mar 2025 23:01:00 -0000 https://status.vapi.ai/incident/528344#24bff532fa31c908a715e66c78889e5c0cf30803e13799822136274b3373883e # TL;DR Calls ended abruptly due to call-workers restarting themselves because of high memory usage (OOMKilled). # Timeline in PST - March 13th 3:47am: Issue raised regarding calls ending without a call-ended-reason. - 1:57pm: High memory usage identified on call-workers exceeding the 2GB limit. - 3:29pm: Confirmation received that another customer experienced the same issue. - 4:30pm: Changes implemented to increase memory request and limit on call-workers. - March 14th 12:27pm: Changes deployed. # Root Cause Call-workers exceeded Kubernetes-set memory limits, causing containers to restart unexpectedly and terminate ongoing calls. Since call-workers maintain call state internally, calls could not be recovered, leading to abrupt terminations. # Impact 1705 call-workers exceeded the 2GB memory threshold, causing 1705 abrupt call terminations. # What went poorly? - Issue identified only after user notification. - The fix required a code change rather than immediate manual intervention, delaying remediation. - Release complications delayed quick deployment. - Investigation took 10 hours, and remediation required an additional 3 hours. # What went well? - Effective communication allowed identification and planning of the fix once the issue was understood. # Remediation - Increase memory requests and limits on call-workers. - Implement monitoring for call-worker memory usage exceeding limits. - Implement monitoring for call-worker container restarts. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 SIP call failures to connect https://status.vapi.ai/incident/528384 Fri, 14 Mar 2025 21:30:00 -0000 https://status.vapi.ai/incident/528384#441962b96169530bc5443373897f6bcc87cb278af60ce158d2a09db7a8d9f630 sip.vapi.ai is not responding intermittently. We are investigating the failures and will be coming up with a fix soon. Vapi workers not connecting due to lack of workers https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 20:00:00 -0000 https://status.vapi.ai/incident/528459#9f0cebdf8fbea581001594745e40aeb6930ffdc5195fe8e277150aa4487247ef We have investigated and resolved this issue by prescaling the impacted cluster to handle a higher volume of traffic. We will update with an RCA. Calls are intermittently ending abruptly https://status.vapi.ai/incident/528344 Fri, 14 Mar 2025 19:11:00 -0000 https://status.vapi.ai/incident/528344#f0d89bb8b577775290247c41a38eb2bdfc27daab415a1848f6e21024a413c8a2 We are currently experiencing higher memory usage in our call workers, which may be causing calls to end abruptly. Our team is actively investigating and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided by 2pm PST. Investigating GET /call/:id timeouts https://status.vapi.ai/incident/528345 Fri, 14 Mar 2025 18:54:00 -0000 https://status.vapi.ai/incident/528345#cbc965b1c92d95626d9333321f508456788b495facf30de3d4f16cf7fd538ac0 Some users are experiencing timeouts in the `GET /call/:id` API endpoint. Our team is actively investigating this and working to resolve the issue promptly. We apologize for any inconvenience this may cause and appreciate your patience. Further updates will be provided shortly.
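Relating to the call-worker OOM RCA above (incident 528344), whose remediation includes monitoring call-worker memory usage against its limit: a minimal sketch of in-container memory monitoring, assuming cgroup v2 file paths. The thresholds, the real call-worker process, and its alerting pipeline are not described in the report and are purely illustrative.

```python
import time
from pathlib import Path

# cgroup v2 paths (assumption; cgroup v1 exposes memory.usage_in_bytes / memory.limit_in_bytes).
USAGE_PATH = Path("/sys/fs/cgroup/memory.current")
LIMIT_PATH = Path("/sys/fs/cgroup/memory.max")
WARN_RATIO = 0.85  # warn well before the OOM killer would fire


def memory_ratio() -> float | None:
    """Return current usage as a fraction of the cgroup limit, or None if unlimited."""
    usage = int(USAGE_PATH.read_text().strip())
    limit_raw = LIMIT_PATH.read_text().strip()
    if limit_raw == "max":
        return None
    return usage / int(limit_raw)


def watch(interval_seconds: float = 15.0) -> None:
    while True:
        ratio = memory_ratio()
        if ratio is not None and ratio >= WARN_RATIO:
            # In production this would emit a metric or page on-call instead of printing.
            print(f"call-worker memory at {ratio:.0%} of its limit")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    watch()
```

The idea is simply to surface "approaching the limit" as a signal before Kubernetes kills the container and takes live calls with it.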
Vapi workers not connecting due to lack of workers https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 14:30:00 -0000 https://status.vapi.ai/incident/528459#430d7032bf87c9757c8da5bb019c668e7f96b9f816c91d5a08739238ae9cea89 This issue resolved itself as more workers were created. We are investigating further to provide a more long-term remediation and will update. Vapi workers not connecting due to lack of workers https://status.vapi.ai/incident/528459 Fri, 14 Mar 2025 14:00:00 -0000 https://status.vapi.ai/incident/528459#4d3dca1efd58f9e48ea7a4f5e2d866f5e286d10ce53b42968380d634011102f4 Workers did not scale to meet an increase in demand, resulting in vapifault-transport-never-connected errors. sip.vapi.ai degradation https://status.vapi.ai/incident/527911 Thu, 13 Mar 2025 23:29:00 -0000 https://status.vapi.ai/incident/527911#9e9b73249c07354c8ea89829152819995319cbf932c1317ae2eb27ad99fc1888 Incident was resolved at 1:30pm PT. One of the two IPs behind sip.vapi.ai was failing to connect to an internal service, resulting in 480 errors. sip.vapi.ai degradation https://status.vapi.ai/incident/527911 Thu, 13 Mar 2025 23:18:00 -0000 https://status.vapi.ai/incident/527911#b09c5f5472e63483f719bd5758559e4812ac7c4094af1ed955e627b2eec28719 Intermittent "480 temporarily unavailable" errors while connecting calls to sip.vapi.ai. Started happening at 7am PT. We are seeing degraded service from Deepgram https://status.vapi.ai/incident/526295 Tue, 11 Mar 2025 07:59:00 -0000 https://status.vapi.ai/incident/526295#261a22bc79fd7c2fad84bbfe6138e4b16edd5bcfd404606ef247caa32dcc4c3e # TL;DR An application-level bug leaked into production, causing a spike in pipeline-error-deepgram-returning-502-network-error errors. This resulted in roughly 1.48K failed calls. # Timeline in PST * 12:03am - Rollout to prod1 containing the offending change is started * 12:13am - Rollout to prod1 is complete * 12:25am - A huddle in #eng-scale is started * 12:43am - Rollback to prod3 is started * 12:55am - Rollback to prod3 is complete # Root Cause * An application-level bug related to the Deepgram Numerals setting caused WebSocket connections to return a non-101 status code. This was masked as a pipeline-error-deepgram-returning-502-network-error error, initially leading us to believe it was a Deepgram issue. # Impact There were 1.48K pipeline-error-deepgram-returning-502-network-error errors, meaning there were 1.48K calls that failed due to this issue. # What went poorly?
* The monitor caught the issue and alerted us shortly after rollout completion * Multiple team members responded promptly, initiating a huddle in #eng-scale # Remediation * Increase sensitivity of pipeline error monitor * Investigate and resolve the application bug * Refactor Deepgram error categorization to clearly indicate non-Deepgram related issues * Refactor Canary Manager to use direct DD metrics instead of relying on monitor alerts If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 We are seeing degraded service from Deepgram https://status.vapi.ai/incident/526295 Tue, 11 Mar 2025 07:30:00 -0000 https://status.vapi.ai/incident/526295#c4473284c2a9d22a61af41282b97bf451f73ecccb3a8fc0dda4204c65518b0b1 Assistants which use Deepgram for transcription are unresponsive; consider using another transcription model. Increased call start errors due to Vapi fault transport errors + Twilio timeouts https://status.vapi.ai/incident/525770 Tue, 11 Mar 2025 02:18:00 -0000 https://status.vapi.ai/incident/525770#3cfb68f405de3c99e100d698291ab5fd1a20d0bd663dbfd76c05c64b1bddcd67 RCA: vapifault-transport-never-connected errors caused call failures Date: 03/10/2025 Summary: A recent update to our production environment increased the memory usage of one of our core call-processing services. This led to an unintended triggering of our automated process restart mechanism, resulting in a brief period of call failures. The issue was resolved by adjusting the memory threshold for these restarts. Timeline: 1. 5:50am A few calls start facing issues starting due to vapifault-transport-never-connected errors. 2. 6:40am Call failures start to increase. Partial outage of call starts. Our monitoring picked it up and paged on-call. Some Discord users and customers on Slack start reporting errors. 3. 6:55am - 7:20am Investigated causes for failures. Shifted the calls to a previous cluster, but calls were still failing. 4. 7:35am We reached an RCA on why the failures were occurring and a fix was scoped out. 5. 7:58am The hotfix was completely deployed and the failures stopped. The incident was resolved at this point. Root Cause: A recent production update increased the memory requirements of our call-processing service. As a result, an internal safeguard—designed to restart processes exceeding a set memory threshold—was activated more frequently than anticipated. Remediation: 1. Threshold Adjustment: We have increased the memory threshold that triggers a process restart to better handle higher usage. 2. Enhanced Monitoring: We are implementing additional alerts to detect similar issues earlier. 3. Process Review: We are further examining our restart protocols to reduce unnecessary service interruptions during periods of high demand. Increased call start errors due to Vapi fault transport errors + Twilio timeouts https://status.vapi.ai/incident/525770 Mon, 10 Mar 2025 15:12:00 -0000 https://status.vapi.ai/incident/525770#bd0ea476e94b68c61561f68461cdd329ef071df61cbbe6770451358982b5c7ca Issue has been patched and we are monitoring the fix. We will be following up with a detailed RCA soon. Increased call start errors due to Vapi fault transport errors + Twilio timeouts https://status.vapi.ai/incident/525770 Mon, 10 Mar 2025 14:09:00 -0000 https://status.vapi.ai/incident/525770#f3373d37418b4f898714ace58235e5e69c9b96b3030a0be3ddab2ad0e07a24c1 We are noticing increased occurrences of 31920 errors in Twilio calls.
The team is investigating and mitigating the issue. Kubernetes cluster upgrades https://status.vapi.ai/incident/524956 Sat, 08 Mar 2025 20:30:38 +0000 https://status.vapi.ai/incident/524956#2a980f2409a66927259df54aaf20afa0dfae54c6729deba48900b3c3b0b5c3e2 Maintenance completed Kubernetes cluster upgrades https://status.vapi.ai/incident/524956 Sat, 08 Mar 2025 19:00:38 -0000 https://status.vapi.ai/incident/524956#e2133bb71f134d0fc1dc4d970e2308a5d4740a52904e50666499c8d0bc628ddc We're rolling out Kubernetes cluster upgrades for security and reliability. Increased Twilio errors causing 31902 & 31920 websocket connection issues. Increase in customer-did-not-answer for twilio calls https://status.vapi.ai/incident/524526 Fri, 07 Mar 2025 22:00:00 -0000 https://status.vapi.ai/incident/524526#d72daad243290feafb670a87a6054b2a89d6bc3b144bfb74321900b990044325 We have rolled back the faulty release which caused this issue. We are monitoring the situation now. Increased Twilio errors causing 31902 & 31920 websocket connection issues. Increase in customer-did-not-answer for twilio calls https://status.vapi.ai/incident/524526 Fri, 07 Mar 2025 21:57:00 -0000 https://status.vapi.ai/incident/524526#3802245e8acbbd1e95af288fbcd78173c9ca3b3c822bfd095120e40d9d70ef30 We are investigating the problem. Vonage inbound calling is degraded https://status.vapi.ai/incident/523885 Thu, 06 Mar 2025 22:39:00 -0000 https://status.vapi.ai/incident/523885#15b88a4f0a11aa48d270d8cb3dcf3f651b77bc8075ceba18f591be9f11c1ab1a The issue was caused by Vonage sending an unexpected payload schema, causing validation to fail at the API level. We deployed a fix to accommodate the schema. Signups temporarily unavailable https://status.vapi.ai/incident/523943 Thu, 06 Mar 2025 06:00:00 -0000 https://status.vapi.ai/incident/523943#345c4f88a1afaf721140bd87566c63187d07a442f980b337b0263d52434d8c00 The API bug was reverted and we confirmed service restoration Weekly cluster at capacity limits https://status.vapi.ai/incident/523259 Wed, 05 Mar 2025 20:04:00 -0000 https://status.vapi.ai/incident/523259#2d9a05c5549176e75b24856cfcf726184f783b14c921d7e172ce90ca0db9ab1d We are seeing calls go through fine now, and are still keeping an eye out Weekly cluster at capacity limits https://status.vapi.ai/incident/523259 Wed, 05 Mar 2025 19:42:00 -0000 https://status.vapi.ai/incident/523259#0a18136a19d9a4d5c7179e291cfc7431adb366b50b555d67e3728474f14000df Resolution: we've scaled up and are monitoring Assembly AI transcriber calls are facing degradation. https://status.vapi.ai/incident/517216 Sat, 22 Feb 2025 14:17:00 -0000 https://status.vapi.ai/incident/517216#e4f28d6dc37b7bdcc0808b6ac750b3d900c972214df8169bc2014f5981c200c8 It is resolved now. It was due to an account-related problem which has since been fixed. We will be taking steps to make sure it doesn't happen again. Assembly AI transcriber calls are facing degradation. https://status.vapi.ai/incident/517216 Sat, 22 Feb 2025 13:41:00 -0000 https://status.vapi.ai/incident/517216#0ef3ae20c90812ea3363a1703e827629f3ed5d31cbe9bf4b4f4fe0105751f1b8 We're coordinating with the AssemblyAI team to fix the issue on priority. Try switching transcribers in the meantime. API returning 413 (payload too large) due to networking misconfiguration https://status.vapi.ai/incident/516890 Fri, 21 Feb 2025 19:24:00 -0000 https://status.vapi.ai/incident/516890#5684b93693f6328becfbd39f3bf4e2fa50637ead5fe0d698d27a25c54231a80d # TL;DR A change in the cluster-router networking filter caused an increase in 413 (request entity too large) errors.
API requests to POST /call, /assistant, and /file were impacted. # Timeline 1. **February 20th 9:54pm PST:** A change to the cluster-router is released and traffic is cut over to prod1. 2. **10:19pm PST:** 413 responses from Cloudflare begin appearing in Datadog logs at an increased rate. 3. **February 21st ~8:50am:** Users in Discord flag requests failing with 413 errors. 4. **9:58am PST:** The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside. # Impact - During the time of impact, POST requests to /call, /assistant, and /file failed with a 413 error code. # Root Cause - A change in the cluster-router filter added buffering of POST requests for all endpoints (previously only applied to /status, /inbound, and /inbound_call). - The Envoy filter was configured with a stream window size of approximately 65 KB, so request bodies larger than that received a 413 response. # Changes we've made - Monitor to catch 4xx and 5xx errors from Cloudflare. # Changes we will make - Improve change testing for the networking cluster. - Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch. # What went well - The cause was identified quickly by investigating changes in Cloudflare responses. # What went poorly - There was a 12-hour delay between identifying the cause and remediation due to the lack of alerts for this error. - The issue was initially flagged by the Discord community rather than through internal monitoring. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Deepgram is failing to send transcription intermittently https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 08:57:00 -0000 https://status.vapi.ai/incident/516593#09553def4bd2287c29e60d280b4c25e0f8066b92898afd2f585fc4621af22dad Deepgram has resolved the incident on their side. Back to normal. https://status.deepgram.com/incidents/wr5whbzk45mg Deepgram is failing to send transcription intermittently https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 07:26:00 -0000 https://status.vapi.ai/incident/516593#5b38a95749d968401e9f013f2124ac8ae892830434c9e4598d29ddce1e7e115a Deepgram has acknowledged the problem and is working to resolve it. More information at https://status.deepgram.com/incidents/wr5whbzk45mg Deepgram is failing to send transcription intermittently https://status.vapi.ai/incident/516593 Fri, 21 Feb 2025 06:28:00 -0000 https://status.vapi.ai/incident/516593#05d94e46b85071831a550bb5cb34bff3f136aef04cc292a9a0dcb8facd6ae433 Transcriptions are failing to generate, which causes calls to hang and end earlier than expected. Elevenlabs rate limiting and high latency https://status.vapi.ai/incident/516247 Thu, 20 Feb 2025 17:11:00 -0000 https://status.vapi.ai/incident/516247#5e47e592ce02df9adc207f6d6908f64b2cb2f9392608ebffafd5ce4ae9a84a36 11labs has confirmed that the problem has been fixed. No failures in the last 10 minutes. Resolving the incident. Here is the ElevenLabs report on the incident: https://status.elevenlabs.io/incidents/01JMJ4B025B83H28C3K81B1YS4 Elevenlabs rate limiting and high latency https://status.vapi.ai/incident/516247 Thu, 20 Feb 2025 16:55:00 -0000 https://status.vapi.ai/incident/516247#f07e914641391e35b2e3e9bbc877579aa5e18a0ecfacb2b8bfb6daa1df0f699b 11labs is having issues with their latest deployment. We're seeing high latency and rate limits. We have reached out to them and they are fixing it ASAP.
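Relating to the 413 RCA above (incident 516890), where request bodies over roughly 65 KB were rejected after a router filter change and the follow-up calls for better change testing: a hedged sketch of the kind of regression test that would catch this class of issue before a full cutover. The base URL, auth header, and payload size here are assumptions for illustration, not Vapi's actual test setup.

```python
import requests

# Assumed values for illustration; a real check would target a staging or canary environment.
BASE_URL = "https://staging.example.com"
API_KEY = "test-key"


def test_large_payload_not_rejected_with_413() -> None:
    """POST a body comfortably above 65 KB and assert the edge does not return 413."""
    big_field = "x" * (128 * 1024)  # ~128 KB payload
    resp = requests.post(
        f"{BASE_URL}/assistant",
        json={"name": "size-check", "notes": big_field},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    assert resp.status_code != 413, f"edge rejected large body: {resp.status_code}"


if __name__ == "__main__":
    test_large_payload_not_rejected_with_413()
```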
ElevenLabs Rate Limiting https://status.vapi.ai/incident/515657 Wed, 19 Feb 2025 19:43:00 -0000 https://status.vapi.ai/incident/515657#807d679eb130611fc52e0e2f90d53d2e0c16fcc7023f5030cb299773437a2f35 ElevenLabs is imposing rate limits which will impact Vapi users who have it configured as their voice model. We are working to resolve this issue, but users can restore service by switching to Cartesia or using their own API key. API is degraded https://status.vapi.ai/incident/504402 Thu, 30 Jan 2025 11:44:00 -0000 https://status.vapi.ai/incident/504402#136d91d36a6f9b1b860a2e8d2c5021012376e43622bd60bb616f96669e11b5cb ## TL;DR The API experienced intermittent downtime due to choked database connections and subsequent call failures caused by the database running out of memory. A forced deployment using direct connections and capacity adjustments restored service. ## Timeline 2:09AM: Alerts triggered for API unavailability (503 errors) and frequent pod crashes. 2:40AM: A switch to a backup deployment showed temporary stability, but pods continued to restart and out-of-memory errors began appearing. 3:27AM: A forced deployment was initiated on the primary environment using direct database connections; the database team was notified. 3:42AM: The database was restarted and traffic was rerouted, leading to improved service health. 3:50AM: The database’s capacity was increased and the service stabilized fully. ## Impact The API experienced multiple intermittent outages. Calls were affected due to the database running out of memory, with thousands of calls and jobs left in an active or stuck state. ## Root Cause Choked database connections due to a spike in aborted request errors led to failing health checks, which in turn caused API pods to restart continuously. The database ran out of memory—not because of sheer volume alone, but due to a misconfiguration (insufficient max_locks_per_transaction), which was exacerbated by a thundering herd of requests. ## Changes we've made Increase Capacity: Boost the database’s capacity. Adjust Configuration: Raise the max_locks_per_transaction setting. Cleanup Operations: Remove stuck pods and clear active call jobs from the affected environment. Enhance Monitoring and Deployment: Improve alerting for database health and reduce urgent deployment times from ~15 minutes to ~5 minutes. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 API is degraded https://status.vapi.ai/incident/504402 Thu, 30 Jan 2025 11:30:00 -0000 https://status.vapi.ai/incident/504402#b94b5375dcc2589a74263ce37af8f884c984f37b0d7de97d1b264614e868ae74 We suspect another Supabase DB issue and are remediating ASAP. SIP cluster scaling https://status.vapi.ai/incident/504040 Thu, 30 Jan 2025 07:00:00 +0000 https://status.vapi.ai/incident/504040#b99cf24ddb4b7e71c8cea3df384fdd00e973c1d12258c615b7a4d2db8fe80b62 Maintenance completed SIP cluster scaling https://status.vapi.ai/incident/504040 Thu, 30 Jan 2025 04:00:00 -0000 https://status.vapi.ai/incident/504040#59b705687ec6a76e027862f48cdf0c749d83073bd5d9574279813ba4e7b5670c We will be retrying our deployment of the SIP cluster to make sure we are ready for upcoming scale. There might be some minor disruptions with respect to connecting SIP calls, but we will be closely monitoring the situation and will complete the migration swiftly.
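For the API degradation RCA above (incident 504402), where an undersized `max_locks_per_transaction` contributed to the database running out of memory: a small sketch of how that setting can be inspected with psycopg2. Note that raising it via `ALTER SYSTEM` only takes effect after a PostgreSQL restart. The DSN is a placeholder, not Vapi's configuration.

```python
import psycopg2

# Placeholder DSN; point this at the database you want to inspect.
DSN = "postgresql://user:password@db.example.com:5432/postgres"


def show_lock_setting() -> int:
    """Read the current max_locks_per_transaction value."""
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("SHOW max_locks_per_transaction;")
            (value,) = cur.fetchone()
            return int(value)


if __name__ == "__main__":
    print(f"max_locks_per_transaction = {show_lock_setting()}")
    # Raising it requires superuser access plus a restart, e.g. (run manually, not here):
    #   ALTER SYSTEM SET max_locks_per_transaction = 256;
    #   -- then restart PostgreSQL for the change to apply
```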
API is down https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 17:24:00 -0000 https://status.vapi.ai/incident/503892#1f242188048931e214c465d4cb239d44c01ee8886ab3c4abf472cdb6bb3bdc24 ## TL;DR A failed deployment by Supabase of their connection pooler, Supavisor, in one region caused all database connections to fail. Since API pods rely on a successful database health check at startup, none could start properly. The workaround was to bypass the pooler and connect directly to the database, restoring service. ## Timeline 8:08am PST, Jan 29: Monitoring detects Postgres errors. 8:13am: The provider’s status page reports a failed connection pooler deployment. (Due to subscription issues, the team wasn’t immediately notified.) 8:18am: The API goes down. 8:22am: Temporary API recovery occurs as some non-pooler-dependent requests succeed. 8:25am: The API fails again; the incident response team assembles. 8:28am: Investigation reveals API pods are repeatedly restarting. 8:30am: It’s determined that database call failures are triggering the pod restarts. 8:36am: Support confirms that a connection pooler outage in the region is affecting service. 8:38am: A call with support leads to the decision to use direct database connections. 8:44am: A change is deployed to bypass the pooler. 9:12am: The API begins to recover as calls start succeeding. 9:19am: Full service is restored. ## Impact The API was down for 54 minutes, with all calls failing due to reliance on the provider’s system for tracking and organization data. While some API requests not dependent on the pooler continued working, new API pods entered crash loops because their health checks (which made database requests) failed. Database operation failures led to call processing hanging, causing errors that prevented proper job closure. ## Root Cause A failed connection pooler deployment disrupted all database connections. This affected API operations that depended on those connections, leading to cascading failures and hanging processes. ## Changes we've made Reduce Deployment Time: Shorten backend update runtimes to under five minutes. Switch to Direct Connections: Use direct database connections exclusively to avoid pooler issues. Increase Connection Capacity: Boost the number of direct connections available to handle higher loads. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 API is down https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 17:05:00 -0000 https://status.vapi.ai/incident/503892#67ee5d8da81b9643dfa903c9f5c4840de7ce50133bf5244c11690fa4070900f9 We've rolled out direct connections to the database for now. Calls are going through. We're waiting on Supabase to confirm a fix to resolve the outage. API is down https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 16:35:00 -0000 https://status.vapi.ai/incident/503892#a9c24ee500d434bc4397fd4e80e6716c00f6e9e0c4bcdda602cc4cd0a51f1a18 We are impacted by the Supabase outage (https://status.supabase.com). We are working with their team to get it resolved ASAP. API is down https://status.vapi.ai/incident/503892 Wed, 29 Jan 2025 16:28:00 -0000 https://status.vapi.ai/incident/503892#8306d671337d3a5c11868e4b9ba9a313a2c33358eb22e9f00ceca254765c5442 API is down. We're investigating. Updates to follow.
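For the outage above (incident 503892), where the workaround was to bypass the failed connection pooler and connect directly to the database: a hedged sketch of a connection helper that prefers the pooled DSN and falls back to a direct one. The environment variable names are hypothetical and not taken from Vapi's configuration.

```python
import os

import psycopg2

# Hypothetical environment variables holding the two DSNs.
POOLED_DSN = os.environ.get("DATABASE_POOLED_URL", "")
DIRECT_DSN = os.environ.get("DATABASE_DIRECT_URL", "")


def connect_with_fallback(connect_timeout: int = 3):
    """Try the pooler first; if it is unreachable, fall back to a direct connection."""
    if POOLED_DSN:
        try:
            return psycopg2.connect(POOLED_DSN, connect_timeout=connect_timeout)
        except psycopg2.OperationalError:
            # Pooler outage (as in this incident): fall through to the direct DSN.
            pass
    return psycopg2.connect(DIRECT_DSN, connect_timeout=connect_timeout)


if __name__ == "__main__":
    conn = connect_with_fallback()
    print("connected:", conn.dsn)
    conn.close()
```

A short connect timeout matters here: without it, a hanging pooler would stall every new connection attempt instead of failing over quickly.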
Updates to DB are failing https://status.vapi.ai/incident/499408 Tue, 21 Jan 2025 13:23:00 -0000 https://status.vapi.ai/incident/499408#e43fc033799893ee1b78d0061322662622e03b356527efd0775243d38531c882 ## TL;DR A configuration error caused the production database to switch to read-only mode, blocking write operations and eventually leading to an API outage. Restarting the database restored service. ## Timeline 5:03:04am: A SQL client connected to the production database via the connection pooler, which inadvertently set the database to read-only. 5:05am: Write operations began failing. 5:18am: The API went down due to accumulated errors. ~5:23am: The team initiated a database restart. 5:25am: The database restarted. 5:33am: Service was fully restored. ## Impact Write operations were blocked for 30 minutes. The API experienced a 15-minute outage. ## Root Cause A direct connection from a SQL client, configured in read-only mode, propagated this setting across all sessions through the connection pooler. This disabled updates, inserts, and deletes, eventually leading to API failure. ## Changes we've made Disable Replication Jobs: Halt the replication jobs suspected of triggering the issue. Escalate Support: The support case is escalated to the relevant team with a 24-hour follow-up. Enhance Auditing: Enable and configure detailed audit logging (DDL and role operations) to help trace future incidents. Restrict Direct Access: Eliminate direct production database connections by updating the access credentials. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Updates to DB are failing https://status.vapi.ai/incident/499408 Tue, 21 Jan 2025 13:20:00 -0000 https://status.vapi.ai/incident/499408#34d8df43dcb6f4c23dab516c6effd53cf806c7c1e635b7e0ca2e253b519ae1e7 We are investigating. Calls not connecting for `weekly` channel https://status.vapi.ai/incident/495219 Mon, 13 Jan 2025 16:49:00 -0000 https://status.vapi.ai/incident/495219#e14452cba94b1c38d70b72a693968c3f0abcb60283c41c9170db510f76f085aa TL;DR: The scaler failed and we didn't have enough workers ## Root Cause During a weekly deployment, Redis IP addresses changed. This prevented our scaling system from finding the queue, leaving us stuck at a fixed number of workers instead of scaling up as needed. We resolved the issue by temporarily moving traffic to our daily environment. ## Timeline Jan 11, 5:12 PM: Deploy started Jan 13, 6:00 AM: Calls started failing due to scaling issues Jan 13, 8:45 AM: Resolved by moving traffic to daily Jan 13, 11:00 AM: Full service restored ## Changes We've Implemented - Load testing on every deploy - Added better monitoring for scaling errors If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Calls not connecting for `weekly` channel https://status.vapi.ai/incident/495219 Mon, 13 Jan 2025 16:31:00 -0000 https://status.vapi.ai/incident/495219#455b593ec476790a7234d214344305df16bacc3066d9e9fa71992d0e71d0d1ef We're investigating. We'll update ASAP. DB resizing, 5m of downtime expected. https://status.vapi.ai/incident/451110 Sat, 23 Nov 2024 20:15:00 +0000 https://status.vapi.ai/incident/451110#231975c9585675d1fef413f151e3884dbe1f390a529e44e429c6143c8748be01 Maintenance completed DB resizing, 5m of downtime expected.
https://status.vapi.ai/incident/451110 Sat, 23 Nov 2024 20:00:00 -0000 https://status.vapi.ai/incident/451110#389ad992f9fae1583f3b19f5ab3ee645be8274364c15d70ffd0a218159d8974c We need to resize the DB to handle increased load. 5m of downtime is expected. ElevenLabs is degraded https://status.vapi.ai/incident/461672 Thu, 14 Nov 2024 21:08:00 -0000 https://status.vapi.ai/incident/461672#a9409160a215a68ccfa34b6c8881157c5db7d2d3d3c5d71a42f933cc24408ed9 Should be back to normal now as per 11labs. https://status.elevenlabs.io/ ElevenLabs is degraded https://status.vapi.ai/incident/461672 Thu, 14 Nov 2024 21:01:00 -0000 https://status.vapi.ai/incident/461672#5e9594e039c916003c09d13449bc547802604d32ed7dc1e070c9836b351a05d2 11labs is suffering degradation with high latency on their API. We have contacted them and they are looking into it with urgency. You can also directly track the progress at https://status.elevenlabs.io API is degraded https://status.vapi.ai/incident/460351 Tue, 12 Nov 2024 22:15:00 -0000 https://status.vapi.ai/incident/460351#9db9f41907e6e23f2ac6c2117ce988d11e409c966cd6b8437d6d0f01ca428c5d TL;DR: API pods were choked. Our probes missed it. ## Root Cause Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods. ## Timeline - November 12th 2:00pm PT - Customer reports of API failures - November 12th 2:05pm PT - On-call team determined the cause and scaled and restarted pods - November 12th 2:10pm PT - Full functionality restored. ## Changes we've implemented 1. Restored higher sensitivity thresholds for our monitoring systems 2. Currently investigating underlying database connection management If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 API is degraded https://status.vapi.ai/incident/460351 Tue, 12 Nov 2024 22:12:00 -0000 https://status.vapi.ai/incident/460351#3dce639499c16ad032e29ffa6b00bbe0c381c344596267cff55580ca4d466e13 Seeing long connection times. Investigating. Phone calls are degraded https://status.vapi.ai/incident/459737 Tue, 12 Nov 2024 01:03:00 -0000 https://status.vapi.ai/incident/459737#3b39922cae2f46737f3d1d1b7bfda3cc1fa593f24115d70f1b4896ac36774028 TL;DR: API gateway rejected WebSocket requests ## Summary On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters. ## Impact - Duration: 43 minutes - Affected services: WebSocket-based phone calls - System returned 404 errors for affected connections - Service was fully restored by routing traffic to our backup cluster ## Root Cause The incident occurred due to a control plane issue in our API gateway that attempted to reload plugin configurations. Due to an expired authentication token, this reload failed, causing the WebSocket routing system to enter a degraded state. ## Timeline 4:22 PM PST - Initial service degradation began 4:53 PM PST - Issue identified through customer reports 5:05 PM PST - Full service restored by failing over to backup cluster ## Changes we've implemented 1. Fixed the underlying control plane issue that triggered unnecessary plugin reloads 2. Implemented authentication token rotation to prevent credential expiration issues 3.
Enhanced monitoring systems to improve detection of WebSocket routing failures If you enjoy realtime distributed systems, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 Phone calls are degraded https://status.vapi.ai/incident/459737 Tue, 12 Nov 2024 00:58:00 -0000 https://status.vapi.ai/incident/459737#8b0f970cab11a1f0085c657533e5d889c5e36862f92d74ab8921d18a86fb49ec We're investigating. API is down https://status.vapi.ai/incident/457863 Fri, 08 Nov 2024 02:11:00 -0000 https://status.vapi.ai/incident/457863#7144b4a70055742ee804f7994dce08b8c16d521629133deb93cc1ea2514e6178 Misconfiguration on the networking cluster. Resolved now. Here's what happened: ## Summary On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected. ## Impact - Duration: 11 minutes - Service returned 521 errors for new inbound API calls - Existing API calls remained stable - Service was fully restored at 6:10 PM PT ## Root Cause The incident occurred when a configuration intended for our staging environment was accidentally applied to production during a routine debugging session. This resulted in the deletion of a critical API gateway configuration. ## Timeline - 5:59 PM PT - Accidental deletion of production configuration during staging environment debugging - 6:00 PM PT - Monitoring systems detected service degradation - 6:08 PM PT - Engineering team identified root cause - 6:09 PM PT - Fix deployed (configuration restored) - 6:10 PM PT - Full service recovery confirmed ## Changes we've implemented 1. Changing the namespace to include the cluster name: `networking` > `networking-staging` and `networking-production`. This forces you to specify the environment while running kubectl commands. 2. Preventing deletion of resources that would never be expected to be deleted, using a Kubernetes deletion webhook. If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5 API is down https://status.vapi.ai/incident/457863 Fri, 08 Nov 2024 02:09:00 -0000 https://status.vapi.ai/incident/457863#31f1050e22c42f37bf5a3118b23074143d847dee4bba91ca41696c5a6d43dbe0 API is down. We're investigating. Updates to follow. Cartesia is down, please use another Voice Provider in the meanwhile https://status.vapi.ai/incident/449475 Wed, 23 Oct 2024 18:08:00 -0000 https://status.vapi.ai/incident/449475#a048958b394382a6948653e5a0da2ce63ed8cfb2b9572c932762a263d567bdd1 Back to normal. You can follow the updates here: https://status.cartesia.ai. Cartesia is down, please use another Voice Provider in the meanwhile https://status.vapi.ai/incident/449475 Wed, 23 Oct 2024 17:35:00 -0000 https://status.vapi.ai/incident/449475#6e89fd28bad5112134bd607c9be1fe0c9a3f2ce957444de43d7f19f194e8f3cb *We're working on automated fallbacks for this scenario, but for now please manually switch your assistants.* Latest update from the Cartesia team: > We're currently experiencing an outage in our API due to our infrastructure provider Together being down. We'll update you as soon as possible when it's back up. Please check out and subscribe to our status page for future updates: https://status.cartesia.ai/.
Web call creation is degraded https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 20:04:00 -0000 https://status.vapi.ai/incident/448891#8bd33bda746cf3495569059ac8f4b9192f929f3a20c1cf668b1ba90732accefc We haven't seen an error in the last 15 minutes, so we're resolving for now. This will be updated if anything changes.

Web call creation is degraded https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 20:02:00 -0000 https://status.vapi.ai/incident/448891#33139297b29ef6547bb76940c2b7b59c7ec34c9f2d953750e23ff3609e38f999 Web call creation is mostly restored. From the Daily team: > API error levels have decreased considerably, but we're still working on full remediation. More updates to come.

Web call creation is degraded https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 19:46:00 -0000 https://status.vapi.ai/incident/448891#3bce8babb993b25d4510f3964bd3178b43224a98eca7cf1f5118f60ddfb66cff The Daily.co team is continuing to investigate. The issue has been tracked down to AWS Aurora DB and they're working with the AWS team.

Web call creation is degraded https://status.vapi.ai/incident/448891 Tue, 22 Oct 2024 18:50:00 -0000 https://status.vapi.ai/incident/448891#5ef334124f8221c53170244e423b0653f1baa465ca32b20da07fdbca2c6e65fd Daily.co is experiencing degradation (status.daily.co). Latest update: > One of our databases is being unexpectedly slow. We started getting alarms about it right about the same time you started seeing problems. We're in the process of posting about it on the status site. We'll share more shortly! We'll share more updates as we have them. As a workaround, we recommend creating a Phone Number in dashboard.vapi.ai and directing users to call that number to reach the Assistants instead.

Deepgram is degraded, please switch to Gladia or Talkscriber https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:32:00 -0000 https://status.vapi.ai/incident/446871#814a251796a69d7c0a88dc154bd688f49791dc18def4d7b48aff50645d402eed Deepgram was fully restored at 8:32am, ending a roughly two-hour degradation.
Summary: **Deepgram was degraded from ~6:12am PT to ~8:32am PT** (status.deepgram.com). Their main datacenter fell over and they routed traffic to their AWS fallback, but latencies on their streaming endpoint were still extremely high (>10s). Ideally, this degradation shouldn't have happened: it's our job to ensure we have fallbacks that mitigate third-party risks in real time. As an **immediate action item**, we're bringing standby on-prem Deepgram back into our clusters, which would have let us cut this degradation down to a couple of minutes.
-------------
**To give more detail**: We used to run Deepgram on-prem, which gave us control over any changes to the transcription model. Unfortunately, we phased that out a couple of months ago because we saw better performance from their SaaS service:
1. They run on better GPUs, including H100s (and soon H200s). AWS limits the GPUs we can get, and scaling is unpredictable.
2. They are continually upgrading their Nvidia inference stack, including proprietary optimizations.
3. They ship continual updates and bug fixes to their SaaS offering, compared to monthly updates for on-prem.
This degradation, alongside another from ElevenLabs earlier in the week (status.elevenlabs.io), has made it clear we need to prioritize redundancy further.
1. We need a tiered approach to falling back for every piece of the stack.
2. We do this well with `assistant.model`, but `assistant.voice` and `assistant.transcriber` need it too.
3. This need will only get more acute as speech-to-speech models become a single point of failure.
4. We've been cautious with automated fallbacks because of how complex they are to get right (picking up exactly where the failure happened, etc.). But it's now clear that, given our positioning as an orchestrator and critical infrastructure, we bear final accountability.
Reliability is our #1 priority, and this incident only makes us more committed to prioritizing it above all else.
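To illustrate the tiered-fallback idea described above, here is a minimal sketch of a transcriber fallback chain. It is not Vapi's implementation: the `Transcriber` type, the provider ordering, and the per-provider timeout are placeholder assumptions, and a production version would also need to resume mid-call from wherever the failed provider left off.

```ts
// A sketch of a tiered transcriber fallback, under the assumptions noted above.
type Transcriber = {
  name: string;
  transcribe: (audio: Uint8Array, signal: AbortSignal) => Promise<string>;
};

async function transcribeWithFallback(
  audio: Uint8Array,
  tiers: Transcriber[],        // e.g. primary provider first, backups after
  perProviderTimeoutMs = 3000, // treat very slow providers (>10s here) as failed
): Promise<string> {
  let lastError: unknown;
  for (const tier of tiers) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), perProviderTimeoutMs);
    try {
      // First tier that answers within the budget wins.
      return await tier.transcribe(audio, controller.signal);
    } catch (err) {
      lastError = err;
      console.warn(`${tier.name} failed or timed out, falling back to next tier`);
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error(`all transcriber tiers failed: ${String(lastError)}`);
}
```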
Deepgram is degraded, please switch to Gladia or Talkscriber https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:20:00 -0000 https://status.vapi.ai/incident/446871#ebf621aac7ba5b3d79efe47834dd187ec3ac1eeb3530801dc3c87deebcaa8892 We have received an update from Deepgram that their main datacenter (S31) is back up. They expect ~20 more minutes of backlogged batch transcription work, and then things should be back to completely normal.

Deepgram is degraded, please switch to Gladia or Talkscriber https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 15:03:00 -0000 https://status.vapi.ai/incident/446871#6f3ee01b742bc6911ce1a5bbf27feda9b88600612032a858d7ec44ec9066095c Deepgram is still degraded. We're still waiting on Deepgram for more accurate estimates and information. Meanwhile, we're spinning up a new cluster with on-prem Deepgram, but it will take ~30 minutes to come up.

Deepgram is degraded, please switch to Gladia or Talkscriber https://status.vapi.ai/incident/446871 Fri, 18 Oct 2024 13:31:00 -0000 https://status.vapi.ai/incident/446871#6371eeb630091a8959d81a093976e1a1429f6f03c0bec78a8af79007dbe2b7ee Deepgram is extremely degraded (https://status.deepgram.com). Please switch to Gladia or Talkscriber in the meantime. We're spinning up remediations on our side, too.

DB partitioning Saturday afternoon. https://status.vapi.ai/incident/442681 Sat, 12 Oct 2024 22:09:12 -0000 https://status.vapi.ai/incident/442681#dd5bebb37dd51e8e4d03ac3c3c73c1464df20556737b37107cc145c04bb87c62 We're partitioning our biggest table, `call`. We expect this to be zero downtime but want to be communicative.

DB partitioning Saturday afternoon. https://status.vapi.ai/incident/442681 Sat, 12 Oct 2024 21:05:00 +0000 https://status.vapi.ai/incident/442681#1d74ba054365d381cdc7f70f1c0d57e354b3b50edbc3971fed37642d5aa9f3d6 Maintenance completed

DB partitioning Saturday afternoon. https://status.vapi.ai/incident/442681 Sat, 12 Oct 2024 21:00:00 -0000 https://status.vapi.ai/incident/442681#dd5bebb37dd51e8e4d03ac3c3c73c1464df20556737b37107cc145c04bb87c62 We're partitioning our biggest table, `call`. We expect this to be zero downtime but want to be communicative.
API is degraded https://status.vapi.ai/incident/441937 Wed, 09 Oct 2024 16:24:00 -0000 https://status.vapi.ai/incident/441937#e897c16a4eb11a42ab52540f7cf9763c61573f0947de95294bb883df6db36b41 We're back. RCA:
* At 9:15am PT: We were alerted by a big spike in `request aborted` errors.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long, blocking other requests).
* By 9:25am: We scaled and restarted the API pods. Everything returned to normal.
Action Items:
* We'll be setting a hard query timeout and returning a timeout error on queries that exceed it, e.g. GET /assistant?limit=1000 (statement_timeout).
* We'll be making API pods aware of the health of their own DB connection, so they can restart gracefully.
* We'll be lowering how long each API pod can hold a DB connection so no pod can monopolize connection time (idle_timeout).

API is degraded https://status.vapi.ai/incident/441937 Wed, 09 Oct 2024 16:18:00 -0000 https://status.vapi.ai/incident/441937#27b84d941a0a6a1edec00a9388db184a174bd3579a06844474154ed471fe6047 We're investigating.
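As a rough illustration of the statement_timeout and idle_timeout action items above, here is what those guards might look like with Postgres.js, the client mentioned elsewhere on this page. The values and the `assistant` query are placeholders, and passing `statement_timeout` as a startup parameter via the `connection` option is an assumed configuration style, not a description of how the API is actually wired.

```ts
import postgres from "postgres";

// Sketch of connection-level guards: bounded statements and bounded idle time.
const sql = postgres(process.env.DATABASE_URL!, {
  max: 10,             // cap connections held by each API pod
  idle_timeout: 20,    // seconds an idle connection may sit before being released
  connect_timeout: 10, // seconds to wait when establishing a connection
  connection: {
    // Server-side guard: abort any statement running longer than 5s, so an
    // unbounded GET /assistant?limit=1000-style query fails fast instead of
    // blocking the pod's other requests (head-of-line blocking).
    statement_timeout: "5000",
  },
});

// Example of a bounded list query under these settings.
export async function listAssistants(orgId: string, limit = 100) {
  return sql`
    SELECT * FROM assistant
    WHERE org_id = ${orgId}
    ORDER BY created_at DESC
    LIMIT ${limit}
  `;
}
```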
API is degraded https://status.vapi.ai/incident/441705 Wed, 09 Oct 2024 09:27:00 -0000 https://status.vapi.ai/incident/441705#99820a90fbacf814523ce7bd8a584920f9dfaa5dc5055e674dcb3617489acc1b Everything is back up for now. Here's what happened:
* At 2:05am PT: We were alerted by Datadog of `cannot execute UPDATE in a read-only transaction` errors.
* By 2:15am: We determined it was an unhealthy pooler state and restarted the DB to force-reset all connections.
* By 2:25am: We were back up.
We have several hypotheses about how the pooler session state got mangled. We're tracking them down right now.
UPDATE: We spent several days going back and forth with Supabase on why our DB was put in read-only mode. They didn't have a concrete answer either; our collective best guess is transaction wraparound.

API is degraded https://status.vapi.ai/incident/441705 Wed, 09 Oct 2024 09:11:00 -0000 https://status.vapi.ai/incident/441705#ab6f40615b3caec13f486b5e52738969a38d1d4132087b9531c60d9e12e0cedc We're investigating and will have more to share soon. For now, write paths appear to be completely down with the errors `cannot execute INSERT in a read-only transaction` and `cannot execute UPDATE in a read-only transaction`, while read paths are going through.

API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 19:00:00 -0000 https://status.vapi.ai/incident/438296#c9b30b9c231d87c99df9a73d69f00c40098b61ed7e61175574868a93175cafbf # Post-mortem
## TL;DR
Human error on our end left us index-less on our biggest table, `call`, driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake on our part (the engineering team) in planning out the migration. We're sorry; we seek to do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large-scale real-time systems before to ensure this never happens again.
## Background Timeline
1. Our Postgres DB CPU usage has been steadily increasing due to scaling pressure. Until recently, scaling the PG resources and adding simple indexes had worked, but that approach reached its limits, causing the Sept 24th outage. To be specific, while scaling resources lets PG handle an increased volume of requests, each request is still slow, bounded by how fast the CPU can move data to RAM. This means each request holds its PG connection for a longer period, increasing the chances of connection starvation and lock contention.
2. We initiated a project to understand our query bottlenecks and find better patterns to scale from here on: sharding, partitioning, compound indexes, and OLAP warehousing for analytics.
3. Through this project, we found that our biggest table is `call` and, as expected, list and aggregation queries on it were consuming the majority of CPU time. We sought to add a compound index on `org_id` and `created_at` to speed them up, since they followed the structure `SELECT ... FROM call WHERE org_id=X ORDER BY created_at DESC`.
4. We issued `CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx ON call USING BTREE (org_id, created_at DESC)` on Oct 1st at 10pm PT through the Supabase SQL editor.
5. Seeing the index reported as successfully created in the Supabase UI, on the morning of Oct 2nd at 9am we dropped the simple index on (org_id) to nudge PG toward our compound index. (See remediations.)
6. At 9am PT, our DB CPU usage spiked to 100%, causing API request timeouts and a thundering herd as Kubernetes tried to restart unhealthy pods.
## Incident Response
1. At 9:05am PT, after being paged about the degradation, we did not yet realize that the steps above had caused it and began investigating. (See remediations.)
2. By 9:15am PT, per our incident response playbook, we were on our backup cluster, but that didn't help and the degradation was getting worse as the backlog of requests in the API pods deepened. We moved our investigation to the DB and noticed the spike in CPU usage.
3. By 9:30am, in an attempt to reduce CPU usage, we released a change to disable some of our aggregation queries that were causing most of the load. It became clear that didn't help.
4. By 9:45am, we discovered that step #4 from the timeline had in fact failed and the underlying index was `INVALID`. We were index-less on our biggest table, `call`.
5. By 10am, we had rebuilt the index and restored the system. As a precautionary measure, we're keeping analytics queries disabled until we've fully sorted out our DB scaling.
## Remediations and Reflections
1. As is clear from timeline #5 and incident response #1, this degradation fundamentally happened because we didn't realize our migration could fail, and it did fail. This was one of our "unknown unknowns". The solution is to seek out a PG expert who has done these scaling migrations multiple times before and can help us bridge our unknown unknowns through first-hand knowledge of the different failure modes. We're on it and already have a couple of leads.
2. Secondly, it was a big tactical mistake on our part to run the migration at 9am PT, right before peak time. Increasing pressure on the DB created urgency and clouded proper planning. We're sorry. We're implementing better procedures to analyze the potential impact of a change and the ease of rollback before pushing it out; the kind of type 1 vs. type 2 decision framing that's common in business strategy. This is being helped by finding experts in different aspects of scaling that we as the engineering org can tap into, similar to remediation #1.
3. Lastly, we take infrastructure reliability extremely seriously and are really sorry about this error on our part. If you or someone you know is obsessed with infrastructure reliability, we'd love to chat. You can find our JD here: https://www.ycombinator.com/companies/vapi/jobs/BnVHTaQ-founding-senior-engineer-infrastructure
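One concrete guard against the failure mode in this post-mortem: `CREATE INDEX CONCURRENTLY` leaves an `INVALID` index behind when it fails, even though a UI may still list the index as created. A small pre-flight check like the sketch below (assuming Postgres.js as the client; the function and index names are illustrative) verifies validity in `pg_index` before any old index is dropped.

```ts
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Refuse to proceed with dropping the old index unless the new one is valid.
export async function assertIndexValid(indexName: string): Promise<void> {
  const rows = await sql`
    SELECT i.indisvalid
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    WHERE c.relname = ${indexName}
  `;
  if (rows.length === 0) {
    throw new Error(`index ${indexName} does not exist`);
  }
  if (!rows[0].indisvalid) {
    throw new Error(
      `index ${indexName} is INVALID; rebuild it (e.g. REINDEX INDEX CONCURRENTLY) before dropping the old index`,
    );
  }
}

// Usage (illustrative): await assertIndexValid("call_org_id_created_at_idx");
```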
API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 17:00:00 -0000 https://status.vapi.ai/incident/438296#5e33e90dd575375b44cd8e60279d0514cf028f3f92f601f009f94538fc0d12be The system is back up, barring analytics. Post-mortem to follow soon.

API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:59:00 -0000 https://status.vapi.ai/incident/438296#9a3d080400a3235d6c111ac64adffb31f1bebb5bda14c66140fbabf02ae800a2 We have identified the bottleneck. The system is recovering and we're continuing to monitor.

API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:41:00 -0000 https://status.vapi.ai/incident/438296#3b0a8776ea545499a55385ca1d6c2cf3c960253f5b211f947fa4f2cc634eee30 DB expanded, but CPU is still maxed out; continuing to investigate.

API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:38:00 -0000 https://status.vapi.ai/incident/438296#024eddb63a4acc01c6f3289d81736ce2c33cd79c8fccddae0ad724c7ee68fba3 We're expanding DB resources to resolve the CPU spike and bottleneck. Complete downtime for the next 2 minutes. Post-mortem to follow soon.

API is degraded https://status.vapi.ai/incident/438296 Wed, 02 Oct 2024 16:15:00 -0000 https://status.vapi.ai/incident/438296#7018fca7e151d3eb7d3f8dd0ab5a4fc9b2c5dc21e0282a2a5df9181321707a3d API is experiencing degraded performance, including timeouts when starting calls.

API is degraded https://status.vapi.ai/incident/434239 Tue, 24 Sep 2024 20:48:00 -0000 https://status.vapi.ai/incident/434239#1bee4ababc754ed25b78df37cab182c75b3839f687621c90cfbb9f1d05cb076f We have identified the root cause of the issue and deployed a fix. Everything is good now. Here's what happened:
1. Most of our API pods' DB pooler connections became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our uptime bot but was not (see remediations below).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (`api.vapi.ai`) was back to full capacity in 13 minutes.
5. With production in the clear, we moved on to root cause analysis on the abandoned cluster.
6. It's unclear what triggered the deadlock simultaneously on multiple pods, but our best guess is something on our DB provider's side (Supabase).
7. It's also possible that one pod deadlocked and caused additional load on the others, triggering the same deadlock mechanism elsewhere.
8. Our last hypothesis was a client-side library bug (Postgres.js), but it's unclear why that would trigger on all pods simultaneously.
9. Either way, we had enough data to build remediations and prevent another incident of this kind.
Remediations:
1. Within our Kubernetes health checks for the API pods, we are adding a dummy query, `SELECT now()`, to actually check the viability of the connection.
2. This does add the risk of API pods becoming completely unresponsive in the case of a DB outage, but that's acceptable since the DB being down would be the clear root cause in that scenario.
3. With this check in place, Kubernetes will take pods with a non-viable connection out of rotation and restart them, preventing a partial or full outage.
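As a rough sketch of remediation #1 above (not Vapi's actual service code; the port, path, and 2-second budget are placeholder assumptions), a readiness endpoint can run the dummy query over the pod's own pooled connection so Kubernetes restarts any pod whose connection is no longer viable:

```ts
import http from "node:http";
import postgres from "postgres";

const sql = postgres(process.env.DATABASE_URL!);

// Readiness endpoint: succeeds only if a trivial query completes quickly over
// this pod's DB connection. A deadlocked pooler connection makes the probe
// fail, so Kubernetes pulls the pod out of rotation and restarts it.
http
  .createServer(async (req, res) => {
    if (req.url !== "/healthz") {
      res.writeHead(404).end();
      return;
    }
    try {
      await Promise.race([
        sql`SELECT now()`,
        new Promise((_resolve, reject) =>
          setTimeout(() => reject(new Error("db check timed out")), 2000),
        ),
      ]);
      res.writeHead(200).end("ok");
    } catch {
      res.writeHead(503).end("db connection not viable");
    }
  })
  .listen(8080);
```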
API is degraded https://status.vapi.ai/incident/434239 Tue, 24 Sep 2024 20:23:00 -0000 https://status.vapi.ai/incident/434239#c03bdabb157ba0ee99cd5a5a402b5b4147504c3e53ca28ceeae79594de893c1a Requests to the API are experiencing higher latency, including timeouts for 30-40% of requests, resulting in a partial downtime. This includes requests to start calls. We're investigating ASAP.

Call transfers are degraded https://status.vapi.ai/incident/413839 Wed, 14 Aug 2024 13:30:00 -0000 https://status.vapi.ai/incident/413839#fadc9cdecc33bd81a387b8720867bed78c173345250f1fb9bc2955c7fb5fccbf We have identified the root cause of the issue and a fix has been deployed. The cause was an edge case that triggered an infinite loop on `tool.messages`. We also had a secondary issue that delayed resolution: usually we're able to move to our backup cluster with the last known working state right away, but we had unknowingly hit our AWS account limits, so the backup cluster couldn't scale to handle full volume. It took some time to get hold of AWS and obtain more quota. We're auditing our AWS service quotas and setting up alerts on them.

Call transfers are degraded https://status.vapi.ai/incident/413839 Wed, 14 Aug 2024 12:30:00 -0000 https://status.vapi.ai/incident/413839#b29490d785d5936107b3355dfadc1a7dcbc8597941895195cd97bdb637d46cea Call transfers are causing call failures; we are investigating.

Calls are degraded https://status.vapi.ai/incident/406346 Tue, 30 Jul 2024 21:00:00 -0000 https://status.vapi.ai/incident/406346#8a379ecbecb6000c93d3eb09611128bcda717cb053a77d3b8ba33f67aeb864f4 We have resolved the issue. The cause was that the default CoreDNS autoscaler in EKS didn't scale according to the workload, causing DNS queries within our cluster to start failing and requests to hang.

Calls are degraded https://status.vapi.ai/incident/406346 Tue, 30 Jul 2024 20:00:00 -0000 https://status.vapi.ai/incident/406346#e8d060b079e598292b27dcfe58f4e3ab3d917ba4e272bb889684eeffd44f9a7a We are investigating