Previous incidents
SIP is degraded
Resolved Jun 30 at 09:43pm PDT
TL;DR: A temporary slowdown caused by saturation in our API gateway layer increased response times until they exceeded the edge-network timeout, causing 524 HTTP responses for some API requests.
Timeline (PT)
01:00 AM First elevated 524 error responses detected
06:35 AM Rolled back recent backend release (no improvement)
07:19 AM Rolled back related network changes (no improvement)
08:22 AM Scaled up API gateway
09:36 AM Scaled up API gateway further
10:00 AM Reverted the previous night's...
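While the gateway was saturated, affected requests surfaced as HTTP 524 (an edge-network timeout) rather than an application error. Clients can soften incidents of this shape by retrying idempotent requests with backoff. A minimal sketch, not Vapi's actual client; the fetcher is injected so the retry policy is independent of any particular HTTP library:

```typescript
// Retry-on-524 sketch: re-issue an idempotent request when the edge
// returns 524, backing off exponentially between attempts.
type Fetcher = () => Promise<{ status: number }>;

async function retryOn524(
  doFetch: Fetcher,
  attempts = 3,
  baseDelayMs = 500,
): Promise<{ status: number }> {
  let last = await doFetch();
  for (let i = 1; i < attempts && last.status === 524; i++) {
    // Exponential backoff before the next attempt: 500ms, 1000ms, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (i - 1)));
    last = await doFetch();
  }
  return last;
}
```

In real use `doFetch` would wrap `fetch()` against the API endpoint and only idempotent operations should be retried.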
DB High Latency
Resolved Jun 25 at 11:54am PDT
The issue has been resolved: https://neonstatus.com/aws-us-west-oregon/incidents/01JYM23FB7HR82VPZR9DBKVPP8#01JYM4F8Y8MZARWJWRFKV01AP5.
Increased database latency causing requests to fail
Resolved Jun 20 at 01:43pm PDT
Our database provider has reported this issue as resolved from their end.
Breaking changes to Success Evaluation API Response
Resolved Jun 18 at 07:42pm PDT
TL;DR
In response to hallucination reports in the Success Evaluation feature, we updated our integration with Gemini LLM to use Structured Output. This inadvertently changed the type of the call.analysis.successEvaluation field from string | null to string | number | boolean | null, introducing a breaking change for customers with strict type validation and those using Vapi Server SDKs.
Timeline (all in PT)
- June 12, 11:32pm: Enterprise and Startup users report hallucinations in Su...
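Customers affected by the widened union can defend against it by normalizing the field before validation. A minimal sketch, assuming the union described above; the function name is illustrative, not part of the Vapi SDK:

```typescript
// call.analysis.successEvaluation widened from `string | null` to
// `string | number | boolean | null`. Normalize it back to
// `string | null` so strict string-only validators keep working.
type SuccessEvaluation = string | number | boolean | null;

function normalizeSuccessEvaluation(value: SuccessEvaluation): string | null {
  if (value === null) return null;
  // Numbers and booleans are stringified, e.g. true -> "true", 7.5 -> "7.5".
  return typeof value === "string" ? value : String(value);
}
```

Normalizing at the boundary keeps the breaking change contained to one adapter instead of rippling through downstream type checks.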
Sign-ups/Sign-ins are not working
Resolved Jun 13 at 01:12am PDT
This issue has been resolved.
Elevenlabs voice provider not working with custom API Key
Resolved Jun 12 at 03:20am PDT
Summary:
We experienced an issue related to API key validation within our WebSockets implementation when sending the API key more than once.
Details:
The issue arose during API key validation within our WebSockets implementation. Our system validates that the API key provided in the initial message matches the key provided in subsequent messages.
A recent change introduced during a release caused a mismatch in how API keys were compared. Specifically, the system was comparing hashed API keys agains...
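The class of bug described here, comparing a hashed key against a value in a different representation, can be avoided by always hashing both sides before a constant-time compare. A sketch using Node's crypto module; the function names are illustrative, not Vapi's actual implementation:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash an API key so stored and incoming values are always compared
// in the same representation (never plaintext vs. hash).
function hashKey(apiKey: string): Buffer {
  return createHash("sha256").update(apiKey).digest();
}

// Compare the key from the initial WebSocket message with the key on a
// subsequent message. Both sides are hashed to equal-length digests, so
// timingSafeEqual is safe to call and leaks no timing information.
function sameKey(initialKey: string, laterKey: string): boolean {
  return timingSafeEqual(hashKey(initialKey), hashKey(laterKey));
}
```

Funneling every comparison through one helper makes a representation mismatch impossible by construction, rather than relying on each call site to remember which form it holds.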
API was down
Resolved Jun 01 at 11:00pm PDT
API was down due to user error during routine maintenance. Service has since been restored.
Users Unable to Sign In to Dashboard
Resolved May 26 at 06:37pm PDT
Summary
Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment.
Timeline (PT):
- 3:17 PM: Internal engineers identified issues affecting developer workflows.
- 4:19 PM: A breaking change was introduced and unintentionally deployed to production.
- 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally.
- 4:43 PM: Public status page updated to notify customers.
- 4:54 PM: Correc...
Cartesia voices are degraded
Resolved May 18 at 01:56pm PDT
Everything is functional. We're still working with Cartesia to get to the bottom of this, and we'll set the status back to degraded if the issue arises again during the investigation.
Vapifault Worker Timeouts
Resolved May 13 at 10:31am PDT
RCA: Vapifault Worker Timeouts
TL;DR
On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 in the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue recurred due to the prescaling limit...
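Because pods cannot scale up instantly, the numbers in this summary suggest a simple headroom check: alert well before concurrency reaches the prescaled limit. A hedged sketch; the 250-worker limit comes from the summary above, while the 80% warning threshold is purely illustrative:

```typescript
// Headroom check against a prescaled worker pool. Pods take time to
// scale, so warn before concurrency reaches the hard limit.
const PRESCALED_WORKERS = 250; // limit cited in the incident summary
const WARN_RATIO = 0.8;        // illustrative alert threshold

function workerHeadroom(concurrentCalls: number): { shortfall: number; warn: boolean } {
  return {
    // Calls beyond capacity (0 when under the limit).
    shortfall: Math.max(0, concurrentCalls - PRESCALED_WORKERS),
    warn: concurrentCalls >= PRESCALED_WORKERS * WARN_RATIO,
  };
}

workerHeadroom(335); // the May 12 situation: 85 calls over the limit
```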
providerfault-transport errors
Resolved May 13 at 10:29am PDT
RCA: Providerfault-transport-never-connected
Summary
During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume.
Timeline of Events (PT)
- 10:26 AM: Significant spike in inbound call volume.
- 10:26 – 10:40 AM: Intermittent H...
SIP calls abruptly closing after 30 seconds
Resolved May 13 at 10:27am PDT
RCA: SIP Calls Ending Abruptly
TL;DR
A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts.
Timeline (PT)
- May 12, ~9:00 pm: SIP node rotated and Elastic IP reassigned, but SI...
Stale data for Weekly users
Resolved May 13 at 10:22am PDT
RCA: Phone Number Caching Error in Weekly Environment
TL;DR
Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers.
Timeline (PT)
- May 10, 1:26 am: Caching re-enabled for users in daily envir...
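One mitigation for the class of bug described here is to make the organization ID a precondition of every cache operation, so a lookup can never resolve against the wrong organization's channel. A minimal sketch with an in-memory map; all names are hypothetical, not Vapi's actual caching layer:

```typescript
// Guarded cache sketch: refuse to read or write cached phone-number
// data unless an organization ID is present, rather than falling
// through to an un-keyed (and potentially stale or wrong) entry.
const cache = new Map<string, string>();

function cachedServerUrl(orgId: string | undefined, phoneNumber: string): string | null {
  if (!orgId) return null; // no org context: skip the cache entirely
  return cache.get(`${orgId}:${phoneNumber}`) ?? null;
}

function storeServerUrl(orgId: string | undefined, phoneNumber: string, url: string): void {
  if (!orgId) return; // never cache without an org-scoped key
  cache.set(`${orgId}:${phoneNumber}`, url);
}
```

Scoping every key by organization turns the missing-ID code path into a cache miss instead of a stale hit, which fails safe: the caller falls back to a fresh lookup.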
Voice issues due to 11labs quota
Resolved May 10 at 10:34am PDT
RCA: 11Labs Voice Issue
TL;DR
Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (vapifault-eleven-labs-quota-exceeded).
Timeline of Events (PT)
- 12:04 PM: Calls begin failing due to 11Labs quota being exceeded.
- 12:16 PM: Customer reports the issue as a production outage.
- 12:24 PM: Contacted 11Labs support regarding quota exhaustion.
- 12:25 PM: 11Labs support recommends enabling usage-based billing.
- 12:26 PM: Usag...
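Quota exhaustion of this kind is easiest to catch before the hard limit, not after calls start failing. A hedged sketch of a usage watchdog; the thresholds and character counts are illustrative, not 11Labs' actual quota values:

```typescript
// Quota watchdog sketch: classify voice-provider usage so an alert can
// fire (and billing can be adjusted) before calls start failing with
// quota-exceeded errors.
function quotaStatus(
  usedChars: number,
  quotaChars: number,
  warnRatio = 0.9, // illustrative early-warning threshold
): "ok" | "warn" | "exceeded" {
  if (usedChars >= quotaChars) return "exceeded";
  return usedChars >= quotaChars * warnRatio ? "warn" : "ok";
}

quotaStatus(500_000, 1_000_000); // "ok"
quotaStatus(950_000, 1_000_000); // "warn"
```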
API degradation
Resolved May 02 at 06:43pm PDT
RCA for May 2nd: User error in manual rollout
Root cause:
- User error in kicking off a manual rollout, driven by the need to unblock a release
- As a result, the load balancer was pointed at an invalid backend cluster
Timeline
- 5:24pm PT: Engineer flagged blocked rollout, Infra engineer identified transient error that auto-blocked rollout
- 5:31pm PT: Infra engineer triggered manual rollout on behalf of engineer, to unblock release
- 5:43pm PT: On-call was paged with issue in rollout manager, e...
WebCalls from Dashboard are unavailable
Resolved May 01 at 04:14pm PDT
We've applied a fix to the dashboard web call feature. The root cause was a misconfiguration of the authentication strategy used for this feature.
Issues with KYC verification
Resolved May 01 at 04:10pm PDT
The team has applied a fix to the frontend issue. The root cause was a change related to authentication handlers. Additionally, our KYC vendors are extending the observability features in their SDK.