Resolved

Voice calls dropped and dashboard unavailable

May 21, 2026 at 1:55pm UTC

Affected services

Vapi API

Vapi API [Weekly]

Vapi Dashboard

Resolved
May 29, 2026 at 8:22pm UTC

Vapi RCA - US Platform Outage, May 21, 2026

RCA: US Platform Outage (Database Connection Exhaustion)
Status: Resolved
Primary Impact: Voice calls and API requests failed; the dashboard was unavailable
Date: May 21, 2026
Customer-facing systems affected: Calls, API, Chat, Dashboard, Logs
Severity: Critical

Executive Summary

On May 21, 2026, the Vapi US platform was unavailable for approximately 4 hours and 7 minutes, from 7:02 AM PT until service was restored at 11:09 AM PT. Calls began failing at 7:02 AM PT, and by 7:12 AM PT the platform had progressed to a full outage of voice calls and the dashboard. During the window, voice calls failed to connect or complete, API requests returned errors, and the dashboard was inaccessible. Customers served from our EU and AUS regions were not affected.

The outage began with a configuration change our database provider applied at 6:44 AM PT to the audit logging module used by our databases. The telemetry/audit logging setting caused Postgres processes to block while writing log output (syslog) at the moment they accepted new connections. Two databases that every call and API request depends on stopped accepting traffic, so the platform could neither authorize requests nor record new calls. Service was restored after the provider disabled the affected logging and restarted the database, and we ramped traffic back in a controlled way to avoid a second failure. We marked the incident resolved at 11:49 AM PT after sustained monitoring confirmed stability.

The changes below focus on removing the dependencies that let a database fault become a full outage.

What You May Have Observed

Inbound and outbound voice calls failing to connect, or ending early
API requests returning server errors (HTTP 5xx)
Inability to access the Vapi dashboard
Scheduled and campaign calls not firing during the window
A brief period of partial recovery mid-incident, where some calls succeeded before failures resumed

Timeline (PT)

6:44 AM: Our database provider applies a configuration change to audit logging.
7:02 AM: First monitors fire for API degradation; calls begin failing.
7:18 AM: On-call engineer acknowledges the page and begins investigating.
7:38 AM: The core infrastructure team opens a formal incident.
7:43 AM: Status page updated to report degradation.
7:53 AM: Status page updated to confirm a full outage of calls and dashboard.
8:10 AM: Production database identified as the source; database provider support engaged.
8:16 AM: Escalated with the provider for higher urgency.
8:29 AM: Status page confirms the production database as the cause.
9:35 to 10:00 AM: Brief recovery after a database restart, followed by renewed degradation as traffic resumed.
10:03 AM: Provider identifies root cause: database processes stuck writing audit logs while accepting connections, caused by a misconfigured project-wide telemetry setting.
10:05 AM: Affected logging disabled and the database endpoint restarted; the database returns to normal.
10:05 to 11:09 AM: Traffic ramped back gradually, weekly cluster first (restored 10:38 AM), then daily (restored 11:09 AM).
11:49 AM: Incident resolved.
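
The intervals implied by this timeline can be checked with a few lines of arithmetic. The sketch below is illustrative only: the times are copied from the timeline above, while the event labels and function names are shorthand introduced here, not anything Vapi uses.

```python
from datetime import datetime

# Times (PT) copied from the timeline above. The labels are shorthand
# introduced for this sketch, not names used by Vapi.
EVENTS = {
    "provider_change": "6:44",
    "first_monitors": "7:02",
    "oncall_ack": "7:18",
    "status_degradation_posted": "7:43",
    "provider_engaged": "8:10",
    "root_cause_identified": "10:03",
    "daily_cluster_restored": "11:09",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end; both are 'H:MM' times on the same morning."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def minutes_after_first_monitors(label: str) -> int:
    """Minutes from the first monitor firing (7:02 AM) to the named event."""
    return minutes_between(EVENTS["first_monitors"], EVENTS[label])


if __name__ == "__main__":
    for label in EVENTS:
        print(f"{label:>26}: {minutes_after_first_monitors(label):+4d} min")
    hours, minutes = divmod(minutes_after_first_monitors("daily_cluster_restored"), 60)
    print(f"outage window: {hours} h {minutes} min")
```

Run against these times, the window from 7:02 AM to 11:09 AM comes to 247 minutes, which matches the 4 hours and 7 minutes stated in the Executive Summary. The first status page update lands 41 minutes after the first monitors fired.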

Root Cause

An audit-logging change exhausted the database connection pool

Vapi’s databases include audit logging. At 6:44 AM PT, our database provider changed where that audit output was written. A misconfiguration in the provider's project-wide telemetry settings caused Postgres processes to block while writing log output at the moment they accepted new connections. Connections accumulated in a waiting state until the pool was exhausted, and the affected databases could no longer accept traffic.
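
To make the failure mode concrete, here is a toy model of a connection pool whose connections stop being released. It is a sketch under stated assumptions, not a model of the provider's actual system: the pool size, arrival rate, and tick length are hypothetical numbers chosen only to show the shape of the problem.

```python
from typing import List, Tuple


def simulate_pool(max_connections: int,
                  arrivals_per_tick: int,
                  releases_per_tick: int,
                  ticks: int) -> Tuple[List[int], int]:
    """Step a fixed-size pool forward in time.

    Returns (connections in use after each tick, total rejected attempts).
    All parameters are hypothetical; they do not come from the incident data.
    """
    in_use = 0
    rejected = 0
    history: List[int] = []
    for _ in range(ticks):
        # Connections that finish their work hand their slot back.
        in_use = max(0, in_use - releases_per_tick)
        # New connections take whatever slots are free; the rest are refused.
        free = max_connections - in_use
        accepted = min(arrivals_per_tick, free)
        rejected += arrivals_per_tick - accepted
        in_use += accepted
        history.append(in_use)
    return history, rejected


if __name__ == "__main__":
    # Healthy: every connection completes and is released within a tick.
    print(simulate_pool(100, 20, 20, 8))
    # Blocked: each connection stalls on a log write and is never released.
    print(simulate_pool(100, 20, 0, 8))
```

In the healthy case the pool settles at a steady level and nothing is rejected. In the blocked case usage climbs by the arrival rate every tick until it reaches the limit, after which every new attempt is refused. That is the pattern described above: connections accumulate in a waiting state until the pool is exhausted.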

A hard dependency on two databases turned the problem into a full outage

Every API request and call passes through two of these databases: one authorizes the request at our API gateway, and the other records the call before it is queued for processing. Both were among the affected databases. With neither reachable, the platform could not authorize requests or create calls, so the failure was total rather than partial. A read replica was in place, but the provider's configuration change applied at the project level and affected both the primary and replica.

Resulting customer impact

For the duration of the window, calls failed to connect or complete and API requests returned errors. The dashboard, which also depends on these databases, did not load. Scheduled and campaign calls due to fire during the window did not run and were not retried automatically.

Customers would also have noticed the failures before the Vapi status page was updated, because of the delay between on-call acknowledgment and the first status page update.

Resolution

Our database provider disabled the audit log collection on the affected databases and restarted the database endpoint, which restored their ability to accept connections.
To avoid a connection storm overwhelming the recovered databases, we brought internal services back up gradually rather than all at once, restoring the weekly cluster first and then the daily cluster.
We confirmed recovery through API success-rate, error-rate, and database latency monitoring before closing the incident. Our provider has since published and confirmed a fix to the affected audit log collectors.
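
The gradual ramp described above can be sketched as a loop that raises traffic one step at a time and backs off to the last healthy level if a health check fails. This is a minimal illustration, not Vapi's tooling: `set_fraction`, `is_healthy`, and the step values are hypothetical, and a real health check would hold at each level and sample success-rate and latency metrics before moving on.

```python
from typing import Callable, Iterable


def ramp_traffic(set_fraction: Callable[[float], None],
                 is_healthy: Callable[[], bool],
                 steps: Iterable[float] = (0.05, 0.25, 0.5, 1.0)) -> float:
    """Raise the traffic fraction step by step.

    After each step the health check runs. On failure, traffic is returned to
    the last level that passed and the ramp stops. Returns the fraction left
    in place. Both callables and the default steps are hypothetical.
    """
    last_healthy = 0.0
    for target in steps:
        set_fraction(target)
        if not is_healthy():
            set_fraction(last_healthy)  # back off rather than push through
            return last_healthy
        last_healthy = target
    return last_healthy


if __name__ == "__main__":
    applied = []
    # Hypothetical backend that copes with at most 60% of normal traffic.
    final = ramp_traffic(applied.append, lambda: applied[-1] <= 0.6)
    print("fractions applied:", applied)
    print("final fraction:", final)
```

The point of the design is that a recovering database never sees its full load arrive at once, which is the connection-storm risk the ramp was meant to avoid.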

Prevention

Within the next week, we are working to:

Add production monitoring for database query latency and connection-volume anomalies, so this signature is caught at onset rather than inferred from downstream symptoms.
Establish a faster, clearly defined escalation path with our database provider. Provider escalation and resolution took longer than they should have.
Implement a change management process with our database provider so changes that could impact availability are done in a more controlled manner.
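
As an illustration of the first item, connection-volume anomaly detection can be as simple as comparing each new sample with a rolling baseline. The sketch below is one possible approach, not the monitor Vapi is building: the class name, window size, and thresholds are hypothetical, and the samples are assumed to be periodic counts of waiting connections (for example, derived from Postgres's `pg_stat_activity` view).

```python
from collections import deque
from statistics import mean, pstdev


class ConnectionAnomalyDetector:
    """Flag samples that sit far above a rolling baseline.

    All defaults are hypothetical and would need tuning against real traffic.
    """

    def __init__(self, window: int = 30, sigma: float = 3.0,
                 min_std: float = 1.0, warmup: int = 5) -> None:
        self.samples = deque(maxlen=window)
        self.sigma = sigma
        self.min_std = min_std  # floor so a flat baseline does not alert on noise
        self.warmup = warmup    # samples needed before any alert can fire

    def observe(self, waiting_connections: int) -> bool:
        """Record one sample; return True if it is anomalous."""
        anomalous = False
        if len(self.samples) >= self.warmup:
            baseline = mean(self.samples)
            spread = max(pstdev(self.samples), self.min_std)
            anomalous = waiting_connections > baseline + self.sigma * spread
        if not anomalous:
            # Only normal samples update the baseline, so a sustained spike
            # keeps alerting instead of becoming the new normal.
            self.samples.append(waiting_connections)
        return anomalous


if __name__ == "__main__":
    detector = ConnectionAnomalyDetector()
    for count in [4, 5, 6, 5, 4, 5, 6, 5, 40, 120]:
        print(count, detector.observe(count))
```

A detector of this kind keys on the database-side signature directly, rather than waiting for downstream API error rates to rise.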

Over the next quarter, we are working to:

Introduce a read cache in front of authorization and configuration data, removing the hard dependency on that database so a future database fault degrades the platform rather than disabling it.
Migrate our critical databases to a platform that supports a regional hot standby, so we can fail over when a primary becomes unavailable.
Implement automatic status updates based on monitors, to ensure customers are made aware of incidents when we are.
Move call-record writes to an asynchronous path, removing the hard dependency on the call database for connecting a call.
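
The read-cache item can be pictured as a cache that serves fresh entries normally and falls back to a recent stale entry when the database cannot be reached. This is a minimal sketch of that idea, not Vapi's planned design: `StaleTolerantCache`, its parameters, and the `loader` callable are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class StaleTolerantCache:
    """Read-through cache that tolerates loader failures for a bounded time.

    The loader stands in for a database read. All names and defaults here
    are hypothetical.
    """

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 30.0,
                 max_stale_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[Any, float]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Fresh entry: serve it without touching the database.
        if entry is not None and now - entry[1] <= self._ttl:
            return entry[0]
        try:
            value = self._loader(key)
        except Exception:
            # Database unreachable: degrade to a stale entry if one is recent enough.
            if entry is not None and now - entry[1] <= self._max_stale:
                return entry[0]
            raise
        self._entries[key] = (value, now)
        return value
```

The trade-off is that stale authorization data stays usable for up to `max_stale_seconds`, so a revoked credential could remain valid during a database fault. The bound should be chosen with that in mind.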

Customer Actions

No action is required to restore service.

We are committed to earning your trust. As part of that, your account team will share progress on the prevention items with you.

Updated
May 22, 2026 at 3:03am UTC

Incident Report: Database Outage (Log Collector Misconfiguration)

What Happened

Vapi experienced a major service outage that caused voice calls to fail and the dashboard to become unavailable. The outage was caused by a failure in an audit log collector in the Vapi production database. The triggering event was an apply_config that our database provider executed at 6:44 AM PT. A misconfiguration in the project-wide telemetry settings caused Postgres processes to become stuck writing to syslog when accepting new connections, exhausting the connection pool and rendering the database unable to accept traffic, including from within the pod itself.

We notified our provider's support line at 8:10 AM PT. Our database provider identified the root cause at 10:03 AM PT. Mitigation was applied by disabling the OTEL connection and restarting the endpoint, after which the system returned to a normal state. A fix to the audit collectors was subsequently published and confirmed stable.

Customer Impact

Service availability: Large outage. Vapi's voice services were unavailable during the incident window, affecting 100% of customers from 7:12 AM PT until 11:49 AM PT when the incident was marked as resolved.
The Vapi dashboard was also unavailable during that time.

Timeline (PT)

Time	Event
6:44 AM	Our database provider executes an `apply_config` change, triggering the incident.
7:12 AM	Vapi begins to observe call degradation.
7:22 AM	The team begins its investigation.
7:43 AM	Vapi updates the status page to notify customers of observed degradation.
7:53 AM	Vapi updates the status page to confirm a full outage of both voice calls and the dashboard.
8:10 AM	Vapi suspects production database behavior as the source of the problem and notifies the database provider's support team. Initial investigation begins; a large spike in waiting-status connections is observed.
8:16 AM	Internal escalation with the database provider for increased urgency.
8:29 AM	Vapi confirms production database behavior as the source of the problem on the status page and continues to collaborate with the database provider on mitigation.
9:35–10:00 AM	A brief recovery is observed after restarting the database, but degradation reappears after services are scaled back up.
10:03 AM	Database provider identifies the root cause: Postgres processes stuck on syslog writes during connection acceptance, caused by a misconfigured project-wide `telemetry_setting` for log collectors.
10:03 AM+	OTEL connection disabled; endpoint restarted; system returns to normal state.
10:38 AM	Vapi increases traffic back on the weekly environment and confirms that service is restored.
11:09 AM	Vapi increases traffic back on the daily environment, confirms service is restored, and moves to a monitoring stage.
11:49 AM	Vapi marks the incident as resolved.

Updated
May 21, 2026 at 6:32pm UTC

Service has returned to normal operating levels. Call success metrics have recovered and remained stable for 30 minutes across both daily and weekly channels, and all platform functionality has been restored. We’re continuing to monitor closely and will provide further updates if anything changes. We will update the status page with an incident report within 12 hours. Thank you for your patience.

Updated
May 21, 2026 at 6:09pm UTC

Services are recovering across both our weekly and daily clusters, and all metrics are trending positive. Our DB provider has identified and confirmed the root cause and we have applied an initial remediation. Our DB provider team remains actively engaged with us as we scale load back up. We are continuing to monitor closely and will provide updates as we have them.

Updated
May 21, 2026 at 6:06pm UTC

Our weekly cluster is still showing recovery and calls are going through. We are still monitoring the situation. We have shifted our focus to our daily cluster.

Updated
May 21, 2026 at 5:41pm UTC

Our weekly cluster has seen recovery over the last 20 minutes. We are still monitoring the situation, as calls may fail again.
We are now moving on to fixing calls in our daily cluster.

We will post updates when we have new information to share or in 30 minutes.

Updated
May 21, 2026 at 5:13pm UTC

Our daily cluster is still experiencing a full outage. Weekly is seeing some recovery.

Updated
May 21, 2026 at 5:11pm UTC

We are still in a degraded state on weekly and working on fully resolving the issue.
Our daily cluster is still out.

Updated
May 21, 2026 at 4:41pm UTC

We are seeing some recovery in Voice Calls and the Dashboard and are continuing to monitor the situation. We will post updates as we have news or in 30 minutes.

Updated
May 21, 2026 at 4:29pm UTC

Our DB provider has escalated to the highest level. Their most senior architect is now directly involved in identifying the fix. We are collaborating closely on resolution.
We will post an update as we have news or in 30 minutes.

Updated
May 21, 2026 at 3:59pm UTC

Our DB provider has confirmed the config change that we identified as the cause of our DB outage, which is causing voice calls to drop and our dashboard to fail to load.
We are collaborating with our provider on an eventual fix or workaround. They have escalated this issue to the highest level of urgency on their side.
We will post an update as we have news or in 30 minutes.

Updated
May 21, 2026 at 3:27pm UTC

We are still investigating a complete outage in Voice Calls. Our DB provider applied a configuration change at 6:44 AM PT, which is causing our DB to be completely unavailable. We are working with them to get our DBs back up.

We do not have an ETA or resolution yet; however, our provider has escalated the issue internally.

We will post an update as we learn more or in 30 minutes.

Created
May 21, 2026 at 1:55pm UTC

We are investigating an incident causing voice calls to drop. We will publish updates as we get more information or in 30 minutes.