Degraded

Elevated Call Failure Rate on Weekly

Mar 19, 2026 at 11:17pm UTC
Affected services
Vapi API [Weekly]

Resolved
Mar 21, 2026 at 05:46am UTC

Incident Report, March 19, 2026

Impact: A service disruption affected inbound and outbound call reliability on the Daily and Weekly channels. Some calls failed with the end reasons transport-never-connected, worker-not-available, worker-died, and deepgram-transcriber-failed.
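If you need to identify calls affected by this incident, the sketch below shows one way to filter recent calls by these end reasons. It is a minimal example, not an official tool: the endpoint path (GET https://api.vapi.ai/call), the Bearer auth scheme, the response shape, and the endedReason/id/createdAt field names are assumptions to verify against the current API reference.

    # Minimal sketch: flag recent calls that ended with the failure
    # reasons listed above. The endpoint path, auth scheme, response
    # shape (a JSON array of call objects), and field names are
    # assumptions -- verify against the current Vapi API reference.
    import os
    import requests

    AFFECTED_END_REASONS = {
        "transport-never-connected",
        "worker-not-available",
        "worker-died",
        "deepgram-transcriber-failed",
    }

    resp = requests.get(
        "https://api.vapi.ai/call",  # assumed list-calls endpoint
        headers={"Authorization": f"Bearer {os.environ['VAPI_API_KEY']}"},
        params={"limit": 100},  # page through your incident window as needed
        timeout=30,
    )
    resp.raise_for_status()

    for call in resp.json():
        if call.get("endedReason") in AFFECTED_END_REASONS:
            print(call.get("id"), call.get("endedReason"), call.get("createdAt"))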

Timeline (all times PDT):

12:20 PM We detected elevated call failure rates on the Weekly production cluster.

12:22 PM We published a status page incident and began investigating.

12:25 PM We identified the trigger: an unanticipated surge in call volume that exceeded both our provisioned cluster capacity and our downstream rate limits with a model provider.

12:30 PM We applied traffic controls and began working with the model provider to increase capacity. Call failures began declining.

1:40 PM Call success rates returned to normal and held stable. First incident window closed.

~4:00 PM A separate traffic spike re-triggered the same capacity constraints, causing elevated call failures. We began investigating immediately.

4:00 PM to 4:40 PM We rebalanced traffic and migrated affected workloads to dedicated infrastructure to restore headroom on shared clusters.

4:50 PM All mitigations took effect. Call success rates returned to normal.

4:50 PM to 8:10 PM We continued active monitoring. No further failures observed.

8:10 PM Second incident window closed.

Immediate Action Items: Improve workload isolation and per-account capacity guardrails to prevent resource contention from cascading across the platform.
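For illustration only, here is a minimal sketch of the kind of per-account guardrail described above: a cap on concurrent calls per account that sheds excess load at admission rather than letting it queue on shared workers. Every name is hypothetical; this is not Vapi's implementation.

    # Illustrative sketch only (not Vapi's implementation): cap concurrent
    # calls per account so one account's surge cannot exhaust shared
    # cluster capacity. All names here are hypothetical.
    import threading
    from collections import defaultdict

    class PerAccountGuardrail:
        def __init__(self, max_concurrent_per_account: int = 50):
            self._limit = max_concurrent_per_account
            self._active = defaultdict(int)  # account_id -> in-flight calls
            self._lock = threading.Lock()

        def try_acquire(self, account_id: str) -> bool:
            # Admit the call if the account is under its cap; otherwise
            # reject it (e.g. with a 429) so the overload stays contained.
            with self._lock:
                if self._active[account_id] >= self._limit:
                    return False
                self._active[account_id] += 1
                return True

        def release(self, account_id: str) -> None:
            with self._lock:
                self._active[account_id] = max(0, self._active[account_id] - 1)

Rejecting excess calls at admission keeps a single account's spike from cascading into worker-not-available failures for everyone else.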

Note: A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.

Updated
Mar 20, 2026 at 04:23am UTC

The incident has been resolved and services have been stable since 4:50 PM PT. We will continue monitoring and will publish additional details if necessary.

Updated
Mar 20, 2026 at 03:09am UTC

Our earlier mitigation is working and services have been stable since 4:50 PM PT. We have identified a potential root cause and are working on a permanent fix. We will share further updates once we deploy and validate the fix.

Updated
Mar 20, 2026 at 12:22am UTC

The immediate mitigation we deployed is working, and call success rates are continuing to recover. We are still investigating the root cause and closely monitoring service performance.

Updated
Mar 19, 2026 at 11:50pm UTC

We're seeing improved call success rates, but we're still monitoring the situation.

Created
Mar 19, 2026 at 11:17pm UTC

We're seeing elevated call failures on the Weekly channel, and the team is actively looking into it.