Incident Report: Increased call failures on July 25 (PDT)

Jul 25 at 02:46pm PDT
Affected services
Vapi API

Resolved
Jul 25 at 02:46pm PDT

Summary (TL;DR)
On July 25, between 7:00 and 7:15am PDT, a spike in call volume caused some calls to fail with a worker-not-available error. The fallback service for short calls (our serverless workers) could not start because its image architecture didn’t match the configured runtime (the image was built for ARM64, but the runtime was set to x86). We stabilized the platform by scaling primary workers and then corrected the configuration. Service is now operating normally.

Impact
Total failed calls: 3,028 between 7:00 and 7:15am PDT, with the error call.in-progress.error-vapifault-worker-not-available.

1,122 of these calls (about 37%) were eligible to be handled by our serverless workers but still failed because those workers could not start.

Current status: Resolved. No action is required from customers. If you experienced failures during this window, please retry the affected calls.

Timeline (PDT, July 25)
7:00am: Sudden spike in incoming calls.

7:00–7:15am: Elevated failures with worker-not-available.

~11:51am: Incident triage began; we confirmed that autoscaling had attempted to invoke the serverless workers.

~12:56pm: Root cause identified: the serverless worker image was built for ARM64 while the runtime was still configured for x86, which prevented startup.

After identification: We increased capacity on primary backends to minimize reliance on the fallback path and then redeployed the serverless workers with the correct architecture.

Root Cause
A configuration mismatch between the container image architecture (ARM64) and the serverless runtime setting (x86) prevented our fallback workers from starting during a sudden traffic surge.
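
As an illustration of the mismatch described above, the following is a minimal sketch of how an image architecture and a runtime architecture could be compared before a deployment. It is not our actual tooling: the function names, the alias table, and the choice to treat the shorthand "x86" as 64-bit x86 are all assumptions made for this example.

```python
# Hypothetical helper names; this is an illustrative sketch, not production code.

# Common spellings mapped to one canonical label.
# Assumption: "x86" is shorthand for 64-bit x86, as used in this report.
_ARCH_ALIASES = {
    "arm64": "arm64",
    "aarch64": "arm64",
    "x86": "x86_64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def normalize_arch(name: str) -> str:
    """Return the canonical label for an architecture name."""
    key = name.strip().lower().replace("-", "_")
    if key not in _ARCH_ALIASES:
        raise ValueError(f"unknown architecture: {name!r}")
    return _ARCH_ALIASES[key]


def check_arch_match(image_arch: str, runtime_arch: str) -> None:
    """Raise RuntimeError if the image and runtime target different architectures."""
    image = normalize_arch(image_arch)
    runtime = normalize_arch(runtime_arch)
    if image != runtime:
        raise RuntimeError(
            f"architecture mismatch: image={image}, runtime={runtime}"
        )
```

Calling check_arch_match("arm64", "x86") raises an error under these assumptions, while check_arch_match("aarch64", "arm64") passes, because both names normalize to the same label.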

Remediation & Prevention
Completed

Aligned the serverless runtime architecture with the container image (both are now ARM64).

Temporarily scaled primary worker capacity to handle surges while deploying the fix.

In Progress / Planned

Automated canary tests: Periodically invoke serverless workers to ensure readiness and catch regressions early.

Alerting: Add targeted alerts when the fallback path is degraded or invocation rates drop unexpectedly.

Build-time and deploy-time guards: Enforce architecture checks so image and runtime must match before deployment.

Dependency review: Audit and, where needed, adjust dependencies to ensure reliable ARM operation in serverless environments.
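
The canary and alerting items above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the planned implementation: the invoke callable stands in for whatever call actually starts a serverless worker, and the function names and the 0.8 threshold are hypothetical.

```python
from typing import Callable


def canary_success_rate(invoke: Callable[[], bool], attempts: int = 5) -> float:
    """Probe the fallback path `attempts` times and return the fraction that succeeded.

    `invoke` is a hypothetical stand-in for a real serverless invocation. It should
    return True on success; a False return or a raised exception counts as a failure.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for _ in range(attempts):
        try:
            if invoke():
                successes += 1
        except Exception:
            # A startup failure (for example, an architecture mismatch)
            # is recorded as a failed probe rather than aborting the canary.
            pass
    return successes / attempts


def should_alert(success_rate: float, min_success_rate: float = 0.8) -> bool:
    """Return True when the fallback path looks degraded (threshold is an assumption)."""
    return success_rate < min_success_rate
```

Running a probe like this on a schedule, rather than only during traffic surges, is what would surface a fallback path that cannot start before real calls depend on it.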