Degraded

Calls failing due to worker unavailability

Dec 17 at 11:06am PST
Affected services
Vapi API
Vapi API [Weekly]

Resolved
Dec 23 at 03:18pm PST

[IR] Dec 17th — Call Worker Degradation — Object Storage Upload Errors

Summary

On December 17, 2025, at 10:25 AM PST, we observed degradation in our Call Worker service. The issue was caused by Call Workers becoming blocked while uploading call recordings to a downstream object storage provider that was itself experiencing an outage. The incident was fully resolved by 11:02 AM PST, once the downstream provider recovered and Call Workers returned to normal operation.

Timeline (PST)

  • 10:25 AM — Initial call degradation alert triggered
  • 10:40 AM — Investigation began
  • 10:45 AM — Downstream provider partially recovered
  • 10:52 AM — Downstream provider fully recovered
  • 11:02 AM — Call Workers fully recovered

Root Cause

A downstream object storage provider experienced a partial outage, during which call recording upload requests began failing or stalling. Requests either timed out or returned 502 errors.

These stalled upload operations increased processing time within Call Workers, leading to worker exhaustion. Because of this resource saturation, the system could not scale quickly enough to accept new incoming calls, resulting in dropped or unaccepted calls during the affected period.
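One of the remediation items below, more aggressive timeouts and retries, addresses this failure mode directly. As a minimal sketch (the function and parameter names are illustrative, not our actual implementation), each upload attempt is bounded by a timeout and retried with exponential backoff, so a stalled provider cannot hold a worker indefinitely:

```python
import time

def upload_with_timeout(upload_fn, data, timeout_s=5.0, retries=2, backoff_s=0.5):
    """Attempt an upload, bounding each try with a timeout and retrying on failure.

    upload_fn is assumed to accept a `timeout` keyword and raise on
    timeouts or 5xx responses (e.g. the 502s seen during this incident).
    """
    last_err = None
    for attempt in range(retries + 1):
        try:
            return upload_fn(data, timeout=timeout_s)
        except Exception as err:
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between tries
    raise last_err  # all attempts exhausted; caller decides what to do next
```

With a bound like this, the worst-case time a worker spends on one recording is known up front, which keeps worker exhaustion from cascading into dropped calls.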

Impact

For approximately 30 minutes, a subset of calls could not be accepted or were dropped due to unavailable or terminated Call Workers. There was no data loss.

Calls not picked up (worker not available):
  • Daily organizations: 15,555
  • Weekly organizations: 478

What Went Well

  • Autoscaling eventually restored worker availability without manual intervention.

What Went Poorly

  • No fallback mechanism was in place for object storage uploads.
  • Monitoring did not quickly identify the downstream dependency as the root cause.

Remediation

  • Make the object storage upload process asynchronous
  • Add more aggressive timeouts and retries for upload operations
  • Investigate procedures for manually scaling capacity during incidents
  • Add monitoring for object storage upload errors
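The first remediation item, making uploads asynchronous, can be sketched roughly as follows (names and structure are illustrative, not our production code): the Call Worker enqueues the recording and returns immediately, while a separate background thread drains the queue, so a stalled storage provider no longer blocks call handling.

```python
import queue
import threading

upload_queue = queue.Queue()

def enqueue_recording(recording_bytes):
    """Called from the call path; never blocks on the storage provider."""
    upload_queue.put(recording_bytes)

def upload_worker(upload_fn, stop_event):
    """Background drain loop; upload failures no longer consume Call Workers."""
    while not stop_event.is_set():
        try:
            rec = upload_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            upload_fn(rec)
        except Exception:
            upload_queue.put(rec)  # requeue for a later retry (simplified)
        finally:
            upload_queue.task_done()
```

A real implementation would also need a durable queue and a cap on retries, but the key property is the same: the failure domain of the storage provider is separated from the failure domain of call acceptance.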

If working on realtime distributed systems excites you, consider applying:

https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5

Updated
Dec 17 at 11:20am PST

The system has recovered. The team is monitoring, and we will update here with a full RCA.

Created
Dec 17 at 11:06am PST

We have detected an issue with our call workers not scaling to meet demand. The team is investigating and will update here.