Vapifault Worker Timeouts

May 13 at 10:31am PDT
Affected services
Vapi API
Vapi API [Weekly]

Resolved
May 13 at 10:31am PDT

RCA: Vapifault Worker Timeouts

TL;DR

On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue reoccurred due to the prescaling limit being inadvertently reset to the lower default value during a routine deployment.
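The capacity shortfall described above can be sketched as a simple model. The numbers (335 concurrent calls, 250 prescaled workers, 350 after manual scaling) come from this report; the function and parameter names are illustrative, and treating Lambda and pod scale-up as contributing zero capacity is an assumption based on the constraints noted above.

```python
# Hypothetical capacity model of the May 12 incident: demand for web-based or
# long-running calls exceeds the prescaled worker pool, and neither Lambda
# fallback nor Kubernetes pod scale-up arrives in time to absorb the overflow.

def unserved_calls(demand: int, prescaled_workers: int,
                   lambda_capacity: int = 0, scaled_up_in_time: int = 0) -> int:
    """Calls left waiting for a worker (and thus timing out) in this model."""
    capacity = prescaled_workers + lambda_capacity + scaled_up_in_time
    return max(0, demand - capacity)

# May 12: 335 concurrent calls vs. 250 prescaled workers, with Lambda
# unavailable and pod scale-up too slow -> 85 calls cannot get a worker.
print(unserved_calls(demand=335, prescaled_workers=250))  # 85

# After the manual scale to 350 workers, the same demand is fully served.
print(unserved_calls(demand=335, prescaled_workers=350))  # 0
```

This also illustrates why the May 13 recurrence was identical in shape: the deployment reset the prescale back to 250, restoring the same shortfall.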

Timeline (PT)

  • May 12, 1:30 pm: Customer reports issues related to worker timeouts.
  • May 12, 4:39 pm: Another customer reports the same issue with worker timeouts.
  • May 12, 5:19 pm: Workers scaled manually from 250 to 350; service restored.
  • May 12, 11:48 pm: Routine deployment resets worker prescale count back to 250.
  • May 13, 10:47 am: Customer reports recurrence of worker timeout issue.
    • A concurrent increase in overall call volume further strains worker availability.
  • May 13, 11:29 am: Workers scaled again to 350 on weekly and increased to 750 on daily; service fully restored.

Impact

  • Approximately 2,461 calls dropped due to worker connection timeouts.

What Went Wrong?

  • Insufficient Monitoring: Worker timeout events were not correctly captured by monitoring because of how callEndedReason is logged.
    • Customers identified and reported the issue before internal monitoring did.
  • Configuration Drift: Prescale worker count change was not committed to the main configuration branch, causing resets during routine deployments.
  • Alert Handling: Lambda invocation alerts fired but were deprioritized as "requires investigation but not urgent."
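The monitoring gap above can be sketched as follows: if worker-timeout terminations are logged under a `callEndedReason` value that the alert query does not match, the failures never reach a dashboard and customers notice first. The reason strings and record shape here are hypothetical, not Vapi's actual schema.

```python
# Hypothetical sketch of an alert query over callEndedReason values. The gap
# occurs when a timeout-related reason string is missing from the alertable
# set, so those terminations are silently counted as ordinary call endings.
from collections import Counter

# Assumed reason strings for worker-timeout terminations (illustrative).
ALERTABLE_REASONS = {
    "vapifault-worker-timeout",
    "worker-connection-timeout",
}

def alertable_count(call_records: list[dict]) -> int:
    """Count ended calls whose callEndedReason should trigger an alert."""
    reasons = Counter(r.get("callEndedReason") for r in call_records)
    return sum(reasons[reason] for reason in ALERTABLE_REASONS)

calls = [
    {"callEndedReason": "customer-ended-call"},
    {"callEndedReason": "vapifault-worker-timeout"},
    {"callEndedReason": "worker-connection-timeout"},
]
print(alertable_count(calls))  # 2
```

The fix implied by the RCA is to ensure every timeout-related reason value is enumerated in the alert condition, so internal monitoring fires before customer reports arrive.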

What Went Well?

  • Rapid remediation once the problem was identified.