Elevated Call Failure Rate on Weekly
Resolved
Mar 21, 2026 at 05:46am UTC
Incident Report, March 19, 2026
Impact: A service disruption affected inbound and outbound call reliability on the Daily and Weekly channels. Some calls failed with the end reasons transport-never-connected, worker-not-available, worker-died, and deepgram-transcriber-failed.
Timeline (all times PDT):
12:20 PM We detected elevated call failure rates on the Weekly production cluster.
12:22 PM We published a status page incident and began investigating.
12:25 PM We identified the trigger as an unanticipated surge in call volume that exceeded our provisioned cluster capacity and downstream rate limits with a model provider.
12:30 PM We applied traffic controls and began working with the model provider to increase capacity. Call failures began declining.
1:40 PM Call success rates returned to normal and held stable. First incident window closed.
~4:00 PM A separate traffic spike pushed us past the same infrastructure capacity limits, leading to elevated call failures. We began investigating immediately.
4:00 PM to 4:40 PM We rebalanced traffic and migrated affected workloads to dedicated infrastructure to restore headroom on shared clusters.
4:50 PM All mitigations took effect. Call success rates returned to normal.
4:50 PM to 8:10 PM We continued active monitoring. No further failures observed.
8:10 PM Second incident window closed.
Immediate Action Items: Improve workload isolation and per-account capacity guardrails to prevent resource contention from cascading across the platform.
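The per-account guardrail described above could take the shape of a token-bucket rate limiter keyed by account, so one tenant's surge is rejected early rather than consuming shared cluster capacity. This is a hypothetical illustration of the idea, not the platform's actual implementation; the rate and burst values are placeholders.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-account capacity guardrail: allow at most `rate` calls/sec with `burst` headroom."""
    rate: float   # tokens refilled per second
    burst: float  # maximum bucket size
    tokens: float = field(default=0.0)
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        # Start full so a quiet account can absorb a short burst.
        self.tokens = self.burst

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # reject the call instead of letting the surge cascade

# One bucket per account isolates a single tenant's spike from shared capacity.
buckets: dict[str, TokenBucket] = {}

def admit(account_id: str, rate: float = 50.0, burst: float = 100.0) -> bool:
    bucket = buckets.setdefault(account_id, TokenBucket(rate=rate, burst=burst))
    return bucket.allow()
```

Rejecting excess calls at admission (with a retriable error) keeps failures scoped to the surging account, rather than surfacing as worker-not-available or worker-died errors across the whole cluster.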
Note: A full root cause analysis is underway and will be available upon request. We sincerely apologize for the disruption and thank you for your patience.
Affected services
Updated
Mar 20, 2026 at 04:23am UTC
The incident has been resolved and services have been stable since 4:50 PM PT. We will continue monitoring and will publish additional details if necessary.
Updated
Mar 20, 2026 at 03:09am UTC
Our earlier mitigation is working and services have been stable since 4:50 PM. We have identified a potential root cause and are working on a permanent fix. We will share further updates once we deploy and validate the fix.
Updated
Mar 20, 2026 at 12:22am UTC
The immediate mitigation we deployed is working and call success rates are continuing to recover. We are still investigating the root cause and closely monitoring service performance.
Updated
Mar 19, 2026 at 11:50pm UTC
We're seeing improved call success rates, but we're still monitoring the situation.
Created
Mar 19, 2026 at 11:17pm UTC
We're seeing elevated call failures on the Weekly channel, and the team is actively looking into it.