API and calls were momentarily disrupted
Resolved
Aug 05 at 12:12pm PDT
IR August 4th: Call Degradation due to Pod Evictions
TL;DR
On August 4th, aggressive pod consolidation by Karpenter caused Redis pods to be evicted and restarted. This led to API pod failures, triggering a failover to an outdated networking component. A total of 393 calls were dropped as a result.
Timeline (PDT)
August 4th
- 11:02-11:27 AM - Core team identifies Karpenter pods in CrashLoopBackOff (OOMKilled due to high call volume), leading to aggressive pod consolidation.
- 11:27 AM - Redis pods evicted with the message "Evicted pod: Drifted." The Redis pods restart on new nodes, causing dependent API pods to fail.
- 11:28 AM - Cloudflare load balancer detects the failing API pods and initiates failover to a secondary networking component.
- 11:28-11:29 AM - The secondary networking component, outdated and improperly scaled, misroutes traffic, resulting in additional call failures.
- 11:29 AM - Worker unavailability due to misrouting causes a total of 393 call drops.
- 11:30 AM - Corrective rollout completes, restoring worker availability.
- 11:31 AM - Stability restored.
Root Cause
The incident was triggered by aggressive node consolidation by Karpenter following initial resource constraints. Critical Redis pods were evicted without their PodDisruptionBudgets (PDBs) being honored, causing API pod failures. Those failures initiated a Cloudflare load balancer failover to an outdated networking component, resulting in dropped calls.
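For context, a correctly scoped PodDisruptionBudget is intended to block voluntary evictions such as Karpenter's consolidation. A minimal sketch follows; the namespace, labels, and minAvailable value are assumptions for illustration, not the actual configuration:

```yaml
# Hypothetical PDB for the Redis pods; namespace, labels, and threshold are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  minAvailable: 1           # keep at least one Redis pod running during voluntary disruptions
  selector:
    matchLabels:
      app: redis
```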
Impact
- 393 total calls dropped due to worker unavailability.
- Temporary service disruption impacting Redis and API services.
What Went Well?
- Quick response by the incident response team.
What Went Poorly?
- Networking components were not maintained in parity, worsening the impact during failover.
- PodDisruptionBudgets (PDBs) for Redis pods were improperly configured, allowing unintended evictions.
- Lack of monitoring for Karpenter restarts delayed detection by several hours.
Remediation steps taken
- Increase memory limits for Karpenter in configuration management.
- Add the protective annotation `karpenter.sh/do-not-disrupt: "true"` to critical Redis pods.
- Integrate Karpenter logs with centralized logging for improved visibility.
- Implement monitoring to detect Karpenter pod restarts (illustrative configuration sketches follow this list).

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
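As an illustration of the annotation step, a minimal sketch of the do-not-disrupt annotation applied to a Redis pod template; the StatefulSet name, labels, replica count, and image are assumptions, not the actual deployment values:

```yaml
# Hypothetical Redis StatefulSet excerpt; names, labels, and image are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        karpenter.sh/do-not-disrupt: "true"  # ask Karpenter not to voluntarily disrupt this pod
    spec:
      containers:
        - name: redis
          image: redis:7
```

The memory-limit remediation could look like the following Helm values excerpt, assuming Karpenter is installed via its official Helm chart; the figures are illustrative, not the values actually applied:

```yaml
# Hypothetical Karpenter Helm values excerpt; resource figures are illustrative.
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 2Gi   # higher limit to reduce the chance of OOMKills under high call volume
```

And restart monitoring might resemble this Prometheus alerting rule, assuming kube-state-metrics is scraped and Karpenter runs in a namespace named karpenter; the window and labels are assumptions:

```yaml
# Hypothetical Prometheus alerting rule; namespace label and window are assumptions.
groups:
  - name: karpenter
    rules:
      - alert: KarpenterPodRestarting
        expr: increase(kube_pod_container_status_restarts_total{namespace="karpenter"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Karpenter pod {{ $labels.pod }} restarted within the last 15 minutes"
```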
Affected services
Vapi API
Created
Aug 04 at 11:14am PDT
Around 11:00 AM, a sudden surge in call volume caused connection failures. The same spike also disrupted the API, likely resulting in multiple 5xx errors.
Affected services
Vapi API