API and calls were momentarily disrupted
Resolved
Aug 05 at 12:12pm PDT
IR August 4th: Call Degradation due to Pod Evictions
TL;DR
On August 4th, aggressive pod consolidation by Karpenter caused Redis pods to be evicted and restarted. This led to API pod failures, triggering a failover to an outdated networking component. A total of 393 calls were dropped as a result.
Timeline (PDT)
August 4th
- 11:02-11:27 AM - Core team identifies Karpenter pods in CrashLoopBackOff (OOMKilled due to high call volume), leading to aggressive pod consolidation.
- 11:27 AM - Redis pods evicted with the message "Evicted pod: Drifted." The Redis pods restart on new nodes, causing dependent API pods to fail.
- 11:28 AM - Cloudflare load balancer detects the failing API pods and initiates failover to a secondary networking component.
- 11:28-11:29 AM - The secondary networking component, outdated and improperly scaled, misroutes traffic, resulting in additional call failures.
- 11:29 AM - Worker unavailability due to misrouting causes a total of 393 call drops.
- 11:30 AM - Corrective rollout completes, restoring worker availability.
- 11:31 AM - Stability restored.
Root Cause
The incident was triggered by aggressive node consolidation by Karpenter following initial resource constraints. Critical Redis pods were evicted without their PodDisruptionBudgets (PDBs) being honored, causing API pod failures. Those failures initiated a Cloudflare load balancer failover to an outdated networking component, resulting in dropped calls.
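For context, a correctly scoped PodDisruptionBudget is intended to block voluntary evictions such as Karpenter's consolidation. A minimal sketch follows; the namespace, labels, and minAvailable value are assumptions for illustration, not the actual configuration:

```yaml
# Hypothetical PDB for the Redis pods; namespace, labels, and threshold are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  minAvailable: 1           # keep at least one Redis pod running during voluntary disruptions
  selector:
    matchLabels:
      app: redis
```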
Impact
- 393 total calls dropped due to worker unavailability.
- Temporary service disruption impacting Redis and API services.
What Went Well?
- Quick response by the incident response team.
What Went Poorly?
- Networking components were not maintained in parity, worsening the impact during failover.
- PodDisruptionBudgets (PDBs) for Redis pods were improperly configured, allowing unintended evictions.
- Lack of monitoring for Karpenter restarts delayed detection by several hours.
Remediation steps taken
- Increase memory limits for Karpenter in configuration management.
- Add the protective annotation `karpenter.sh/do-not-disrupt: "true"` to critical Redis pods.
- Integrate Karpenter logs with centralized logging for improved visibility.
- Implement monitoring to detect Karpenter pod restarts (illustrative configuration sketches follow this list).

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
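As an illustration of the annotation step, a minimal sketch of the do-not-disrupt annotation applied to a Redis pod template; the StatefulSet name, labels, replica count, and image are assumptions, not the actual deployment values:

```yaml
# Hypothetical Redis StatefulSet excerpt; names, labels, and image are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        karpenter.sh/do-not-disrupt: "true"  # ask Karpenter not to voluntarily disrupt this pod
    spec:
      containers:
        - name: redis
          image: redis:7
```

The memory-limit remediation could look like the following Helm values excerpt, assuming Karpenter is installed via its official Helm chart; the figures are illustrative, not the values actually applied:

```yaml
# Hypothetical Karpenter Helm values excerpt; resource figures are illustrative.
controller:
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      memory: 2Gi   # higher limit to reduce the chance of OOMKills under high call volume
```

And restart monitoring might resemble this Prometheus alerting rule, assuming kube-state-metrics is scraped and Karpenter runs in a namespace named karpenter; the window and labels are assumptions:

```yaml
# Hypothetical Prometheus alerting rule; namespace label and window are assumptions.
groups:
  - name: karpenter
    rules:
      - alert: KarpenterPodRestarting
        expr: increase(kube_pod_container_status_restarts_total{namespace="karpenter"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Karpenter pod {{ $labels.pod }} restarted within the last 15 minutes"
```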
Affected services
Vapi API
Created
Aug 04 at 11:14am PDT
Around 11:00 AM, a sudden surge in call volume caused connection failures. The same spike also disrupted the API, likely resulting in multiple 5xx errors.
Affected services
Vapi API