TL;DR

After recent infrastructure changes were deployed to backend-production1, Redis Sentinel pods began restarting because their liveness checks (/health/ping_sentinel.sh) were failing. The changes added a new IP range, which forced every cluster node to rotate. Once the Redis pods restarted, they kept failing health checks and restarting in a loop. A rollback restored API functionality. The entire cluster is being re-created to fix the DNS resolution failures before the change is rolled forward again.

Timeline

March 30th: New IP range and subnets added.
March 24th, 3:55 PM: Deployment to backend-production1 initiated.
March 24th, 4:14 PM: Deployment completed.
- Immediate increase in Redis errors observed in API pods.
- API pods scaled dramatically and restarted frequently.
- API service degraded with significant timeouts.
March 24th, 4:19 PM: Rollback initiated.
March 24th, 4:27 PM: Rollback completed; API service fully restored.

Resolution

A rollback to the previous stable configuration resolved the API timeouts. The cluster is now being fully re-created to permanently fix the underlying DNS resolution failures tied to the new IP range before any further deployments.

Impact

Approximately 2,670 API requests failed (5xx responses) or timed out.
Impacted areas included logs and database write operations.
Errors included Redis AudioCache failures, API database connection issues, and aborted API requests due to timeouts.

Root Cause

The rollout rotated all cluster nodes because of the subnet changes tied to the new IP range. DNS resolution failures associated with the new range caused Redis I/O operations to block on TCP connections, which then hung for extended periods. These hanging connections intermittently caused Redis pods to fail their liveness checks, putting them into a continuous restart loop.
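To illustrate how a hanging connection turns into a failed liveness check, here is a minimal sketch of a probe with a hard deadline, assuming plain unauthenticated TCP access to the Sentinel port. It is not the contents of /health/ping_sentinel.sh; the function name `ping_with_deadline` and the 2-second default are hypothetical. Port 26379 is Sentinel's default port.

```python
import socket


def ping_with_deadline(host: str, port: int = 26379, timeout: float = 2.0) -> bool:
    """Return True only if the server answers PING with +PONG in time.

    Hypothetical helper for illustration. Note that name lookup for
    `host` is NOT bounded by `timeout`; only the TCP connect and each
    later send/recv call are.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Redis accepts inline commands terminated by CRLF.
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
            return reply.startswith(b"+PONG")
    except OSError:
        # Covers refused connections, socket timeouts, and resolver errors.
        return False
```

A probe like this reports failure as soon as the connect or read stalls past the deadline, which matches the observed behavior: pods that are alive but stuck on blocked I/O still get restarted.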

API pods, which hold open connections to Redis, were blocked in the same way, leading to widespread API request timeouts and service degradation.
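One way a client can avoid blocking indefinitely on name resolution is to put a deadline on the lookup itself. The following is a sketch under stated assumptions, not the API's actual Redis client code: `resolve_with_deadline` and `_resolver_pool` are hypothetical names, and the approach simply runs the standard `socket.getaddrinfo` call in a worker thread and stops waiting after `timeout` seconds.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so that slow lookups do not block the calling thread.
_resolver_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_deadline(host: str, port: int, timeout: float = 1.0):
    """Resolve host:port, raising TimeoutError if the lookup takes too long.

    Returns a list of (address_family, sockaddr) tuples.
    """
    future = _resolver_pool.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    try:
        infos = future.result(timeout=timeout)
    except FutureTimeout:
        # The lookup keeps running in its worker thread; we only stop waiting.
        raise TimeoutError(f"DNS lookup for {host!r} exceeded {timeout}s") from None
    return [(family, sockaddr) for family, _type, _proto, _canon, sockaddr in infos]
```

The caller can then fail fast or retry instead of holding a request open. An abandoned lookup still occupies a worker thread until the resolver returns, so the pool size caps how many stalled lookups can pile up.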

The permanent resolution involves recreating the cluster entirely to address these DNS resolution issues comprehensively.

