API is down

Misconfiguration on networking cluster. Resolved now.

Here's what happened:

Summary

On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls were unable to initiate, though existing connections remained largely unaffected.

Impact

Duration: 11 minutes
Service returned 521 errors for new inbound API calls
Existing API calls remained stable
Service was fully restored at 6:10 PM PT

Root Cause

The incident occurred when a configuration intended for our staging environment was accidentally applied to production during a routine debugging session. This resulted in the deletion of a critical API gateway configuration.

Timeline

5:59 PM PT - Accidental deletion of production configuration during staging environment debugging
6:00 PM PT - Monitoring systems detected service degradation
6:08 PM PT - Engineering team identified root cause
6:09 PM PT - Fix deployed (configuration restored)
6:10 PM PT - Full service recovery confirmed

Changes we've implemented

Changing namespace to include cluster name. networking > networking-staging and networking-production. This forces you to specify the environment while running kubectl commands.
Preventing deletion of resources that would never be expected to be deleted using Kubernetes deletion webhook.

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5