Back to overview
Downtime

API is degraded

Oct 02 at 09:15am PDT
Affected services
Vapi API
Vapi DB

Resolved
Oct 02 at 12:00pm PDT

Post-mortem

TL;DR

Human error on our end left us without an index on our biggest table, calls, driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake on our part (the engineering team) in planning the migration. We're sorry, and we will do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large real-time systems before, to ensure this never happens again.

Background Timeline

  1. Our Postgres DB CPU usage has been steadily increasing under scaling pressure. Until recently, scaling up PG resources and adding simple indexes was enough, but that approach reached its limits and caused the Sept 24th outage. To be specific, while scaling resources lets PG handle a higher volume of requests, each individual request is still slow because it is bound by how quickly the CPU can move data into RAM. Slow requests hold their PG connections longer, increasing the chances of connection starvation and lock contention.
  2. We initiated a project to understand our query bottlenecks and find better patterns to scale from here on—sharding, partitioning, compound indexes and OLAP warehousing for analytics.
  3. Through this project, we found that our biggest table is calls and, as expected, list and aggregation queries against it were consuming the majority of CPU time. Since those queries follow the shape SELECT ... FROM call WHERE org_id=X ORDER BY created_at DESC, we decided to add a compound index on (org_id, created_at) to speed them up.
  4. We issued CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx ON call USING BTREE (org_id, created_at DESC) on Oct 1st at 10pm PT through the Supabase SQL editor.
  5. On Oct 2nd at 9am, seeing the index reported as successfully created in the Supabase UI, we dropped the simple index on (org_id) to nudge PG toward our compound index. (check remediations; see the verification sketch after this timeline)
  6. At 9am PT, our DB CPU usage spiked to 100%, causing API request timeouts and a thundering herd as Kubernetes restarted unhealthy pods.
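
A minimal sketch, assuming Postgres via the Supabase SQL editor, of the verification we should have run between timeline steps 4 and 5 before dropping anything (the old simple index name call_org_id_idx is assumed here for illustration):

  -- Build the compound index without blocking writes (timeline step 4).
  CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx
    ON call USING BTREE (org_id, created_at DESC);

  -- CREATE INDEX CONCURRENTLY can fail partway through and leave an INVALID
  -- index behind, so check the catalog rather than trusting the UI.
  SELECT c.relname, i.indisvalid, i.indisready
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  WHERE c.relname = 'call_org_id_created_at_idx';

  -- Confirm the planner actually picks the new index for the hot query shape
  -- (substitute a real org id for X, as in timeline step 3).
  EXPLAIN SELECT * FROM call WHERE org_id = X ORDER BY created_at DESC;

  -- Only once indisvalid is true (and the plan looks right) is it safe to
  -- drop the old simple index; call_org_id_idx is a hypothetical name.
  DROP INDEX CONCURRENTLY IF EXISTS call_org_id_idx;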

Incident Response

  1. At 9:05am PT, we were paged about the degradation and began investigating, not yet realizing that the migration described in the timeline above had caused it. (check remediations)
  2. By 9:15am PT, per our incident response playbook, we had failed over to our backup cluster, but that didn't help; the degradation kept getting worse as requests piled up in the API pods. We moved our investigation to the DB and noticed the spike in CPU usage.
  3. By 9:30am, in an attempt to reduce CPU usage, we released a change disabling some of the aggregation queries that were generating most of the load. It quickly became clear that this didn't help.
  4. By 9:45am, we discovered that step #4 from the timeline had in fact failed and the new index was INVALID. With the simple index already dropped, we were index-less on our biggest table, calls.
  5. By 10am, we had rebuilt the index and restored the system (a rough sketch follows this list). As a precautionary measure, we're keeping analytics queries disabled until we've fully sorted out our DB scaling.
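
A rough sketch of the detection and rebuild, assuming Postgres; illustrative rather than the exact commands we ran:

  -- Find indexes on call left INVALID by a failed CREATE INDEX CONCURRENTLY.
  SELECT c.relname AS index_name, i.indisvalid
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
  JOIN pg_class t ON t.oid = i.indrelid
  WHERE t.relname = 'call' AND NOT i.indisvalid;

  -- The documented recovery for a failed concurrent build: drop the invalid
  -- index and build it again concurrently.
  DROP INDEX CONCURRENTLY IF EXISTS call_org_id_created_at_idx;
  CREATE INDEX CONCURRENTLY IF NOT EXISTS call_org_id_created_at_idx
    ON call USING BTREE (org_id, created_at DESC);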

Remediations and Reflections

  1. As is clear from timeline #5 and incident response #1, this degradation fundamentally happened because we didn't realize our migration could fail, and it did fail. This failure mode was among our "unknown unknowns". The solution is to seek out a PG expert who has done these scaling migrations multiple times before and can help us bridge our unknown unknowns through their first-hand knowledge of different failure modes. We're on it and already have a couple of leads.
  2. Secondly, it was a big tactical mistake on our part to run the migration at 9am PT, right before peak traffic. The mounting pressure on the DB created urgency that clouded proper planning. We're sorry. We're implementing better procedures to analyze a change's potential impact and ease of rollback before pushing it out; the kind of type 1 / type 2 decision framework that's common in business strategy. This will be helped by finding experts in different aspects of scaling that we as the engineering org can tap into, similar to remediation #1.
  3. Lastly, we take infrastructure reliability deathly seriously and are really sorry about this error on our part. If you or someone you know is obsessed with infrastructure reliability, we'd love to chat. You can find our JD here: https://www.ycombinator.com/companies/vapi/jobs/BnVHTaQ-founding-senior-engineer-infrastructure

Updated
Oct 02 at 10:00am PDT

The system is back up barring analytics. Post-mortem to follow soon.

Updated
Oct 02 at 09:59am PDT

We have identified the bottleneck. The system is recovering and we're continuing to monitor.

Updated
Oct 02 at 09:41am PDT

DB expanded but CPU is still maxed out, continuing to investigate.

Updated
Oct 02 at 09:38am PDT

We're expanding DB resources to resolve the CPU spike and bottleneck. Expect complete downtime for the next 2 minutes. Post-mortem to follow soon.

Created
Oct 02 at 09:15am PDT

API is experiencing degraded performance, including timeouts when starting calls