API is degraded
Resolved
Sep 24 at 01:48pm PDT
We have identified the root cause of the issue and deployed a fix. Everything is good now.
Here's what happened:
1. Most of our API pods' DB pooler connections became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see below on remediation).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (api.vapi.ai) was back to full capacity in 13 minutes.
5. With production in the clear, we moved on to root cause analysis on the abandoned cluster.
6. It's unclear what triggered the deadlock simultaneously on multiple pods, but our best guess is something on our DB provider's side (Supabase).
7. It's also possible that one pod deadlocked first and caused additional load on the others, triggering the same deadlock mechanism across the rest.
8. Our last hypothesis is a client-side library bug (Postgres.js), though it's unclear why it would trigger on multiple pods simultaneously.
9. Either way, we have enough data to build remediations and prevent another incident of this kind.
Remediations:
1. Within our Kubernetes health checks for the API pods, we are adding a dummy query (SELECT now()) to actually check the viability of the connection (see the sketch after this list).
2. This does add the risk of API pods becoming completely unresponsive in the event of a DB outage, but that's acceptable since the DB being down would be a clear root cause in that case.
3. With this check in place, Kubernetes will take pods with a non-viable connection out of rotation and restart them, preventing a partial or full outage.
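For illustration, here is a minimal sketch of what such a DB-aware health check could look like in a Node.js API using Express and Postgres.js. The endpoint path (/healthz), port, 2-second timeout, and DATABASE_URL environment variable are assumptions for the example; only the SELECT now() probe query comes from the remediation above.

```ts
import express from "express";
import postgres from "postgres";

// Hypothetical connection string env var; Postgres.js manages the connection pool.
const sql = postgres(process.env.DATABASE_URL!);
const app = express();

app.get("/healthz", async (_req, res) => {
  // Fail the probe if the pooled connection can't answer within 2 seconds
  // (the timeout value is an assumption, not from the incident report).
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("health check query timed out")), 2000)
  );
  try {
    // Dummy query that forces a round trip through the DB pooler, so a
    // deadlocked pool fails the probe instead of passing silently.
    await Promise.race([sql`SELECT now()`, timeout]);
    res.status(200).send("ok");
  } catch {
    // Kubernetes treats a non-2xx response as a failed probe and will pull
    // the pod from rotation / restart it after the failure threshold.
    res.status(503).send("db connection not viable");
  }
});

app.listen(8080);
```

Pointing the pod's readinessProbe (and/or livenessProbe) at this endpoint gives Kubernetes the signal described in item 3: a pod whose pool cannot complete the dummy query is taken out of rotation and restarted.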
Affected services
Vapi API
Created
Sep 24 at 01:23pm PDT
Requests to the API are experiencing higher latency, including timeouts for 30-40% of requests, resulting in partial downtime. This includes requests to start calls. We're investigating ASAP.
Affected services
Vapi API