Previous incidents

November 2024
Nov 14, 2024
1 incident

ElevenLabs is degraded

Degraded

Resolved Nov 14 at 01:08pm PST

Should be back to normal now, per ElevenLabs.
https://status.elevenlabs.io/

1 previous update

Nov 12, 2024
1 incident

API is degraded

Degraded

Resolved Nov 12 at 02:15pm PST

TL;DR: API pods were choked. Our probes missed it.

Root Cause

Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods.
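
For context, the class of check that should have caught this is a probe that does a real round-trip through the DB pool with a short deadline, so contention actually fails the probe and the pod gets restarted. A minimal sketch of that idea (Express + node-postgres, hypothetical endpoint and values, not our exact probe):

```ts
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Health handler wired to the Kubernetes liveness/readiness probe.
// It pushes a trivial query through the same pool the API uses, with a
// 2s deadline: if the pool is contended or wedged, the probe fails and
// the pod is restarted instead of silently serving slow requests.
app.get("/healthz", async (_req, res) => {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("health check timed out")), 2000)
  );
  try {
    await Promise.race([pool.query("SELECT 1"), deadline]);
    res.status(200).send("ok");
  } catch {
    res.status(503).send("db pool unhealthy");
  }
});

app.listen(8080);
```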

Timeline

  • November 12th 2:00pm PT - Customer reports of API failures
  • November 12th 2:05pm PT - On-call team determined the cause, then scaled and restarted the pods
  • November 12th 2:10pm PT - Full functionality restored.

Changes we've implemented

  1. Restored higher sensitivity thresholds fo...

1 previous update

Nov 11, 2024
1 incident

Phone calls are degraded

Degraded

Resolved Nov 11 at 05:03pm PST

TL;DR: API gateway rejected WebSocket requests

Summary

On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters.
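
For background on the failure mode: a WebSocket call starts life as an ordinary HTTP request carrying an Upgrade header, so a gateway route that doesn't match that path rejects the handshake with a plain HTTP error (404 here) before any audio ever flows. A minimal illustration with Node and the ws library (hypothetical path, not our gateway configuration):

```ts
import http from "http";
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ noServer: true });
const server = http.createServer();

// The handshake is a normal HTTP request with an Upgrade header.
// If the route table doesn't know the path, the client gets an HTTP
// error (404) instead of a WebSocket connection.
server.on("upgrade", (req, socket, head) => {
  if (req.url?.startsWith("/call/ws")) {
    wss.handleUpgrade(req, socket, head, (ws) => {
      ws.send("connected");
    });
  } else {
    socket.write("HTTP/1.1 404 Not Found\r\n\r\n");
    socket.destroy();
  }
});

server.listen(8080);
```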

Impact

  • Duration: 43 minutes
  • Affected services: WebSocket-based phone calls
  • System returned 404 errors for affected connections
  • Service was fully restored by routing traffic to our backup clu...

1 previous update

Nov 07, 2024
1 incident

API is down

Downtime

Resolved Nov 07 at 06:11pm PST

Misconfiguration on the networking cluster. Resolved now.

Here's what happened:

Summary

On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls could not be initiated, though existing connections remained largely unaffected.

Impact

  • Duration: 11 minutes
  • Service returned 521 errors for new inbound API calls
  • Existing API calls remained stable
  • Service was fully restored at 6...

1 previous update

October 2024
Oct 23, 2024
1 incident

Cartesia is down, please use another Voice Provider in the meantime

Downtime

Resolved Oct 23 at 11:08am PDT

Back to normal. You can follow the updates here: https://status.cartesia.ai.

1 previous update

Oct 22, 2024
1 incident

Web calls creation is degraded

Downtime

Resolved Oct 22 at 01:04pm PDT

We haven't seen an error in the last 15 minutes, so we're resolving for now. This will be updated if anything changes.

3 previous updates

Oct 18, 2024
1 incident

Deepgram is degraded, please switch to Gladia or Talkscriber

Downtime

Resolved Oct 18 at 08:32am PDT

Deepgram was fully restored at 8:32am, ending a degradation that lasted close to 2 hours.

Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter fell over; they routed traffic to their AWS fallback, but latencies on their streaming endpoint were still extremely high (>10s).

Ideally, this degradation shouldn't have impacted customers: it's our job to ensure we have fallbacks in place to mitigate third-party risk in real time.
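
The shape of the fallback we want here is straightforward: give the primary transcriber a hard latency budget and fall through to the next provider when it's exceeded. An illustrative sketch (the provider functions are hypothetical placeholders, not our production code):

```ts
type Transcriber = (audio: Buffer) => Promise<string>;

// Try each provider in order with a latency budget; on timeout or
// error, fall through to the next one.
// usage (hypothetical): transcribeWithFallback(chunk, [deepgramStt, gladiaStt, talkscriberStt])
async function transcribeWithFallback(
  audio: Buffer,
  providers: Transcriber[],
  budgetMs = 3000
): Promise<string> {
  for (const provider of providers) {
    try {
      return await Promise.race([
        provider(audio),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error("latency budget exceeded")), budgetMs)
        ),
      ]);
    } catch {
      continue; // degraded or slow: try the next provider
    }
  }
  throw new Error("all transcription providers failed");
}
```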

As an immediate action item, we...

3 previous updates

Oct 09, 2024
2 incidents

API is degraded

Degraded

Resolved Oct 09 at 09:24am PDT

We're back.

RCA:
* At 9:15am PT: We were alerted by a large spike in aborted requests.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long, blocking other requests).
* By 9:25am: We scaled up and restarted the API pods, and everything returned to normal.

Action Items:
* We'll set a hard query timeout (Postgres statement_timeout) and return a timeout error for queries that exceed it, e.g. GET /assistant?limit=1000 (see the sketch after this list).
* We'll be making API pods aware...
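
For reference on the first item, node-postgres lets you set Postgres's statement_timeout on every connection in the pool, so an oversized query fails fast instead of blocking the pod. A rough sketch (values and table name are placeholders):

```ts
import { Pool } from "pg";

// Any statement on connections from this pool is cancelled by Postgres
// if it runs longer than 5s, so one expensive list query
// (e.g. GET /assistant?limit=1000) can't block the whole pod.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5000, // ms
});

export async function listAssistants(limit: number) {
  // A timed-out query rejects with Postgres error 57014 (query_canceled);
  // the API layer can map that to a timeout response instead of hanging.
  const { rows } = await pool.query(
    "SELECT * FROM assistant ORDER BY created_at DESC LIMIT $1",
    [Math.min(limit, 1000)]
  );
  return rows;
}
```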

1 previous update

API is degraded

Downtime

Resolved Oct 09 at 02:27am PDT

Everything is back up for now.

Here's what happened:
* At 2:05am PT: We were alerted by Datadog of "cannot execute UPDATE in a read-only transaction" errors.
* By 2:15am: We determined the cause was an unhealthy pooler state and restarted the DB to force-reset all connections.
* By 2:25am: We were back up.

We have several hypotheses about how the pooler session state got mangled. We're tracking them down right now.
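
While we track that down, a short-term guard looks roughly like the sketch below (illustrative, not our exact code): treat the read-only error (SQLSTATE 25006) as a sign that the pooled connection's session state is bad, and discard that connection instead of returning it to the pool.

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// SQLSTATE 25006 = read_only_sql_transaction. If a write fails with it,
// the session behind that pooled connection is suspect, so destroy the
// connection rather than handing it back to the pool for reuse.
export async function runWrite(sql: string, params: unknown[]) {
  const client = await pool.connect();
  try {
    const result = await client.query(sql, params);
    client.release();
    return result;
  } catch (err: any) {
    const badSession = err?.code === "25006";
    client.release(badSession); // true => discard the connection
    throw err;
  }
}
```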

UPDATE: We spent several days going back and forth with Supabase on why our DB...

1 previous update

Oct 02, 2024
1 incident

API is degraded

Downtime

Resolved Oct 02 at 12:00pm PDT

Post-mortem

TL;DR

Human error on our end left us index-less on our biggest table, calls, which drove DB CPU usage to 100% and caused API request timeouts. This was a tactical mistake by us (the engineering team) in planning the migration. We're sorry, and we will do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large real-time systems to help ensure this never happens again.
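
For background, the standard way to change an index on a hot table without leaving it unindexed is CREATE INDEX CONCURRENTLY: build and validate the replacement before dropping the old index. A sketch of that kind of migration step (index and column names are hypothetical; this is not the actual migration we ran):

```ts
import { Client } from "pg";

// Swap an index on a hot table without blocking reads/writes:
// create the replacement CONCURRENTLY, verify it is valid, and only
// then drop the old one, so the table is never left without coverage.
async function swapCallsIndex() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // CONCURRENTLY cannot run inside a transaction block.
    await client.query(
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS calls_org_created_idx_new ON calls (org_id, created_at)"
    );
    const { rows } = await client.query(
      `SELECT i.indisvalid FROM pg_index i
       JOIN pg_class c ON c.oid = i.indexrelid
       WHERE c.relname = 'calls_org_created_idx_new'`
    );
    if (!rows[0]?.indisvalid) throw new Error("new index is invalid; aborting");
    await client.query("DROP INDEX CONCURRENTLY IF EXISTS calls_org_created_idx_old");
  } finally {
    await client.end();
  }
}
```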

Background Timeline

  1. Our Postgres DB CPU usage...

5 previous updates

September 2024
Sep 24, 2024
1 incident

API is degraded

Degraded

Resolved Sep 24 at 01:48pm PDT

We have identified the root cause of the issue and deployed a fix. Everything is good now.

Here's what happened:
1. Most of our API pods' DB pooler connections became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see below on remediation, and the sketch after this list).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (api.vapi.ai) was back to full capacity in 13m.
5. With production in clear, we g...
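
On point 2 above: one general-purpose safeguard is an in-process watchdog that periodically pushes a trivial query through the pooler with a deadline and exits after repeated failures, so the orchestrator replaces the pod even when the HTTP health probe happens to pass. A rough sketch of that idea (illustrative only, not necessarily what we shipped):

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Belt-and-braces watchdog, independent of the HTTP health probe:
// if a trivial query can't make it through the pooler within the
// deadline three times in a row, assume the connections are wedged
// and exit so the orchestrator restarts the pod.
let consecutiveFailures = 0;

setInterval(async () => {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("pooler check timed out")), 2000)
  );
  try {
    await Promise.race([pool.query("SELECT 1"), deadline]);
    consecutiveFailures = 0;
  } catch {
    consecutiveFailures += 1;
    if (consecutiveFailures >= 3) process.exit(1);
  }
}, 10_000);
```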

1 previous update