Previous incidents

October 2024
Oct 23, 2024
1 incident

Cartesia is down, please use another Voice Provider in the meantime

Downtime

Resolved Oct 23 at 11:08am PDT

Back to normal. You can follow the updates here: https://status.cartesia.ai.

Oct 22, 2024
1 incident

Web call creation is degraded

Downtime

Resolved Oct 22 at 01:04pm PDT

We haven't seen an error in the last 15 minutes, so we're resolving for now. This will be updated if anything changes.

Oct 18, 2024
1 incident

Deepgram is degraded, please switch to Gladia or Talkscriber

Downtime

Resolved Oct 18 at 08:32am PDT

Deepgram was fully restored at 8:32am, ending a nearly 2-hour degradation.

Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter fell over; they routed traffic to their AWS fallback, but latencies on their streaming endpoint were still extremely high (>10s).

Ideally, this degradation shouldn't have affected our users; it's our job to ensure we have fallbacks in place to mitigate third-party risk in real time.
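
For illustration, here's a minimal sketch of the kind of real-time fallback we mean. The TranscriberProvider interface, the provider list, and the timeout value are all hypothetical, not our actual code:

```typescript
// Illustrative sketch only: try the primary transcriber, fall back on failure or
// excessive latency. Names below are hypothetical, not production code.

interface TranscriberProvider {
  name: string;
  transcribe(audio: Buffer): Promise<string>;
}

async function transcribeWithFallback(
  providers: TranscriberProvider[],
  audio: Buffer,
  timeoutMs = 3000,
): Promise<string> {
  for (const provider of providers) {
    try {
      // Treat a slow response the same as a failure so we don't stall a live call.
      return await Promise.race([
        provider.transcribe(audio),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error(`${provider.name} timed out`)), timeoutMs),
        ),
      ]);
    } catch (err) {
      console.warn(`${provider.name} failed, trying next provider`, err);
    }
  }
  throw new Error("All transcription providers failed");
}
```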

As an immediate action item, we...

Oct 09, 2024
2 incidents

API is degraded

Degraded

Resolved Oct 09 at 09:24am PDT

We're back.

RCA:
* At 9:15am PT: We were alerted to a large spike in aborted requests.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long and blocking others).
* By 9:25am: We scaled up and restarted the API pods. Everything returned to normal.

Action Items:
* We'll set a hard query timeout (Postgres statement_timeout) and return a timeout error for queries that exceed it, e.g. GET /assistant?limit=1000. See the sketch after this list.
* We'll be making API pods aware...
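
For illustration, a hedged sketch of what the statement_timeout change could look like with node-postgres. The assistants table, the query, and the handler shape are assumptions for the example, not our actual code:

```typescript
// Sketch: enforce a hard Postgres statement_timeout so a single slow query
// (e.g. GET /assistant?limit=1000) can't head-of-line block other requests.
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  // Abort any statement running longer than 5s; Postgres raises error 57014.
  statement_timeout: 5000,
});

// Hypothetical handler, not our actual API code.
async function listAssistants(limit: number) {
  try {
    const { rows } = await pool.query(
      "SELECT * FROM assistants ORDER BY created_at DESC LIMIT $1",
      [Math.min(limit, 100)], // also cap the requested limit itself
    );
    return { status: 200, body: rows };
  } catch (err: any) {
    if (err.code === "57014") {
      // query_canceled due to statement_timeout -> surface a timeout to the caller
      return { status: 504, body: { error: "Query timed out" } };
    }
    throw err;
  }
}
```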

API is degraded

Downtime

Resolved Oct 09 at 02:27am PDT

Everything is back up for now.

Here's what happened:
* At 2:05am PT: We were alerted by Datadog to "cannot execute UPDATE in a read-only transaction" errors.
* By 2:15am: We determined the cause was unhealthy connection pooler state and restarted the DB to force-reset all connections.
* By 2:25am: We were back up.

We have several hypotheses about how the pooler session state got mangled. We're tracking them down right now.

UPDATE: We spent several days going back and forth with Supabase on why our DB...

Oct 02, 2024
1 incident

API is degraded

Downtime

Resolved Oct 02 at 12:00pm PDT

Post-mortem

TL;DR

Human error on our end left our biggest table, calls, without an index, driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake by us (the engineering team) in planning the migration. We're sorry, and we aim to do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large real-time systems to ensure this never happens again.
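
For illustration, the safer pattern is to build the replacement index concurrently before dropping the old one, so the table is never left unindexed. A minimal sketch assuming node-postgres; the index and column names are hypothetical, not our actual schema or migration:

```typescript
// Sketch of a create-before-drop index migration on a large, hot table.
import { Client } from "pg";

async function migrateCallsIndex(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // Build the new index first, without blocking writes.
    // (CREATE INDEX CONCURRENTLY cannot run inside a transaction block.)
    await client.query(
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS calls_org_id_created_at_idx ON calls (org_id, created_at)",
    );
    // Only after the new index is live and verified, drop the old one.
    await client.query("DROP INDEX CONCURRENTLY IF EXISTS calls_old_idx");
  } finally {
    await client.end();
  }
}
```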

Background Timeline

  1. Our Postgres DB CPU usage...
