Previous incidents
Cartesia is down; please use another Voice Provider in the meantime
Resolved Oct 23 at 11:08am PDT
Back to normal. You can follow the updates here: https://status.cartesia.ai.
Web call creation is degraded
Resolved Oct 22 at 01:04pm PDT
We haven't seen an error in the last 15 minutes, so we're resolving this for now. We'll update here if anything changes.
Deepgram is degraded; please switch to Gladia or Talkscriber
Resolved Oct 18 at 08:32am PDT
Deepgram was fully restored at 8:32am, ending a degradation of close to two hours.
Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter went down; they routed traffic to their AWS fallback, but latencies on their streaming endpoint remained extremely high (>10s).
This degradation shouldn't have affected our users; it's our job to have fallbacks in place to mitigate third-party risk in real time.
As an immediate action item, we...
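For anyone applying the workaround above, here's a minimal sketch of switching an assistant's transcriber off Deepgram through the API. The PATCH /assistant/{id} endpoint shape, the transcriber.provider field, and the base URL are assumptions for illustration only; check your own assistant configuration for the exact schema.

```typescript
// Minimal sketch (assumptions, not the documented API): move an assistant's
// transcriber off Deepgram while it is degraded.
const API_BASE = "https://api.example.com"; // hypothetical base URL
const API_KEY = process.env.API_KEY!;

async function switchTranscriber(assistantId: string, provider: "gladia" | "talkscriber") {
  // Assumed endpoint and payload shape -- verify against your own assistant config.
  const res = await fetch(`${API_BASE}/assistant/${assistantId}`, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ transcriber: { provider } }),
  });
  if (!res.ok) {
    throw new Error(`Failed to update assistant ${assistantId}: ${res.status}`);
  }
  return res.json();
}

// Example: fail over to Gladia for the duration of the incident.
// switchTranscriber("my-assistant-id", "gladia").then(console.log).catch(console.error);
```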
API is degraded
Resolved Oct 09 at 09:24am PDT
We're back.
RCA:
* At 9:15am PT: We were alerted to a big spike in "request aborted" errors.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long and blocking the requests behind them).
* By 9:25am: We scaled up and restarted the API pods. Everything returned to normal.
Action Items:
* We'll be setting a hard query timeout (Postgres statement_timeout) and returning a timeout error for any query that exceeds it, e.g. GET /assistant?limit=1000. (See the sketch after this list.)
* We'll be making API pods aware...
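To illustrate the first action item, here's a minimal sketch of enforcing a hard query timeout via Postgres statement_timeout from a Node service using node-postgres (pg). The 5-second value and the table and column names are placeholders, not our production configuration.

```typescript
import { Pool } from "pg";

// Minimal sketch: cap every query with Postgres statement_timeout so one slow
// request (e.g. GET /assistant?limit=1000) can't hold a connection indefinitely
// and block the requests queued behind it. 5s is a placeholder value.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5000, // ms; Postgres aborts any statement running longer than this
});

export async function listAssistants(limit: number) {
  try {
    // Table/column names are illustrative only.
    const { rows } = await pool.query(
      "SELECT * FROM assistant ORDER BY created_at DESC LIMIT $1",
      [limit]
    );
    return rows;
  } catch (err: any) {
    // Postgres reports a statement cancelled by statement_timeout with SQLSTATE 57014.
    if (err.code === "57014") {
      throw new Error("Query exceeded the hard timeout; narrow the request and retry.");
    }
    throw err;
  }
}
```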
API is degraded
Resolved Oct 09 at 02:27am PDT
Everything is back up for now.
Here's what happened:
* At 2:05am PT: We were alerted by Datadog to "cannot execute UPDATE in a read-only transaction" errors.
* By 2:15am: We determined the cause was an unhealthy pooler state and restarted the DB to force-reset all connections.
* By 2:25am: We were back up.
We have several hypotheses for how the pooler session state got mangled. We're tracking them down right now.
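Independent of the root cause, one defensive pattern for this failure mode (sketched below with node-postgres, and not taken from our actual codebase) is to treat the read-only error as a signal that the pooled connection is stale: destroy it rather than recycle it, then retry once on a fresh connection.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Minimal sketch: if an unhealthy pooler state leaves a connection pointed at a
// read-only session, writes fail with "cannot execute UPDATE in a read-only
// transaction". Destroy that connection instead of returning it to the pool,
// then retry once on a fresh one.
async function writeWithRetry(sql: string, params: unknown[], attempt = 0): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query(sql, params);
  } catch (err: any) {
    const staleReadOnly = /read-only transaction/.test(err?.message ?? "");
    // Passing a truthy value to release() destroys the underlying connection
    // rather than recycling it back into the pool.
    client.release(staleReadOnly ? true : undefined);
    if (staleReadOnly && attempt === 0) {
      return writeWithRetry(sql, params, attempt + 1);
    }
    throw err;
  }
  client.release();
}

// Example (hypothetical table/columns):
// await writeWithRetry("UPDATE calls SET status = $1 WHERE id = $2", ["ended", callId]);
```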
UPDATE: We spent several days going back and forth with Supabase on why our DB...
API is degraded
Resolved Oct 02 at 12:00pm PDT
Post-mortem
TL;DR
Human error on our end left us index-less on our biggest table (calls), driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake by us (the engineering team) in planning out the migration. We're sorry, and we aim to do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large real-time systems to ensure this never happens again.
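For context on the class of fix involved: the safe way to (re)build an index on a large, hot table like calls is CREATE INDEX CONCURRENTLY, which avoids the long write lock a plain CREATE INDEX would take. The sketch below is illustrative only; the index and column names are placeholders, not our actual schema or migration.

```typescript
import { Client } from "pg";

// Minimal sketch: rebuild a missing index on a large, hot table without blocking
// writes. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it
// runs on its own connection outside any migration transaction.
// The indexed column (created_at) is a placeholder, not our actual schema.
async function rebuildCallsIndex() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    await client.query(
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS calls_created_at_idx ON calls (created_at)"
    );
  } finally {
    await client.end();
  }
}
```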
Background Timeline
- Our Postgres DB CPU usage...