Previous incidents
ElevenLabs is degraded
Resolved Nov 14 at 01:08pm PST
Should be back to normal now, per ElevenLabs.
https://status.elevenlabs.io/
API is degraded
Resolved Nov 12 at 02:15pm PST
TL;DR: API pods were choked. Our probes missed it.
Root Cause
Our API experienced DB contention. Recent monitoring system changes meant our probes didn't pick up this contention and restart the pods.
Timeline
- November 12th 2:00pm PT - Customer reports of API failures
- November 12th 2:05pm PT - On-call team determined the cause, then scaled and restarted the pods
- November 12th 2:10pm PT - Full functionality restored.
Changes we've implemented
- Restored higher sensitivity thresholds fo...
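For illustration, a minimal sketch of the kind of DB-aware health check that would have caught this, assuming a Flask service and psycopg2; the endpoint name, timeout values, and DSN are placeholders, not our production setup:

```python
# Hypothetical readiness endpoint: fails fast when the DB is contended so the
# orchestrator's probe can restart the pod or pull it out of rotation.
import os

import flask
import psycopg2

app = flask.Flask(__name__)
DSN = os.environ.get("DATABASE_URL", "postgresql://localhost/app")  # placeholder


@app.route("/healthz")
def healthz():
    try:
        # A short statement_timeout makes a contended DB show up as a quick
        # error here instead of a probe that silently hangs.
        conn = psycopg2.connect(DSN, connect_timeout=2,
                                options="-c statement_timeout=500")
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        conn.close()
        return "ok", 200
    except Exception:
        # Non-2xx response: once the probe's failure threshold is exceeded,
        # the pod gets restarted.
        return "db unhealthy", 503


if __name__ == "__main__":
    app.run(port=8080)
```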
Phone calls are degraded
Resolved Nov 11 at 05:03pm PST
TL;DR: API gateway rejected WebSocket requests
Summary
On November 11, 2024, from 4:22 PM to 5:05 PM PST, our WebSocket-based calls experienced disruption due to a configuration issue in our API gateway. This affected both inbound and outbound phone calls in one of our production clusters.
Impact
- Duration: 43 minutes
- Affected services: WebSocket-based phone calls
- System returned 404 errors for affected connections
- Service was fully restored by routing traffic to our backup clu...
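For anyone reproducing this failure mode: a minimal handshake check, sketched with aiohttp against a placeholder gateway URL (not our real endpoint). A healthy gateway completes the upgrade with 101 Switching Protocols; during this incident it answered 404 instead:

```python
# Hypothetical probe: attempt a WebSocket handshake and report the HTTP status
# the gateway answers with. During this incident the gateway returned 404
# instead of completing the upgrade.
import asyncio

import aiohttp

GATEWAY_WS_URL = "wss://api.example.com/call/ws"  # placeholder URL


async def check_handshake(url: str) -> None:
    async with aiohttp.ClientSession() as session:
        try:
            async with session.ws_connect(url):
                print("handshake ok (101 Switching Protocols)")
        except aiohttp.WSServerHandshakeError as exc:
            # exc.status carries the gateway's HTTP status, e.g. 404 here.
            print(f"handshake rejected with HTTP {exc.status}")


if __name__ == "__main__":
    asyncio.run(check_handshake(GATEWAY_WS_URL))
```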
API is down
Resolved Nov 07 at 06:11pm PST
Misconfiguration on networking cluster. Resolved now.
Here's what happened:
Summary
On November 7, 2024, from 5:59 PM to 6:10 PM PT, our API service experienced an outage due to an unintended configuration change. During this period, new API calls could not be initiated, though existing connections remained largely unaffected.
Impact
- Duration: 11 minutes
- Service returned 521 errors for new inbound API calls
- Existing API calls remained stable
- Service was fully restored at 6...
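As a general pattern for API consumers (not an official SDK snippet): retrying call creation with exponential backoff turns a brief window of 521s into delayed requests rather than hard failures, provided the outage is shorter than the retry budget. The endpoint and payload below are placeholders:

```python
# Illustrative client-side retry with exponential backoff for transient 5xx
# responses such as the 521s returned during this outage.
import time

import requests

API_URL = "https://api.example.com/call"  # placeholder endpoint
RETRYABLE_STATUSES = {502, 503, 504, 521}


def create_call_with_retry(payload: dict, attempts: int = 5) -> requests.Response:
    delay = 1.0
    resp = None
    for _ in range(attempts):
        resp = requests.post(API_URL, json=payload, timeout=10)
        if resp.status_code not in RETRYABLE_STATUSES:
            return resp  # success, or an error retrying won't fix
        time.sleep(delay)
        delay = min(delay * 2, 30)  # cap the backoff
    return resp  # still failing after the retry budget is spent
```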
Cartesia is down, please use another Voice Provider in the meantime
Resolved Oct 23 at 11:08am PDT
Back to normal. You can follow the updates here: https://status.cartesia.ai.
Web calls creation is degraded
Resolved Oct 22 at 01:04pm PDT
We haven't seen an error in the last 15 minutes, so we're resolving for now. This will be updated if anything changes.
Deepgram is degraded, please switch to Gladia or Talkscriber
Resolved Oct 18 at 08:32am PDT
Deepgram was fully restored at 8:32am, ending a nearly 2-hour degradation.
Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter fell over; they routed traffic to their AWS fallback, but latencies on their streaming endpoint were still extremely high (>10s).
This degradation shouldn't have reached customers: it's our job to ensure we have fallbacks in place to mitigate third-party risk in real time.
As an immediate action item, we...
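For context, the general shape of such a fallback, as a rough sketch: the provider clients here are hypothetical stand-ins, not our real Deepgram/Gladia/Talkscriber integrations, and the latency budget is illustrative:

```python
# Hypothetical fallback wrapper: if the primary transcriber errors or blows
# the latency budget, cut over to a secondary provider for that chunk.
import time

LATENCY_BUDGET_S = 2.0  # illustrative; the incident saw >10s streaming latencies


def transcribe_with_fallback(audio_chunk: bytes, primary, secondary) -> str:
    start = time.monotonic()
    try:
        text = primary.transcribe(audio_chunk, timeout=LATENCY_BUDGET_S)
        if time.monotonic() - start <= LATENCY_BUDGET_S:
            return text
    except Exception:
        pass  # provider error or timeout: fall through to the secondary
    return secondary.transcribe(audio_chunk, timeout=LATENCY_BUDGET_S)
```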
API is degraded
Resolved Oct 09 at 09:24am PDT
We're back.
RCA:
* At 9:15am PT: We were alerted to a big spike in "request aborted" errors.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long, blocking other requests).
* By 9:25am: We scaled and restarted the API pods. Everything returned to normal.
Action Items:
* We'll be setting a hard query timeout (Postgres statement_timeout) and returning a timeout error for queries that exceed it, e.g. GET /assistant?limit=1000 (see the sketch below).
* We'll be making API pods aware...
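A minimal sketch of the statement_timeout item above, assuming psycopg2; the DSN, table name, and 5s value are placeholders. The same cap can also be applied once at the role level in Postgres (ALTER ROLE ... SET statement_timeout):

```python
# Sketch of the statement_timeout action item: cap how long any single query
# (e.g. a large GET /assistant?limit=1000 listing) can hold a worker.
import psycopg2

DSN = "postgresql://localhost/app"  # placeholder, not our real DSN

# Per-connection cap: every query on this connection is cancelled after 5s.
conn = psycopg2.connect(DSN, options="-c statement_timeout=5000")

with conn, conn.cursor() as cur:
    try:
        cur.execute("SELECT * FROM assistants LIMIT 1000")  # illustrative query
        rows = cur.fetchall()
    except psycopg2.errors.QueryCanceled:
        # Return a timeout to the caller instead of letting the slow query
        # block the requests queued behind it (head-of-line blocking).
        rows = None
```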
API is degraded
Resolved Oct 09 at 02:27am PDT
Everything is back up for now.
Here's what happened:
* At 2:05am PT: We were alerted by Datadog of "cannot execute UPDATE in a read-only transaction" errors.
* By 2:15am: We determined it was unhealthy pooler state and restarted the DB to force-reset all the connections.
* By 2:25am: We were back up.
We have several hypotheses on how the pooler session state got mangled. We're tracking them down right now.
UPDATE: We spent several days going back and forth with Supabase on why our DB...
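For illustration, one shape of the guard we can put in front of a pooled connection, sketched with psycopg2; the pool sizes and DSN are placeholders, not our production configuration:

```python
# Hypothetical guard: before handing out a pooled connection for writes,
# verify the session isn't stuck read-only; discard and replace it if it is.
import psycopg2
from psycopg2 import pool

DSN = "postgresql://localhost/app"  # placeholder
pg_pool = pool.SimpleConnectionPool(minconn=1, maxconn=10, dsn=DSN)


def get_writable_connection():
    conn = pg_pool.getconn()
    with conn.cursor() as cur:
        cur.execute("SHOW transaction_read_only")
        if cur.fetchone()[0] == "on":
            # Mangled session state (the failure mode in this incident):
            # close the connection rather than returning it to the pool.
            pg_pool.putconn(conn, close=True)
            raise RuntimeError("pooled connection is read-only; retry")
    conn.rollback()  # don't leave the health-check transaction open
    return conn
```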
API is degraded
Resolved Oct 02 at 12:00pm PDT
Post-mortem
TL;DR
Human error on our end left us index-less on our biggest table, calls, driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake on our part (the engineering team) in planning out the migration. We're sorry; we seek to do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large-scale real-time systems before, to ensure this never happens again.
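For reference, a minimal sketch of how an index can be rebuilt on a large, hot table without blocking writes, assuming psycopg2; the column name is illustrative (the calls table is the one named in the summary above):

```python
# Sketch: rebuild an index on a hot table without taking write locks.
# CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so the
# connection must be in autocommit mode.
import psycopg2

conn = psycopg2.connect("postgresql://localhost/app")  # placeholder DSN
conn.autocommit = True  # required for CONCURRENTLY

with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_calls_created_at "
        "ON calls (created_at)"
    )
```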
Background Timeline
- Our Postgres DB CPU usage...