Previous incidents

October 2024
Oct 23, 2024
1 incident

Cartesia is down; please use another Voice Provider in the meantime

Downtime

Resolved Oct 23 at 11:08am PDT

Back to normal. You can follow the updates here: https://status.cartesia.ai.

Oct 22, 2024
1 incident

Web call creation is degraded

Downtime

Resolved Oct 22 at 01:04pm PDT

We haven't seen an error in the last 15 minutes, so we're resolving this for now. This will be updated if anything changes.

Oct 18, 2024
1 incident

Deepgram is degraded; please switch to Gladia or Talkscriber

Downtime

Resolved Oct 18 at 08:32am PDT

Deepgram was fully restored at 8:32am, ending a degradation that lasted close to 2 hours.

Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter fell over; they routed traffic to their AWS fallback, but latencies on their streaming endpoint were still extremely high (>10s).

Ideally, this degradation shouldn't have affected us, because it's our job to ensure we have fallbacks in place to mitigate third-party risk in real time.
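
For illustration, here's the shape of the transcriber fallback we mean, as a minimal sketch. The provider list, latency threshold, window size, and setTranscriber hook are hypothetical, not our actual implementation.

// Hypothetical sketch: fail over to a backup transcriber when the primary
// provider's streaming latency stays above a threshold.
type TranscriberProvider = 'deepgram' | 'gladia' | 'talkscriber';

const LATENCY_LIMIT_MS = 2_000; // assumed threshold; tune per provider
const SLOW_WINDOW = 5;          // consecutive slow responses before failing over

class TranscriberFailover {
  private slowCount = 0;

  constructor(
    private primary: TranscriberProvider,
    private backups: TranscriberProvider[],
    private setTranscriber: (p: TranscriberProvider) => void, // hypothetical hook into the call pipeline
  ) {}

  // Feed in the observed latency of each streaming response.
  recordLatency(ms: number): void {
    this.slowCount = ms > LATENCY_LIMIT_MS ? this.slowCount + 1 : 0;
    if (this.slowCount >= SLOW_WINDOW && this.backups.length > 0) {
      const next = this.backups.shift()!;
      console.warn(`${this.primary} degraded, failing over to ${next}`);
      this.primary = next;
      this.setTranscriber(next);
      this.slowCount = 0;
    }
  }
}

Resetting the counter on any fast response means a brief blip doesn't trigger a switch; only sustained degradation does.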

As an immediate action item, we...

Oct 09, 2024
2 incidents

API is degraded

Degraded

Resolved Oct 09 at 09:24am PDT

We're back.

RCA:
* At 9:15am PT: We were alerted to a large spike in aborted requests.
* By 9:20am: We identified the root cause as head-of-line blocking on the API pods (some requests were taking too long and blocking other requests).
* By 9:25am: We scaled up and restarted the API pods. Everything returned to normal.

Action Items:
* We'll set a hard query timeout (Postgres statement_timeout) and return a timeout error for queries that exceed it, e.g. GET /assistant?limit=1000 (see the sketch after this list).
* We'll be making API pods aware...
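
For reference, a minimal sketch of the statement_timeout item with node-postgres. The 5s cap, the assistant table, and the error mapping are illustrative, not our production values.

import { Pool } from 'pg';

// Cap every statement at the connection level; Postgres cancels anything slower.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 5_000, // ms; illustrative value
});

// Example: the query behind something like GET /assistant?limit=1000.
// If it exceeds the cap, Postgres raises error 57014 (query_canceled),
// which we can surface as a timeout instead of letting it block the pod.
export async function listAssistants(limit: number) {
  try {
    const { rows } = await pool.query(
      'SELECT * FROM assistant ORDER BY created_at DESC LIMIT $1', // hypothetical table/columns
      [limit],
    );
    return rows;
  } catch (err: any) {
    if (err.code === '57014') {
      throw new Error('Query timed out'); // map to an HTTP timeout for the caller
    }
    throw err;
  }
}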

API is degraded

Downtime

Resolved Oct 09 at 02:27am PDT

Everything is back up for now.

Here's what happened:
* At 2:05am PT: Datadog alerted us to "cannot execute UPDATE in a read-only transaction" errors.
* By 2:15am: We determined the cause was unhealthy pooler state and restarted the DB to force-reset all connections.
* By 2:25am: We were back up.

We have several hypotheses on how the pooler session state got mangled. We're tracking them down right now.
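
For context, the symptom itself is cheap to detect before real traffic hits it: a connection handed out in read-only mode reports transaction_read_only = on, which is exactly the state that produces the error above. A rough sketch, with an illustrative probe interval and alert hook:

import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Returns false if the pooler hands us a connection stuck in read-only mode.
async function poolerIsWritable(): Promise<boolean> {
  const client = await pool.connect();
  try {
    const { rows } = await client.query('SHOW transaction_read_only');
    return rows[0].transaction_read_only === 'off';
  } finally {
    client.release();
  }
}

// Illustrative wiring: probe every 30s and alert before users see write failures.
setInterval(async () => {
  if (!(await poolerIsWritable())) {
    console.error('DB pooler is handing out read-only connections'); // page on-call here
  }
}, 30_000);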

UPDATE: We spent several days going back and forth with Supabase on why our DB...

Oct 02, 2024
1 incident

API is degraded

Downtime

Resolved Oct 02 at 12:00pm PDT

Post-mortem

TL;DR

Human error on our end left us without an index on our biggest table, calls, driving DB CPU usage to 100% and causing API request timeouts. This was a tactical mistake by us (the engineering team) in planning out the migration. We're sorry, and we seek to do better than this. We've now engaged a Postgres scaling expert who has scaled multiple large-scale real-time systems to ensure this never happens again.

Background Timeline

  1. Our Postgres DB CPU usage...
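
For context on the remediation pattern (not the exact migration we ran; the index and column names below are made up): on a large, hot table like calls, an index should be rebuilt with CREATE INDEX CONCURRENTLY so writes aren't blocked while it builds.

import { Client } from 'pg';

// Illustrative migration: rebuild a missing index without locking writes.
// CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so it
// runs as a standalone statement on a direct connection.
async function rebuildCallsIndex() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    await client.query(
      'CREATE INDEX CONCURRENTLY IF NOT EXISTS calls_org_id_created_at_idx ' +
        'ON calls (org_id, created_at DESC)', // hypothetical columns
    );
  } finally {
    await client.end();
  }
}

rebuildCallsIndex().catch((err) => {
  // A failed CONCURRENTLY build leaves an invalid index that must be dropped and retried.
  console.error('Index build failed', err);
  process.exit(1);
});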

September 2024
Sep 24, 2024
1 incident

API is degraded

Degraded

Resolved Sep 24 at 01:48pm PDT

We have identified the root cause of the issue and deployed a fix. Everything is good now.

Here's what happened:
1. Most of our API pods' DB pooler connections became completely deadlocked.
2. This should have been caught by the Kubernetes health checks and/or our Uptime bot but was not (see below on remediation, and the sketch after this list).
3. We immediately scaled up our backup cluster and moved the traffic over.
4. The system (api.vapi.ai) was back to full capacity in 13m.
5. With production in clear, we g...
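
On point 2 above: a health check that only verifies the process is alive will pass even when every pooled DB connection is deadlocked. A minimal sketch of the shape of readiness check that would have caught this; the Express route, port, and timeout values are illustrative:

import express from 'express';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

// Readiness exercises the same DB pool the API uses, with a hard deadline,
// so a deadlocked pooler makes the pod unready instead of silently passing.
app.get('/ready', async (_req, res) => {
  const deadline = new Promise((_resolve, reject) =>
    setTimeout(() => reject(new Error('db check timed out')), 2_000),
  );
  try {
    await Promise.race([pool.query('SELECT 1'), deadline]);
    res.status(200).send('ok');
  } catch {
    res.status(503).send('db pool unhealthy'); // Kubernetes pulls the pod from rotation
  }
});

app.listen(3000);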

August 2024
Aug 14, 2024
1 incident

Call transfers are degraded

Degraded

Resolved Aug 14 at 06:30am PDT

We have identified the root cause of the issue and deployed a fix. The cause was an edge case that triggered an infinite loop on tool.messages.

We had a secondary issue that delayed resolution. Usually, we're able to move to our backup cluster with the last known working state ASAP. But we had unknowingly hit our AWS account limits, so the backup cluster couldn't scale to handle full volume, and it took some time to get hold of AWS and get more quota. We're auditing and setting u...
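
On the infinite-loop class of bug: the general guard is to bound any loop that walks configurable structures like tool.messages. A simplified, hypothetical sketch, not our actual message-processing code:

// Hypothetical guard: cap how many times tool.messages gets expanded so a
// cyclic or self-referencing configuration can't spin forever.
interface ToolMessage {
  type: string;
  content: string;
  followUp?: ToolMessage; // hypothetical field that could form a cycle
}

const MAX_TOOL_MESSAGE_EXPANSIONS = 50; // illustrative cap

function expandToolMessages(initial: ToolMessage[]): ToolMessage[] {
  const out: ToolMessage[] = [];
  const queue = [...initial];
  let iterations = 0;

  while (queue.length > 0) {
    if (++iterations > MAX_TOOL_MESSAGE_EXPANSIONS) {
      // Fail the transfer gracefully instead of hanging the call.
      throw new Error('tool.messages expansion exceeded safety limit');
    }
    const msg = queue.shift()!;
    out.push(msg);
    if (msg.followUp) queue.push(msg.followUp);
  }
  return out;
}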
