Deepgram is degraded, pleas...

Resolved
Oct 18, 2024 at 3:32pm UTC

Deepgram was fully restored at 8:32am, ending close to a 2h degradation.

Summary: Deepgram was degraded from ~6:12am PT to ~8:32am PT (status.deepgram.com). Their main datacenter fell over, they routed traffic to their AWS fallback, but the latencies on their streaming endpoint were still incredibly high (>10s).

Ideally, this degradation shouldn't have happened because it's our job to ensure we have fallbacks to mitigate 3rd party risks in real-time.

As an immediate action item, we're bringing back standby onprem deepgram into our clusters which would have let us cut this degradation to a couple minutes.

To give more detail:
We could have run Deepgram on-prem before, giving us control over any changes to the transcription model. Unfortunately, we had phased that out couple months ago because we saw better performance from their SaaS service:
1. They run on better GPUs including H100s (and soon H200s). AWS limits the GPUs we can get and scaling is unpredictable.
2. They are continually upgrading their Nvidia inference stack, including proprietary optimizations.
3. They ship continual updates and bug fixes to their SaaS offering compared to monthly updates to onprem.

This degradation alongside another from ElevenLabs earlier in the week (status.elevenlabs.io) has made it clear we need to prioritize redundancy further.
1. We need to have a tiered approach to falling back every piece of the stack.
2. We do this well with the assistant.model but assistant.voice and assistant.transcriber need it too.
3. This need will only get more acute with speech to speech models being the single point of failure.
4. We've been cautious with automated fallbacks because of how complex it is to get right (picking up exactly where the failure happened, etc.). But, it's now clear given our positioning as an orchestrator and critical infrastructure, we bear final accountability.

Reliability is our #1 priority, and this incident only makes us more committed to prioritizing it above all else.

Updated
Oct 18, 2024 at 3:20pm UTC

We have gotten an update from Deepgram that their main datacenter (S31) is back up. They expect ~20 more minutes of backlog batch work to transcribe and then things should be back to completely normal.

Updated
Oct 18, 2024 at 3:03pm UTC

Deepgram is still degraded. We're still waiting on Deepgram for more accurate estimates and information.

Meanwhile, we're spinning up a new cluster with onprem Deepgram but it will take ~30m to come up.

Created
Oct 18, 2024 at 1:31pm UTC

Deepgram is extremely degraded, https://status.deepgram.com

Please switch to Gladia or Talkscriber in the meanwhile.

We're spinning up remediations on our side, too.

Deepgram is degraded, please switch to Gladia or Talkscriber