Previous incidents

June 2025
Jun 01, 2025
1 incident

API was down

Downtime

Resolved Jun 01 at 11:00pm PDT

The API was down due to user error during routine maintenance. Service has since been restored.

1 previous update

May 2025
May 26, 2025
1 incident

Users Unable to Sign In to Dashboard

Degraded

Resolved May 26 at 06:37pm PDT

Summary
Users experienced login issues with our dashboard due to an unintended deployment of a staging version to the production environment.

Timeline (in PT):

  • 3:17 PM: Internal engineers identified issues affecting developer workflows.
  • 4:19 PM: A breaking change was introduced and unintentionally deployed to production.
  • 4:38 PM: First customer reports surfaced; engineering team immediately escalated internally.
  • 4:43 PM: Public status page updated to notify customers.
  • 4:54 PM: Correc...

2 previous updates

May 18, 2025
1 incident

Cartesia voices are degraded

Degraded

Resolved May 18 at 01:56pm PDT

Everything is functional. We're still working with Cartesia to get to the bottom of the issue. We'll change the status back to degraded if the issue arises again during the investigation.

2 previous updates

May 13, 2025
4 incidents

Vapifault Worker Timeouts

Resolved May 13 at 10:31am PDT

RCA: Vapifault Worker Timeouts

TL;DR

On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeout issues. The following day, this issue reoccurred due to the prescaling limit...
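The gap between demand and prescaled capacity can be worked through as simple arithmetic. The sketch below uses only the two figures quoted in the summary (about 335 concurrent calls against a prescaled limit of 250). The function name `worker_shortfall` is hypothetical, and the one-call-per-worker mapping is an assumption made for illustration, not a statement about how the platform schedules calls.

```python
def worker_shortfall(concurrent_calls: int, prescaled_workers: int) -> int:
    """Return how many calls have no prescaled worker available.

    Assumption: each call occupies exactly one worker.
    """
    return max(0, concurrent_calls - prescaled_workers)


# Figures taken from the summary above.
shortfall = worker_shortfall(concurrent_calls=335, prescaled_workers=250)
print(shortfall)  # 85
```

Under that assumption, roughly 85 calls would have had to wait for newly scaled workers.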

providerfault-transport errors

Resolved May 13 at 10:29am PDT

RCA: Providerfault-transport-never-connected

Summary

During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected." This report focuses on the root cause analysis of the "providerfault-transport-never-connected" errors occurring during the increased call volume.

Timeline of Events (PT)

  • 10:26 AM: Significant spike in inbound call volume.
  • 10:26 – 10:40 AM: Intermittent H...

SIP calls abruptly closing after 30 seconds

Resolved May 13 at 10:27am PDT

RCA: SIP Calls Ending Abruptly

TL;DR

A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, causing the SIP service to use an incorrect (private) IP address when sending SIP requests. Consequently, users receiving these SIP requests attempted to respond to the wrong IP address, resulting in ACK timeouts.
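A post-rotation check along the following lines could flag this kind of mismatch. This is a minimal sketch, not the SIP service's actual tooling: the function name `check_advertised_ip` is hypothetical, and the addresses in the usage lines are placeholders.

```python
import ipaddress


def check_advertised_ip(advertised_ip: str, elastic_ip: str) -> str:
    """Compare the address a SIP service advertises with its Elastic IP."""
    advertised = ipaddress.ip_address(advertised_ip)
    expected = ipaddress.ip_address(elastic_ip)
    if advertised == expected:
        return "ok"
    if advertised.is_private:
        # Remote peers cannot route replies to a private address.
        return "restart needed: advertising a private address"
    return "restart needed: advertising an unexpected address"


# Placeholder addresses: a private node IP versus the reassigned EIP.
print(check_advertised_ip("10.0.1.15", "203.0.113.10"))
print(check_advertised_ip("203.0.113.10", "203.0.113.10"))
```

Running a check like this after each node rotation would surface the stale address before remote peers start sending replies to it.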

Timeline (PT)

  • May 12, ~9:00 pm: SIP node rotated and Elastic IP reassigned, but SI...

Stale data for Weekly users

Resolved May 13 at 10:22am PDT

RCA: Phone Number Caching Error in Weekly Environment

TL;DR

Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers.
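One way to guard against this class of bug is to bypass the cache whenever the organization ID is missing, so an incomplete call path fails safe instead of caching by accident. The sketch below illustrates that idea only; `get_server_url`, the `fetch` callback, and the in-memory cache are all hypothetical and do not reflect the actual fix.

```python
from typing import Callable, Dict, Optional

_cache: Dict[str, str] = {}


def get_server_url(
    phone_number_id: str,
    org_id: Optional[str],
    fetch: Callable[[str], str],
) -> str:
    """Look up a server URL, caching only when an organization ID is present."""
    if not org_id:
        # Without an org ID the caching policy cannot be resolved,
        # so always read the current value.
        return fetch(phone_number_id)
    key = f"{org_id}:{phone_number_id}"
    if key not in _cache:
        _cache[key] = fetch(phone_number_id)
    return _cache[key]
```

With this guard, a caller that omits the organization ID always sees the latest server URL, at the cost of an extra lookup.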

Timeline (PT)

  • May 10, 1:26 am: Caching re-enabled for users in daily envir...

May 10, 2025
1 incident

Voice issues due to 11labs quota

Resolved May 10 at 10:34am PDT

RCA: 11Labs Voice Issue

TL;DR

Calls began failing due to exceeding the 11Labs voice service quota, resulting in errors (vapifault-eleven-labs-quota-exceeded).

Timeline of Events (PT)

  • 12:04 PM: Calls begin failing due to 11Labs quota being exceeded.
  • 12:16 PM: Customer reports the issue as a production outage.
  • 12:24 PM: Contacted 11Labs support regarding quota exhaustion.
  • 12:25 PM: 11Labs support recommends enabling usage-based billing.
  • 12:26 PM: Usag...

May 02, 2025
1 incident

API degradation

Degraded

Resolved May 02 at 06:43pm PDT

RCA for May 2nd: User error in manual rollout

Root cause:

  • User error when kicking off a manual rollout that was intended to unblock a release
  • As a result, the load balancer was pointed at an invalid backend cluster
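A pre-flight check before repointing the load balancer is one common safeguard against this failure mode. The sketch below is a hedged illustration: `validate_backend` and the cluster names are hypothetical, and nothing here describes the actual rollout manager.

```python
from typing import Iterable


def validate_backend(target_cluster: str, healthy_clusters: Iterable[str]) -> None:
    """Raise ValueError if the rollout target is not a known healthy backend."""
    healthy = set(healthy_clusters)
    if target_cluster not in healthy:
        raise ValueError(
            f"refusing rollout: {target_cluster!r} is not a healthy backend"
        )


# Hypothetical cluster names; passes silently because the target is healthy.
validate_backend("cluster-b", ["cluster-a", "cluster-b"])
```

A manual rollout would call this before switching traffic and stop if it raises.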

Timeline

  • 5:24pm PT: An engineer flagged a blocked rollout; an infra engineer identified a transient error that had auto-blocked the rollout
  • 5:31pm PT: The infra engineer triggered a manual rollout on behalf of the engineer to unblock the release
  • 5:43pm PT: On-call was paged with issue in rollout manager, e...

2 previous updates

May 01, 2025
2 incidents

WebCalls from Dashboard are unavailable

Degraded

Resolved May 01 at 04:14pm PDT

We've applied a fix to the dashboard web call feature. The root cause was a misconfiguration of the authentication strategy used for this feature.

1 previous update

Issues with KYC verification

Degraded

Resolved May 01 at 04:10pm PDT

The team has applied a fix for the frontend issue. The root cause was a change related to authentication handlers. Additionally, our KYC vendors are extending the observability features in their SDK.

3 previous updates

April 2025
Apr 29, 2025
1 incident

Call Recordings May Fail For Some Users

Degraded

Resolved Apr 29 at 11:59pm PDT

We have resolved the issue. We will upload an RCA on 04/30 at noon PT.

TL;DR: Recordings weren't uploaded to object storage because of invalid credentials. We generated and applied new keys.

1 previous update

Apr 22, 2025
1 incident

Increased 404 Errors Related to Phone Numbers Found

Degraded

Resolved Apr 22 at 04:39am PDT

We have identified and resolved the issue. We will post an RCA by noon PT.

TL;DR: Adding a new CIDR range to our SIP cluster caused issues where the servers were unable to discover each other.
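The full RCA is not included here, so the exact discovery mechanism is unknown. As an illustration of one way a new CIDR range can break peer discovery, the sketch below assumes (hypothetically) that servers only accept peers inside a configured list of ranges and reports any peer that falls outside it. The function name and all addresses are made up for the example.

```python
import ipaddress
from typing import Iterable, List


def peers_outside_allowed_ranges(
    peer_ips: Iterable[str], allowed_cidrs: Iterable[str]
) -> List[str]:
    """Return the peer IPs not covered by any allowed CIDR range."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    return [
        ip
        for ip in peer_ips
        if not any(ipaddress.ip_address(ip) in net for net in networks)
    ]


# A peer in a newly added range is invisible until that range is allowed.
peers = ["10.0.1.5", "10.1.0.7"]
print(peers_outside_allowed_ranges(peers, ["10.0.0.0/16"]))
print(peers_outside_allowed_ranges(peers, ["10.0.0.0/16", "10.1.0.0/16"]))
```

The first call reports the peer in the new range; the second reports none once that range is added to the list.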

1 previous update

Apr 04, 2025
2 incidents

Degradation in phone calls stuck in queued state.

Degraded

Resolved Apr 04 at 12:00pm PDT

We resolved the issue, blocked the offending user, and reviewed rate limits.

1 previous update

Degradation in API

Degraded

Resolved Apr 04 at 09:00am PDT

The API rollback completed and errors subsided.

1 previous update

Apr 02, 2025
1 incident

Intermittent 503s in API

Degraded

Resolved Apr 03 at 11:00am PDT

The improvements we shipped reliably fixed the issue. The team has started on medium-term improvements and is investigating long-term scalability improvements.

2 previous updates

Apr 01, 2025
1 incident

Experiencing Anthropic rate limits on model calls

Degraded

Resolved Apr 01 at 08:04pm PDT

Anthropic rate limiting was resolved after the quota was raised.

1 previous update