All services are online
Previous incidents
Vapifault Worker Timeouts
Resolved May 13 at 10:31am PDT
RCA: Vapifault Worker Timeouts
TL;DR
On May 12, approximately 335 concurrent calls were either web-based or exceeded 15 minutes in duration, surpassing the prescaled worker limit of 250 on the weekly environment. Due to infrastructure constraints, Lambda functions could not supplement the increased call load. Kubernetes call-worker pods could not scale quickly enough to meet demand, resulting in worker timeouts. The following day, the issue recurred due to the prescaling limit...
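The capacity gap described above can be made concrete with a small calculation. This is only an illustrative sketch: the function name `worker_shortfall` is hypothetical, and it assumes each such call occupies exactly one prescaled worker, which the summary does not state explicitly.

```python
def worker_shortfall(concurrent_calls: int, prescaled_workers: int) -> int:
    """Return how many calls exceed the prescaled worker pool.

    Assumption (not confirmed by the RCA): one worker serves one call.
    """
    return max(0, concurrent_calls - prescaled_workers)


# Figures taken from the TL;DR: ~335 concurrent calls against a limit of 250.
shortfall = worker_shortfall(335, 250)
print(f"Calls without a prescaled worker: {shortfall}")
```

Under that assumption, roughly 85 calls would have had to wait for newly scaled pods.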
providerfault-transport errors
Resolved May 13 at 10:29am PDT
RCA: Providerfault-transport-never-connected
Summary
During a surge in inbound call traffic, two distinct errors were observed: "vapifault-transport-worker-not-available" and "providerfault-transport-never-connected". This report covers the root cause of the "providerfault-transport-never-connected" errors that occurred during the increased call volume.
Timeline of Events (PT)
- 10:26 AM: Significant spike in inbound call volume.
- 10:26 – 10:40 AM: Intermittent H...
SIP calls abruptly closing after 30 seconds
Resolved May 13 at 10:27am PDT
RCA: SIP Calls Ending Abruptly
TL;DR
A SIP node was rotated, and the associated Elastic IP (EIP) was reassigned to the new node. However, the SIP service was not restarted afterward, so it used an incorrect (private) IP address when sending SIP requests. Users receiving these requests then attempted to respond to the wrong IP address, resulting in ACK timeouts.
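A post-rotation check of the kind this failure suggests might look like the sketch below. It is not taken from the RCA: the function name `check_advertised_address` and the sample addresses are hypothetical, and it assumes the address the SIP service places in its requests can be read and compared with the node's Elastic IP.

```python
import ipaddress


def check_advertised_address(advertised: str, elastic_ip: str) -> str:
    """Classify the address a SIP service uses against the expected EIP.

    Both arguments are hypothetical inputs; how they are obtained
    (service config, cloud API) is outside the scope of this sketch.
    """
    adv = ipaddress.ip_address(advertised)
    eip = ipaddress.ip_address(elastic_ip)
    if adv == eip:
        return "ok"
    if adv.is_private:
        # The failure mode described above: a private address leaks
        # into outgoing requests, so replies cannot reach the node.
        return "private-address-advertised"
    return "mismatch"


# Sample values are illustrative only.
print(check_advertised_address("10.0.1.23", "203.0.113.10"))
```

Running a check like this after every node rotation, and restarting the service on any result other than "ok", is one possible safeguard.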
Timeline (PT)
- May 12, ~9:00 pm: SIP node rotated and Elastic IP reassigned, but SI...
Stale data for Weekly users
Resolved May 13 at 10:22am PDT
RCA: Phone Number Caching Error in Weekly Environment
TL;DR
Certain code paths allowed caching functions to execute without an associated organization ID, preventing correct lookup of the organization's channel. This unintentionally enabled caching for the weekly environment, specifically affecting inbound phone call paths. Users consequently received outdated server URLs after updating phone numbers.
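One way to guard against this class of bug is to make the cache fail closed when no organization ID is available. The sketch below is an assumption-laden illustration, not the actual fix: `OrgScopedCache`, `get_or_load`, and the `caching_enabled_for` callback are hypothetical names standing in for the real channel lookup.

```python
from typing import Callable, Dict, Optional, Tuple


class OrgScopedCache:
    """Cache that bypasses itself unless the caller's org is known and opted in."""

    def __init__(self, caching_enabled_for: Callable[[str], bool]) -> None:
        # Hypothetical callback: returns True if the org's channel allows caching.
        self._enabled = caching_enabled_for
        self._store: Dict[Tuple[str, str], str] = {}

    def get_or_load(
        self, org_id: Optional[str], key: str, loader: Callable[[], str]
    ) -> str:
        # Fail closed: with no org ID the channel cannot be determined,
        # so always return fresh data instead of caching by default.
        if not org_id or not self._enabled(org_id):
            return loader()
        cache_key = (org_id, key)
        if cache_key not in self._store:
            self._store[cache_key] = loader()
        return self._store[cache_key]


# Illustrative usage: only the hypothetical "org-daily" has caching enabled.
server_urls = {"+15550100": "https://old.example"}
cache = OrgScopedCache(lambda org: org == "org-daily")
print(cache.get_or_load(None, "+15550100", lambda: server_urls["+15550100"]))
```

With this shape, a code path that forgets to pass the organization ID gets uncached, current data rather than a stale server URL.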
Timeline (PT)
- May 10, 1:26 am: Caching re-enabled for users in daily envir...