SIP calls are degraded

Nov 7th 2025 SIP service degradation

Summary

On Friday, November 7th, 2025, one of our SIP gateways failed, disrupting inbound and outbound Vapi SIP calls between 10:30 AM and 12:15 PM PST.

Context

All Vapi SIP calls go through our SIP infrastructure, which handles SIP trunking, authentication, and registration. When an inbound SIP call arrives, the SIP SBC authenticates and validates it, then makes a webhook call to our API server to register the call. Once a call is registered, the SBC establishes a bidirectional WebSocket connection (via a WebSocket proxy) to the call workers for real-time call processing and audio streaming.

Root Cause

Our SIP gateway runs on dedicated infrastructure that hosts stateful workloads. This part of our infrastructure was missing log archival configuration. Over time, application logs accumulated and filled the available disk space, causing the server to crash and become unresponsive. The issue was compounded by the absence of disk space monitoring and alerting, which delayed our detection and response.

Resolution

Once the issue was identified, our engineering team took the following actions:
- Cleared accumulated logs to restore available disk space
- Restarted SIP gateway services and validated recovery
- Implemented immediate log rotation on the affected host
- Verified all SIP services were operational before resuming normal operations
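
To illustrate why log rotation bounds disk usage, here is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`. It is not the configuration applied on the affected host; the logger name, file name, and size limits are assumptions chosen for the example.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(log_path: str, max_bytes: int, backup_count: int) -> logging.Logger:
    """Create a logger whose files never grow past max_bytes each.

    At most backup_count rotated files are kept, so total disk usage is
    bounded by roughly max_bytes * (backup_count + 1).
    """
    # Hypothetical logger name; one logger per path keeps the sketch isolated.
    logger = logging.getLogger(f"sip-gateway-sketch:{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory: str) -> int:
    """Sum the sizes of all files in the log directory."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        log = make_rotating_logger(os.path.join(log_dir, "gateway.log"), 1024, 2)
        for i in range(500):
            log.info("call %d registered", i)
        for h in list(log.handlers):
            h.close()
            log.removeHandler(h)
        print(sorted(os.listdir(log_dir)), total_log_bytes(log_dir))
```

Without a cap like `backupCount`, the oldest files are never deleted, which is the failure mode described in the root cause.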

What We’re Doing to Prevent This

Immediate Actions (Completed)
- Deployed disk space monitoring with alerts at 75% utilization
- Fixed SIP gateway metrics-based alerts to detect node failures and missing metrics
- Added volume-based alerts for all stateful SIP instances

Expected results: Early detection of issues affecting SIP gateway instances including high disk usage, node failures, or no metrics, so that any disruption to call processing can be identified and resolved before impacting customers.
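
The two alert conditions above (high disk usage and missing metrics) can be sketched as simple checks. This is an illustration only, not our monitoring stack: the function names are hypothetical, and the 120-second staleness window is an assumed value. Only the 75% threshold comes from the actions listed above.

```python
import shutil
import time
from typing import Optional

DISK_ALERT_THRESHOLD = 0.75      # 75% utilization, as stated above
METRICS_MAX_AGE_SECONDS = 120.0  # assumed staleness window, not from the report


def disk_utilization(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(utilization: float, threshold: float = DISK_ALERT_THRESHOLD) -> bool:
    """Fire when utilization reaches or exceeds the threshold."""
    return utilization >= threshold


def metrics_missing(
    last_seen: Optional[float],
    now: float,
    max_age: float = METRICS_MAX_AGE_SECONDS,
) -> bool:
    """Fire when a node has never reported, or its last report is too old.

    A crashed node stops emitting metrics entirely, so the absence of data
    has to be treated as an alert condition in its own right.
    """
    return last_seen is None or (now - last_seen) > max_age


if __name__ == "__main__":
    used = disk_utilization("/")
    if disk_alert(used):
        print(f"ALERT: disk at {used:.0%}")
    if metrics_missing(last_seen=None, now=time.time()):
        print("ALERT: no metrics received from node")
```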

Short-Term Actions (In Progress – 30 Days)

- Implement comprehensive per-node health monitoring with automated alerting
- Enhance our synthetic phone health checks to test individual SIP nodes for stateful service health
- Deploy hot standby SIP instances for immediate failover capability

Expected results: Capture all functional issues at the individual SIP instance level, and ensure that in the event of a failure, we can immediately fail over manually to a standby SIP gateway instance to remediate quickly.
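
One common way to probe an individual SIP node is to send it a SIP `OPTIONS` request and treat any well-formed SIP response as a sign that the signaling stack is alive. The sketch below shows that idea over UDP. It is an assumption about how such a check could look, not a description of our synthetic health checks; the `healthcheck` user, the port, and the timeout are illustrative.

```python
import socket
import uuid
from typing import Optional


def build_options_request(node_host: str, node_port: int,
                          local_host: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request addressed to one node."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie prefix
    lines = [
        f"OPTIONS sip:{node_host}:{node_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:healthcheck@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{node_host}:{node_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_status_code(response: bytes) -> Optional[int]:
    """Return the status code from a SIP response, or None if malformed."""
    first = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = first.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


def probe_node(host: str, port: int = 5060, timeout: float = 2.0) -> bool:
    """Return True if the node answers the OPTIONS ping with any SIP response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((host, port))
        local_host, local_port = sock.getsockname()
        sock.send(build_options_request(host, port, local_host, local_port))
        try:
            response = sock.recv(4096)
        except OSError:  # timeout or ICMP port unreachable
            return False
    return parse_status_code(response) is not None
```

Probing each node directly, rather than a shared entry point, is what lets a check catch a single unresponsive instance.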

Long-Term Improvements (Next 60 Days)

High Availability:

- Implement automated SIP failover based on instance health checks
- Perform quarterly automated failover tests to verify reliability

Expected results: Failed SIP instances are automatically removed and replaced with healthy nodes, ensuring minimal or no manual intervention and uninterrupted service continuity.
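
The failover behavior described above can be sketched as a small state machine: count consecutive failed health checks per node, remove a node from rotation once it crosses a threshold, and promote a hot standby in its place. The class names, node names, and the threshold of three failures are all assumptions made for illustration, not the planned implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SipNode:
    name: str
    consecutive_failures: int = 0


@dataclass
class SipPool:
    active: List[SipNode]
    standby: List[SipNode]
    failure_threshold: int = 3  # assumed value

    def record_check(self, name: str, healthy: bool) -> Optional[str]:
        """Record one health check result for an active node.

        Returns the name of the standby promoted as a replacement, or None
        if no failover happened.
        """
        node = next((n for n in self.active if n.name == name), None)
        if node is None:
            return None
        if healthy:
            # Any success resets the streak, so brief blips do not fail over.
            node.consecutive_failures = 0
            return None
        node.consecutive_failures += 1
        if node.consecutive_failures < self.failure_threshold:
            return None
        # Threshold reached: take the node out of rotation.
        self.active.remove(node)
        if not self.standby:
            return None
        replacement = self.standby.pop(0)
        self.active.append(replacement)
        return replacement.name


if __name__ == "__main__":
    pool = SipPool(
        active=[SipNode("sip-a"), SipNode("sip-b")],
        standby=[SipNode("sip-c")],
    )
    for _ in range(3):
        promoted = pool.record_check("sip-a", healthy=False)
    print(promoted, [n.name for n in pool.active])
```

Requiring several consecutive failures trades a slightly slower reaction for fewer spurious failovers; a periodic failover test would exercise this same path deliberately.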