Degraded

sip.vapi.ai degradation

Mar 13 at 04:18pm PDT
Affected services
Vapi SIP

Resolved
Mar 17 at 09:00pm PDT

RCA: SIP 480 Failures (March 13-14)

Summary
On March 13 and 14, SIP calls intermittently failed with recurring 480 errors. The issue was traced to our SIP SBC service failing to communicate with the SIP inbound service. Restarting the SBC service restored service each time as a temporary mitigation. A long-term fix is planned: a transition to a more stable Auto Scaling Group (ASG) deployment.

Incident Timeline
(All times in PT)

March 13, 2025
07:00 AM – The SIP SBC pod started showing symptoms of failing to connect to the SIP inbound pod, resulting in intermittent 480 errors.
01:19 PM – A customer reported an increase in 480 SIP errors, prompting escalation to the infrastructure team.
01:30 PM – The infrastructure team took corrective action, and service was restored.

March 14, 2025
07:30 AM – A similar issue recurred, triggering monitoring alerts.
08:30 AM – The infrastructure team was engaged for remediation as failures persisted.
08:43 AM – The affected SIP SBC pod was deleted, restoring service.
09:43 AM – The issue reappeared, requiring repeated manual intervention.
Additional occurrences throughout the day:
11:10 AM – 11:17 AM
12:03 PM – 12:09 PM
01:04 PM – 01:22 PM
02:08 PM – 02:37 PM

Challenges Identified
Missing health checks – The failures appear to have been caused by a broken connection between the two services, and no health checks were in place to keep the connections intact.
Increased frequency – The number of occurrences was higher than usual, affecting many customers.
Delayed response on Day 1 – Service remained degraded for roughly six hours (07:00 AM to 01:19 PM) before a customer escalation prompted action.
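
To illustrate the kind of health check that was missing, here is a minimal sketch in Python. It is not Vapi's implementation: the class name, the `probe` and `heal` callbacks, and the threshold of three consecutive failures are all assumptions chosen for illustration. In practice the probe might be a SIP OPTIONS ping or a TCP connect from the SBC to the inbound service, and the heal action might be a reconnect or a restart.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConnectionHealthCheck:
    """Probe a downstream service and trigger a heal action after repeated failures.

    All names and defaults here are hypothetical, for illustration only.
    """

    probe: Callable[[], bool]   # returns True when the downstream service answers
    heal: Callable[[], None]    # e.g. reconnect, or restart the calling service
    failure_threshold: int = 3  # consecutive failures tolerated before healing
    consecutive_failures: int = field(default=0, init=False)

    def run_once(self) -> bool:
        """Run one probe; return True if the connection looks healthy."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # treat probe errors (timeouts, resets) as failures

        if ok:
            self.consecutive_failures = 0
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.heal()
            self.consecutive_failures = 0  # start counting again after healing
        return False
```

Calling `run_once()` on a fixed interval means a broken connection is detected and repaired without waiting for a customer report. The threshold avoids healing on a single transient failure.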

Positive Takeaways
Effective monitoring – Alerts triggered as expected, enabling swift identification of the issue.
Improved response time on Day 2 – The team responded more promptly to subsequent incidents.

Remediation Actions Taken
Enhance alerting mechanisms – Modified alerts to periodically refire when in an alarm state, ensuring timely on-call responses.
Transition to ASG-based deployment (planned) – Move SIP workloads from Kubernetes to an ASG-based infrastructure for improved stability.
Health check – Add a health check between the two services so that the system can auto-heal if the issue recurs.
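
The alert refiring behavior described above can be sketched as a small decision function. This is an illustrative sketch only, not the actual alerting configuration: the function name, its parameters, and the 15-minute refire interval are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_fire(
    in_alarm: bool,
    last_fired: Optional[datetime],
    now: datetime,
    refire_every: timedelta = timedelta(minutes=15),  # assumed interval
) -> bool:
    """Decide whether to notify on-call for an alert at time `now`."""
    if not in_alarm:
        return False  # nothing to report once the alert has cleared
    if last_fired is None:
        return True   # first notification for this alarm
    # Still in alarm: notify again once the refire interval has elapsed.
    return now - last_fired >= refire_every
```

With a rule like this, an alert that stays in the alarm state keeps paging at a fixed interval instead of firing once and going quiet, which helps when the first notification is missed.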

Updated
Mar 13 at 04:29pm PDT

The incident was resolved at 1:30pm PT.

One of the two IPs behind sip.vapi.ai was failing to connect to an internal service, resulting in 480 errors.

Created
Mar 13 at 04:18pm PDT

Intermittent "480 temporarily unavailable" errors while connecting calls to sip.vapi.ai.
The errors started at 7am PT.