SIP call failures to connect
Resolved
Apr 07 at 09:56pm PDT
RCA for SIP Degradation for sip.vapi.ai
TL;DR
The Vapi SIP service (sip.vapi.ai) was intermittently throwing errors and failing to connect calls. The root cause was major flaws in our SIP infrastructure, which we resolved by rearchitecting it from scratch.
Impact
- Calls to Vapi SIP URIs or Vapi phone numbers were failing to connect with 480/487/503 errors.
- Inbound calls to Vapi were connecting but producing no audio, eventually causing silence timeouts or customer-did-not-answer outcomes.
- Outbound calls from Vapi numbers or custom SIP trunks were mostly unimpacted by the migration, but rate limiting we added recently could have caused 429 errors that failed Vapi call creation.
- Around 1% of calls were failing intermittently, with the failure rate briefly spiking to 10% at times.
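For callers hitting the 429s mentioned above, a reasonable client-side mitigation is to retry rate-limited call-creation requests with capped exponential backoff. The sketch below is illustrative only; the retryable status set and retry budget are assumptions, not Vapi's documented policy.

```python
import random

# Which statuses are transient enough to retry. 429 = rate limited,
# 503 = service temporarily unavailable. (Assumed policy, not Vapi-specified.)
RETRYABLE = {429, 503}

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before retry `attempt` (zero-based).

    Exponential growth capped at `cap`, with full jitter so many clients
    retrying at once don't synchronise into thundering herds.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(status_code: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only transient statuses, and only up to a fixed attempt budget."""
    return status_code in RETRYABLE and attempt < max_attempts
```

A caller would loop: issue the request, and if `should_retry(status, attempt)` is true, sleep for `backoff_delay(attempt)` before trying again.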
Root Cause
- To scale out our SIP infrastructure, Vapi moved to a Kubernetes-based SIP deployment in mid-January.
- SIP networking in Kubernetes proved complex to get right. We released multiple fixes through February and mid-March and operated the service at a satisfactory level, but with intermittent failures.
- Periods of degraded experience during this time were specifically due to networking errors between components of our SIP infrastructure.
- Most of the time we were able to resolve issues as they occurred by restarting services, releasing patches, blocking malicious traffic, scaling out further, etc.
- By mid-March we realised that the Kubernetes deployment was not going to be stable and started designing a new infrastructure for SIP.
- We started migrating SIP to a more stable autoscaling-group-based deployment on March 31st and continued over the next day or two.
- The team monitored the new deployment very closely and kept releasing patches for every small failure we saw.
- The new deployment has been looking great so far.
What went poorly?
- We took too long to decide to pull the plug on our Kubernetes deployment.
- Users were impacted intermittently, and SIP reliability was not at the level we aspire to.
Remediations
- The SIP infrastructure was revamped to an autoscaling-group-based deployment, which is more stable.
- Audit each error case and apply immediate fixes where needed.
- Add better monitoring and telemetry across the SIP infrastructure to make sure we catch issues and act on them preemptively.
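As a minimal sketch of the kind of telemetry check described in the last remediation, monitoring can parse SIP response status lines and flag the failure classes seen in this incident (480/487/503). The code/reason mapping follows RFC 3261; the function names are illustrative, not Vapi internals.

```python
# Failure classes surfaced during this incident, with RFC 3261 reason phrases.
FAILURE_REASONS = {
    480: "Temporarily Unavailable",
    487: "Request Terminated",
    503: "Service Unavailable",
}

def parse_sip_status(response: str) -> tuple[int, str]:
    """Parse the status line of a SIP response.

    Example input first line: 'SIP/2.0 480 Temporarily Unavailable'.
    Returns (status_code, reason_phrase).
    """
    proto, code, reason = response.splitlines()[0].split(" ", 2)
    if proto != "SIP/2.0":
        raise ValueError(f"not a SIP response: {proto!r}")
    return int(code), reason

def is_call_failure(code: int) -> bool:
    """True for the failure classes this incident surfaced."""
    return code in FAILURE_REASONS
```

An alerting pipeline could count `is_call_failure` hits per minute and page when the rate crosses a threshold, catching issues like the 480 spikes below before users report them.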
Affected services
Vapi SIP
Updated
Apr 07 at 03:48pm PDT
The SIP infrastructure has been upgraded on our side. So far we are seeing good performance from it.
Affected services
Vapi SIP
Updated
Mar 17 at 02:30pm PDT
Marking sip.vapi.ai as degraded instead of api.vapi.ai, as only the SIP component is currently impacted.
Affected services
Vapi SIP
Updated
Mar 14 at 06:23pm PDT
The SIP service has faced partial degradation multiple times in the last day. Things are looking stable now, but we are keeping the incident open until we roll out a major infra-level change that will solve it for good.
We apologise for the inconvenience and are working with urgency to solve the issue permanently.
Here's the timeline of the issue for today (in Pacific Time):
7:30am An SBC pod is unable to connect to the SBC inbound pod, resulting in 480 errors. Our monitoring picks it up.
8:30am The infra team is pulled in for remediation as the failures don't stop for a while.
8:43am The faulty SIP SBC pod is deleted and the service is restored.
9:43am The same issue pops up again; manual action is taken to restore the service each time.
More instances of the same issue pop up multiple times throughout the day:
11:10 - 11:17am
12:03pm - 12:09pm
1:04pm - 1:22pm
2:08pm - 2:37pm
Affected services
Vapi SIP
Updated
Mar 14 at 04:36pm PDT
We have released a temporary fix, and the issue hasn't recurred in the last 2 hours.
We are still working on a more permanent fix.
Affected services
Vapi SIP
Created
Mar 14 at 02:30pm PDT
sip.vapi.ai is not responding intermittently. We are investigating the failures and will be coming up with a fix soon.
Affected services
Vapi SIP