SIP is degraded
Resolved
Jun 30 at 09:43pm PDT
TLDR: A temporary slowdown caused by saturation in our API gateway layer increased response times until they exceeded the edge-network timeout, causing a 524 HTTP response for some API requests.
Timeline in PDT
01:00 AM First elevated 524 error responses detected
06:35 AM Rolled back recent backend release (no improvement)
07:19 AM Rolled back related network changes (no improvement)
08:22 AM Scaled up API gateway
09:36 AM Scaled up API gateway further
10:00 AM Reverted the previous night's SIP gateway update; error rate returned to normal
Impact
- Based on our telemetry, a total of 58,769 requests were affected.
- Distribution, grouped by request path:
  - /phone-number/status - 33,855
  - /phone-number/hook - 8,277
  - /phone-number/sip - 6,719
  - /phone-number/inbound - 3,836
  - 25 other endpoints - 6,082
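As a quick cross-check of the figures above, the per-path counts can be summed and converted to shares of the total. This is a minimal sketch: the counts are copied from the list, while the function name `share_by_path` is only illustrative.

```python
# Affected-request counts copied from the Impact list above.
affected = {
    "/phone-number/status": 33_855,
    "/phone-number/hook": 8_277,
    "/phone-number/sip": 6_719,
    "/phone-number/inbound": 3_836,
    "25 other endpoints": 6_082,
}


def share_by_path(counts):
    """Return each path's share of all affected requests, in percent."""
    total = sum(counts.values())
    return {path: round(100 * n / total, 1) for path, n in counts.items()}


total = sum(affected.values())  # matches the reported total of 58,769
shares = share_by_path(affected)

for path, pct in shares.items():
    print(f"{path:<24} {affected[path]:>7,}  {pct:>5.1f}%")
```

The counts add up to the reported 58,769, and /phone-number/status alone accounts for a little over half of the affected requests.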
What went poorly?
- Delayed root-cause isolation. Initial rollbacks focused on application and network layers, but the underlying issue originated elsewhere, leading to a longer mitigation window.
- Saturation metrics for the API gateway layer were not being tracked, which slowed down error diagnosis.
- Reverting changes to our SIP gateway is not a swift process, unlike rolling back our clusters.
- The on-call engineer should have escalated the issue sooner.
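To illustrate the kind of saturation metric that was missing, here is a minimal sketch that tracks in-flight requests against a fixed capacity. It is not our gateway's actual instrumentation: the class name `SaturationGauge`, the capacity, and the alert threshold are all hypothetical.

```python
import threading
from contextlib import contextmanager


class SaturationGauge:
    """Tracks in-flight requests against a fixed capacity (hypothetical example)."""

    def __init__(self, capacity: int, alert_threshold: float = 0.8):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.alert_threshold = alert_threshold
        self._in_flight = 0
        self._lock = threading.Lock()

    @contextmanager
    def track(self):
        """Count a request as in flight for the duration of the with-block."""
        with self._lock:
            self._in_flight += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_flight -= 1

    @property
    def saturation(self) -> float:
        """Fraction of capacity currently in use (0.0 means idle)."""
        with self._lock:
            return self._in_flight / self.capacity

    def should_alert(self) -> bool:
        """True once utilisation reaches the alert threshold."""
        return self.saturation >= self.alert_threshold


# Usage: wrap each request handler in gauge.track() and export gauge.saturation.
gauge = SaturationGauge(capacity=4, alert_threshold=0.75)
with gauge.track(), gauge.track(), gauge.track():
    peak = gauge.saturation        # three of four slots busy
    alerting = gauge.should_alert()
idle = gauge.saturation            # back to zero once the requests finish
```

Alerting on a ratio like this, rather than on error rate alone, would flag a gateway approaching its limit before response times cross the edge timeout.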
What went well?
- Only SIP calls saw degradation; other customer traffic remained largely unaffected.
Remediations
- [x] Increase observability in the API gateway, specifically saturation metrics
- [x] Blue-green deployments for our SIP gateway for quicker change reversion
- [ ] Collaborate with our SIP gateway provider to investigate potential issues on the SIP gateway end
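The blue-green remediation above can be sketched as follows. This is a simplified model, not our deployment tooling: `BlueGreenRouter` and the version strings are hypothetical. The point it shows is that the previous version stays deployed in the idle slot, so a rollback is a single traffic switch rather than a redeploy.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """Keeps two deployment slots warm and sends traffic to one of them."""

    blue_version: str
    green_version: str
    live: str = "blue"  # slot currently receiving traffic

    def idle(self) -> str:
        """Name of the slot that is not receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def live_version(self) -> str:
        """Version running in the slot that receives traffic."""
        return self.blue_version if self.live == "blue" else self.green_version

    def deploy_to_idle(self, version: str) -> None:
        """Install a new version on the idle slot; live traffic is untouched."""
        if self.idle() == "blue":
            self.blue_version = version
        else:
            self.green_version = version

    def switch(self) -> None:
        """Point traffic at the other slot. Calling it again rolls back."""
        self.live = self.idle()


router = BlueGreenRouter(blue_version="gateway-v1", green_version="gateway-v1")
router.deploy_to_idle("gateway-v2")   # stage the update on the idle slot
router.switch()                       # release: traffic moves to gateway-v2
after_release = router.live_version()
router.switch()                       # rollback: traffic returns to gateway-v1
after_rollback = router.live_version()
```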
If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5
Affected services
Vapi SIP
Updated
Jun 25 at 10:28am PDT
We rolled back a version change to our SIP infrastructure earlier today at around 10am PT and have seen stability since then.
We will update here with a more complete timeline and RCA tomorrow.
Updated
Jun 25 at 10:17am PDT
The issue has come up again. We are working with our SIP infrastructure provider to resolve it.
Updated
Jun 25 at 07:35am PDT
We cut over to a previous deployment and are seeing improvement. We are continuing to monitor and will provide an RCA later today.
Created
Jun 25 at 06:04am PDT
We are investigating an issue with our SIP gateway. We will update this thread with more information.