API returning 413 (payload too large) due to networking misconfiguration
Feb 21 at 11:24am PST
Affected services
Vapi API
Resolved
Feb 21 at 11:24am PST
TL;DR
A change to the cluster-router networking filter caused an increase in 413 (request entity too large) errors. POST requests to /call, /assistant, and /file were impacted.
Timeline
- February 20th 9:54pm PST: A change to the cluster-router is released and traffic is cut over to prod1.
- 10:19pm PST: 413 responses from Cloudflare begin appearing in increased numbers in Datadog logs.
- February 21st ~8:50am PST: Users in Discord flag requests failing with 413 errors.
- 9:58am PST: The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside.
Impact
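- Between roughly 10:19pm PST on February 20th and 9:58am PST on February 21st, POST requests to /call, /assistant, and /file whose bodies exceeded the filter's limit of approximately 65 KB failed with a 413 error code.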
Root Cause
- A change in the cluster-router filter added buffering of POST requests for all endpoints (previously only applied to /status, /inbound, and /inbound_call).
- The Envoy filter was configured with a stream window size of approximately 65 KB, so request bodies larger than that received a 413 response.
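The failure mode described above can be modeled in a few lines. This is a minimal sketch rather than the actual Envoy configuration or code: the function name, the `buffer_all_posts` flag, and the exact limit of 65,535 bytes are assumptions made for illustration (the report only says "approximately 65 KB").

```python
# Hypothetical model of a request-buffering filter; not Envoy's real implementation.
MAX_REQUEST_BYTES = 65_535  # assumed value; the report says "approximately 65 KB"

# Endpoints that were already buffered before the change, per the report.
PREVIOUSLY_BUFFERED_PATHS = {"/status", "/inbound", "/inbound_call"}


def filter_status(method: str, path: str, body_size: int, buffer_all_posts: bool) -> int:
    """Return 413 when a buffered request body exceeds the limit, else 200."""
    buffered = method == "POST" and (
        buffer_all_posts or path in PREVIOUSLY_BUFFERED_PATHS
    )
    if buffered and body_size > MAX_REQUEST_BYTES:
        return 413  # body does not fit in the buffer
    return 200  # request passes through to the upstream service


# Before the change: a large POST to /assistant is not buffered and passes.
before = filter_status("POST", "/assistant", 100_000, buffer_all_posts=False)
# After the change: the same request is buffered and rejected.
after = filter_status("POST", "/assistant", 100_000, buffer_all_posts=True)
```

In this model, small request bodies succeed under either configuration, which would explain why only a subset of POST traffic failed.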
Changes we've made
- Added a monitor to catch 4xx and 5xx errors from Cloudflare.
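As an illustration of what such a monitor evaluates, the sketch below computes the share of 4xx and 5xx responses in a window of status codes and flags the window when either share crosses a threshold. It is not the Datadog monitor itself: the function names and the 5% threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of 4xx and 5xx responses in the window."""
    codes = list(status_codes)
    if not codes:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group by status class: 413 // 100 == 4, 502 // 100 == 5, and so on.
    classes = Counter(code // 100 for code in codes)
    total = len(codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


def should_alert(status_codes: Iterable[int], threshold: float = 0.05) -> bool:
    """Alert when either error class exceeds the (hypothetical) threshold."""
    return any(rate > threshold for rate in error_rates(status_codes).values())


# A window in which 10% of responses are 413s trips the alert.
window = [200] * 90 + [413] * 10
alert = should_alert(window)
```

A real monitor would more likely compare the rate against a baseline instead of a fixed threshold, since some level of 4xx traffic is normal for a public API.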
Changes we will make
- Improve change testing for the networking cluster.
- Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch.
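One common way to implement a percentage-based cutover is to hash a stable request attribute into a bucket from 0 to 99 and send only the buckets below the rollout percentage to the new deployment. The sketch below shows that idea under stated assumptions: the function name, the use of a request ID as the hash key, and the labels "new" and "previous" are all hypothetical and are not taken from the cluster-router.

```python
import hashlib


def pick_cluster(request_id: str, new_cluster_percent: int) -> str:
    """Route a request to the "new" or "previous" deployment by hash bucket."""
    if not 0 <= new_cluster_percent <= 100:
        raise ValueError("new_cluster_percent must be between 0 and 100")
    # Hash the ID so the same request key always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < new_cluster_percent else "previous"


# Simulate a 10% rollout over 10,000 hypothetical request IDs.
ids = [f"req-{i}" for i in range(10_000)]
share_on_new = sum(pick_cluster(i, 10) == "new" for i in ids) / len(ids)
```

With this approach, a misconfiguration like the buffering change would affect only the chosen slice of traffic, and setting the percentage back to 0 acts as the rollback.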
What went well
- The cause was identified quickly by investigating changes in Cloudflare responses.
What went poorly
- There was a roughly 12-hour delay between the onset of the 413 errors and remediation because no alert existed for this error.
- The issue was initially flagged by the Discord community rather than through internal monitoring.