API returning 413 (payload too large) due to networking misconfiguration

Affected services
Vapi API

Resolved
Feb 21 at 11:24am PST

TL;DR

A change to the cluster-router's networking filter caused an increase in 413 (Request Entity Too Large) errors. POST requests to /call, /assistant, and /file were impacted.

Timeline

  1. February 20th 9:54pm PST: A change to the cluster-router is released and traffic is cut over to prod1.
  2. 10:19pm PST: An increase in 413 responses from Cloudflare begins appearing in Datadog logs.
  3. February 21st ~8:50am PST: Users in Discord flag requests failing with 413 errors.
  4. 9:58am PST: The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside.

Impact

  • During the impact window, POST requests to /call, /assistant, and /file with bodies larger than roughly 65 KB failed with a 413 error code.

Root Cause

  • A change to the cluster-router filter enabled buffering of POST request bodies on all endpoints; previously, buffering applied only to /status, /inbound, and /inbound_call.
  • The Envoy filter was configured with a stream window size of approximately 65 KB, so request bodies larger than that received a 413 response; see the sketch below.
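To make the failure mode concrete, here is a minimal sketch of an Envoy buffering limit in that range. The incident involved a stream window size; the buffer filter's max_request_bytes shown here produces the same 413 symptom and is used purely for illustration. This assumes the cluster-router is Envoy-based, and the names and values are not Vapi's actual configuration:

```yaml
# Illustrative Envoy filter chain fragment (not Vapi's real config).
http_filters:
  - name: envoy.filters.http.buffer
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
      # With buffering enabled, any request body larger than this limit
      # is rejected with 413 Request Entity Too Large.
      max_request_bytes: 65536   # ~65 KB, in line with the limit described above
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Applying a limit like this to every endpoint, rather than only to small-bodied routes such as /status and /inbound, is enough to turn previously healthy requests into 413s.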

Changes we've made

  • Added a monitor to catch 4xx and 5xx errors from Cloudflare; a sketch of a possible alert query follows below.
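Since the timeline mentions Datadog, a log monitor along these lines could page on a spike in Cloudflare error responses. The source tag, facet name, window, and threshold here are assumptions for illustration, not Vapi's actual monitor:

```
logs("source:cloudflare @http.status_code:[400 TO 599]").index("*").rollup("count").last("5m") > 50
```

Alerting on a count over a short rolling window catches a sudden error spike like this one within minutes rather than hours.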

Changes we will make

  • Improve change testing for the networking cluster.
  • Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch (see the sketch below).
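For reference, a percentage-based cutover can be expressed in Envoy with weighted clusters. Again assuming the cluster-router is Envoy-based, with illustrative cluster names and weights:

```yaml
# Illustrative Envoy route fragment (not Vapi's real config): send 5% of
# traffic to the new deployment and keep 95% on the known-good one.
routes:
  - match: { prefix: "/" }
    route:
      weighted_clusters:
        clusters:
          - name: prod1_new     # deployment carrying the change
            weight: 5
          - name: prod0_stable  # previous known-good deployment
            weight: 95
```

Ramping the new cluster's weight up in stages limits the blast radius of a bad change to a small fraction of requests.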

What went well

  • The cause was identified quickly by investigating changes in Cloudflare responses.

What went poorly

  • There was a roughly 12-hour delay between the onset of the 413 errors and remediation, because no alert existed for this class of error.
  • The issue was initially flagged by the Discord community rather than through internal monitoring.

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5