API returning 413 (payload too large) due to networking misconfiguration

Affected services
Vapi API

Resolved
Feb 21 at 11:24am PST

TL;DR

A change to the cluster-router's networking filter caused an increase in 413 (Request Entity Too Large) errors. POST requests to /call, /assistant, and /file were impacted.

Timeline

  1. February 20th 9:54pm PST: A change to the cluster-router is released and traffic is cut over to prod1.
  2. 10:19pm PST: An increase in 413 responses from Cloudflare begins appearing in Datadog logs.
  3. February 21st ~8:50am PST: Users in Discord flag requests failing with 413 errors.
  4. 9:58am PST: The IR team rolls back the networking cluster to the previous deployment without the filter change; service is restored and the 413 errors subside.

Impact

  • During the impact window, POST requests to /call, /assistant, and /file with bodies larger than roughly 65 KB failed with a 413 error code.

Root Cause

  • A change to the cluster-router filter enabled buffering of POST request bodies on all endpoints; previously, buffering applied only to /status, /inbound, and /inbound_call.
  • The Envoy filter was configured with a stream window size of approximately 65 KB, so request bodies larger than that received a 413 response; see the sketch below.
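To make the failure mode concrete, here is a minimal sketch of an Envoy buffering limit in that range. The incident involved a stream window size; the buffer filter's max_request_bytes shown here produces the same 413 symptom and is used purely for illustration. This assumes the cluster-router is Envoy-based, and the names and values are not Vapi's actual configuration:

```yaml
# Illustrative Envoy filter chain fragment (not Vapi's real config).
http_filters:
  - name: envoy.filters.http.buffer
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
      # With buffering enabled, any request body larger than this limit
      # is rejected with 413 Request Entity Too Large.
      max_request_bytes: 65536   # ~65 KB, in line with the limit described above
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Applying a limit like this to every endpoint, rather than only to small-bodied routes such as /status and /inbound, is enough to turn previously healthy requests into 413s.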

Changes we've made

  • Added a monitor to catch 4xx and 5xx errors from Cloudflare; a sketch of a possible alert query follows below.
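Since the timeline mentions Datadog, a log monitor along these lines could page on a spike in Cloudflare error responses. The source tag, facet name, window, and threshold here are assumptions for illustration, not Vapi's actual monitor:

```
logs("source:cloudflare @http.status_code:[400 TO 599]").index("*").rollup("count").last("5m") > 50
```

Alerting on a count over a short rolling window catches a sudden error spike like this one within minutes rather than hours.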

Changes we will make

  • Improve change testing for the networking cluster.
  • Implement a percentage-based cutover of traffic for networking rollouts instead of a 100% switch (see the sketch below).
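For reference, a percentage-based cutover can be expressed in Envoy with weighted clusters. Again assuming the cluster-router is Envoy-based, with illustrative cluster names and weights:

```yaml
# Illustrative Envoy route fragment (not Vapi's real config): send 5% of
# traffic to the new deployment and keep 95% on the known-good one.
routes:
  - match: { prefix: "/" }
    route:
      weighted_clusters:
        clusters:
          - name: prod1_new     # deployment carrying the change
            weight: 5
          - name: prod0_stable  # previous known-good deployment
            weight: 95
```

Ramping the new cluster's weight up in stages limits the blast radius of a bad change to a small fraction of requests.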

What went well

  • The cause was identified quickly by investigating changes in Cloudflare responses.

What went poorly

  • There was a roughly 12-hour delay between the onset of the 413 errors and remediation, because no alert existed for this class of error.
  • The issue was initially flagged by the Discord community rather than through internal monitoring.

If working on realtime distributed systems excites you, consider applying: https://jobs.ashbyhq.com/vapi/295f5269-1bb5-4740-81fa-9716adc32ad5