WebRTC routing is becoming critical for voice AI, where audio stream continuity and minimal latency are essential. We analyze how reworking the routing layer changes system behavior under load.
The problem does not manifest immediately; it appears only once the system scales to global real-time traffic. In the classic WebRTC model of "one port per session," pressure builds on the infrastructure: a wide range of UDP ports is required, which is difficult to expose and secure in Kubernetes. At the same time, ICE and DTLS remain stateful, and each session requires a stable owner. At this stage, a third factor comes into play: global routing. The first hop begins to define the user experience, and extra milliseconds turn into pauses, interruptions, and a "choppy" dialogue.
The choice of architecture hinges on the form of the load. SFU (selective forwarding unit) works well for multiparty scenarios, where centralized management of streams, codecs, and RTCP is needed. However, in 1:1 interactions, where each RTT affects the feeling of a “live” conversation, SFU adds an unnecessary layer. In this context, a transceiver approach was chosen: WebRTC terminates at the edge service, after which media is converted to internal protocols for inference. This is a compromise solution. It removes WebRTC complexity from backend services but requires careful packet routing and state management at the edge.
Implementation began with a monolithic Go service based on Pion, which handled signaling and media termination. However, the “one port per session” model quickly became a bottleneck. The alternative — one port per server — solves the port issue but does not address routing in a distributed environment: the first packet may not reach the correct instance. The solution is to separate roles. A stateless relay is responsible for receiving packets and routing them, while a stateful transceiver remains the sole point of session ownership (ICE, DTLS, SRTP).
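The role separation can be sketched as a routing decision at the relay: known sources are forwarded without inspection, while an unknown source triggers a lookup that resolves the first packet to its owning transceiver. This is an illustrative sketch, not the article's actual code; the `route` function, its string-keyed session map, and the `resolve` callback (standing in for the ufrag lookup described later) are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// route is a hypothetical sketch of the stateless relay's decision:
// packets from an established client address are forwarded without
// inspection, while an unknown source must resolve (e.g. via the
// first STUN packet) to the transceiver that owns the session.
func route(src string, pkt []byte, sessions map[string]string,
	resolve func([]byte) (string, error)) (string, error) {
	if backend, ok := sessions[src]; ok {
		return backend, nil // hot path: no packet re-analysis
	}
	backend, err := resolve(pkt)
	if err != nil {
		return "", errors.New("unroutable first packet: " + err.Error())
	}
	sessions[src] = backend // lightweight in-memory session
	return backend, nil
}

func main() {
	sessions := map[string]string{}
	// Stand-in resolver; in the article's design this would be the
	// ufrag extraction from the first STUN packet.
	resolve := func(pkt []byte) (string, error) {
		return "transceiver-2:9000", nil
	}

	first, _ := route("203.0.113.5:40000", []byte{0x00, 0x01}, sessions, resolve)
	fmt.Println(first) // transceiver-2:9000

	// Second packet from the same source hits the cached session;
	// the resolver is never consulted.
	second, _ := route("203.0.113.5:40000", nil, sessions, nil)
	fmt.Println(second) // transceiver-2:9000
}
```

The point of the split is that the relay holds only this disposable mapping, while all heavyweight state (ICE, DTLS, SRTP) stays with the transceiver; losing a relay loses nothing that cannot be rebuilt from the next inbound packet.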
A key point is deterministic routing of the first packet. Here, ICE ufrag (username fragment) is used as a built-in routing mechanism. The ufrag encodes minimal information about the target transceiver. The relay reads the first STUN packet, extracts the ufrag, and directs the traffic to the appropriate backend. After this, a lightweight in-memory session is created, and subsequent RTP/RTCP packets follow the already established route without re-analysis. In the event of a relay failure, the route can be restored through the next STUN packet or via a cache (Redis), where the mapping of client IP:port → transceiver is stored.
This scheme changes the economics of the infrastructure. Instead of thousands of open UDP ports, a small fixed set of endpoints remains. This simplifies security and load balancing. Moreover, it enables global ingress through a distributed relay layer. A packet enters the network closer to the user, reducing latency, jitter, and packet loss before reaching the backbone. Meanwhile, signaling is geo-distributed separately, allowing the session to be tied to a specific cluster and reducing connection establishment time.
The results are primarily architectural in nature. Specific metrics are not provided, but a reduction in latency on the first hop and more stable behavior of real-time sessions are described. More importantly, backend services no longer need to be WebRTC peers. This simplifies the scaling of inference, orchestration, and speech pipelines. The system becomes closer to a typical microservice model, where WebRTC is confined to the edge layer.
The overall conclusion appears pragmatic. Complexity does not disappear — it shifts to a thin routing layer, where it is easier to control. Using a protocol-native mechanism (ICE ufrag) avoids external lookups and keeps routing on the packet path. As a result, the system maintains standard WebRTC behavior for clients, but operates internally according to rules that better align with high-load and cloud-native infrastructure.