B2B Engineering Insights & Architectural Teardowns

Latency-aware proxy vs DNS: how to balance S3 load

DNS round-robin stops balancing load once clients start caching responses. Agoda ran into this at the object storage level and moved balancing into a separate layer.

The problem surfaced as data workloads grew. S3-compatible endpoints relied on DNS round-robin to distribute traffic. In practice, clients cached DNS responses and kept hitting the same backend. The system lost balance: some nodes were overloaded while others sat idle. This is a classic case where DNS-level balancing can no longer be managed or observed.
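The imbalance is easy to reproduce in a toy model. The sketch below is not Agoda's setup: the backend names and request counts are invented, and it assumes a client that caches keeps its first DNS answer for its whole lifetime (real resolvers honor TTLs, so this is the worst case).

```python
from itertools import cycle


def simulate(backends, requests_per_client, cache_dns):
    """Count requests per backend when clients resolve through round-robin DNS.

    requests_per_client holds one request count per client (hypothetical numbers).
    cache_dns=True means a client resolves once and reuses that answer.
    """
    resolver = cycle(backends)  # round-robin DNS: each lookup returns the next address
    load = {backend: 0 for backend in backends}
    for request_count in requests_per_client:
        cached = None
        for _ in range(request_count):
            if cached is None or not cache_dns:
                cached = next(resolver)  # new lookup
            load[cached] += 1
    return load


backends = ["s3-a", "s3-b", "s3-c"]
clients = [900, 50, 50]  # one heavy client, two light ones

print(simulate(backends, clients, cache_dns=True))   # {'s3-a': 900, 's3-b': 50, 's3-c': 50}
print(simulate(backends, clients, cache_dns=False))  # {'s3-a': 334, 's3-b': 333, 's3-c': 333}
```

With caching, round-robin spreads clients across backends, not requests, so a single heavy client pins most of the load to one node.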

The solution was to take balancing out of DNS and move it to a managed layer. Agoda introduced Storefront, a reverse proxy that sits between services and the object storage. The proxy makes routing decisions based on the current state of the backends. It initially used a least-in-flight-requests algorithm, which was refined under real load by adding latency-aware scoring. This is a trade-off: more logic and state in the proxy, in exchange for predictable distribution and control.
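The difference between the two strategies can be shown in a few lines. Storefront's actual scoring formula is not disclosed, so the score used here, (in-flight + 1) multiplied by an exponentially weighted moving average of latency, is an assumption chosen for illustration, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    in_flight: int = 0             # requests currently being served
    ewma_latency_ms: float = 1.0   # smoothed recent latency

    def observe(self, latency_ms, alpha=0.2):
        """Fold a completed request's latency into the moving average."""
        self.ewma_latency_ms = alpha * latency_ms + (1 - alpha) * self.ewma_latency_ms

    def score(self):
        """Assumed score: lower is better; penalizes both queue depth and slowness."""
        return (self.in_flight + 1) * self.ewma_latency_ms


def pick_least_in_flight(backends):
    return min(backends, key=lambda b: b.in_flight)


def pick_latency_aware(backends):
    return min(backends, key=lambda b: b.score())


fast = Backend("fast", in_flight=3, ewma_latency_ms=10.0)   # score 40
slow = Backend("slow", in_flight=2, ewma_latency_ms=80.0)   # score 240

print(pick_least_in_flight([fast, slow]).name)  # slow
print(pick_latency_aware([fast, slow]).name)    # fast
```

Pure least-in-flight sends the next request to the degraded node because its queue is shorter. The latency-aware score avoids it, at the cost of the proxy having to track per-backend latency.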

The implementation is written in Rust on top of the Pingora framework. Beyond routing requests, the proxy also addresses operational issues:
– IO timeouts protect connection pools from clients that do not read responses completely
– traffic between data centers is isolated into separate backend pools
– HTTP Expect: 100-continue handling is optimized to reduce latency during uploads
– credential-less authentication is provided through Kubernetes pod identification
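The pool isolation item can be illustrated with a small model. This is one possible reading, not Storefront's design: it assumes requests are classified as local or cross-data-center and that each class draws connections from its own bounded pool. The names PoolSet and traffic_class and the limits are hypothetical.

```python
def traffic_class(client_dc, backend_dc):
    """Classify a request by whether it crosses a data center boundary."""
    return "local" if client_dc == backend_dc else "cross-dc"


class PoolSet:
    """Independent connection budgets, one per traffic class."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.in_use = {name: 0 for name in limits}

    def acquire(self, pool):
        """Take a slot from the named pool; False if that pool is exhausted."""
        if self.in_use[pool] >= self.limits[pool]:
            return False
        self.in_use[pool] += 1
        return True

    def release(self, pool):
        self.in_use[pool] -= 1


pools = PoolSet({"local": 2, "cross-dc": 1})

print(pools.acquire(traffic_class("dc1", "dc2")))  # True: first cross-dc slot
print(pools.acquire(traffic_class("dc1", "dc2")))  # False: cross-dc pool is full
print(pools.acquire(traffic_class("dc1", "dc1")))  # True: local pool is unaffected
```

The point of the separation is that slow or saturated cross-data-center traffic can exhaust only its own budget.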

The last point changes the access model. Services no longer manage credentials directly. Control is centralized in the proxy, reducing the risk of leaks and simplifying compliance. This shifts responsibility from applications to the infrastructure layer.
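A minimal sketch of what centralized access control could look like follows. The source does not describe how Storefront establishes pod identity or how it stores policy, so the sketch assumes the proxy already knows the caller's namespace and service account. The policy table, names, and bucket are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodIdentity:
    """Caller identity as the proxy sees it (assumed already verified)."""
    namespace: str
    service_account: str


# Hypothetical policy: which identity may touch which buckets.
POLICY = {
    PodIdentity("payments", "invoice-writer"): {"invoices"},
}


def is_allowed(identity, bucket):
    """Decide access in the proxy; the service itself holds no storage credentials."""
    return bucket in POLICY.get(identity, set())


caller = PodIdentity("payments", "invoice-writer")
print(is_allowed(caller, "invoices"))  # True
print(is_allowed(caller, "logs"))      # False
```

In such a model the storage credentials live only in the proxy, which uses them for the upstream request after the check passes.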

As a result, Storefront has become not just a proxy but also a point of access management and observability. Through OpenTelemetry, it exposes metrics on latency, load, traffic patterns, and S3 API usage. Specific numerical improvements are not disclosed, but the key architectural issue has been resolved: balancing is now deterministic and manageable rather than a side effect of DNS.
