
Time series storage at 50M samples per second

Time series storage under high load requires not only throughput but also degradation control. An analysis of a system that ingests 50M samples/sec and stores 2.5PB of data.

By the point when metrics stop being "observability" and turn into a stream that is hard to control, the system was handling about 1.3 billion active time series and ingesting up to 50 million samples per second. Transitioning from an external provider to an in-house backend exposed the fundamental challenge: storing and serving such volumes of data reliably. Without strict guardrails, an error in a single service could overload the ingestion or query layer. At this scale, degradation begins locally but quickly spreads throughout the system.

The key decision is a multi-tenant architecture with isolation and load control. A service, rather than a team, was chosen as the tenant unit: this reduces configuration churn and gives precise attribution of metric growth. On top of this, shuffle sharding was added, so each tenant writes to and reads from only a subset of nodes. This is a trade-off: resources are not utilized as efficiently, but the system gains predictable isolation and fault tolerance. Limits were also introduced at the tenant level: ingestion rate, number of series, and query parameters.
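A minimal sketch of how such shuffle sharding can work: the tenant ID seeds a deterministic shuffle of the node list, so each tenant always lands on the same small, stable subset. The node names, tenant IDs, and shard size below are illustrative assumptions, not the system's actual configuration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// shuffleShard deterministically picks shardSize nodes for a tenant out of
// the full node list, so each tenant touches only a small, stable subset.
func shuffleShard(tenantID string, nodes []string, shardSize int) []string {
	if shardSize >= len(nodes) {
		return nodes
	}
	// Seed a PRNG from the tenant ID so the subset is stable across calls.
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))

	// Shuffle a copy of the node list and take the first shardSize entries.
	shuffled := append([]string(nil), nodes...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	shard := shuffled[:shardSize]
	sort.Strings(shard)
	return shard
}

func main() {
	nodes := []string{"ingester-0", "ingester-1", "ingester-2", "ingester-3",
		"ingester-4", "ingester-5", "ingester-6", "ingester-7"}
	// Hypothetical tenants: one per service, as described in the article.
	fmt.Println(shuffleShard("checkout-service", nodes, 3))
	fmt.Println(shuffleShard("payments-service", nodes, 3))
}
```

Because the subset is a function of the tenant ID alone, two misbehaving tenants are unlikely to share many nodes, which is what bounds the blast radius of a noisy tenant.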

The implementation required a centralized control plane. It automatically onboards new services and applies configurations without manual operations. Some limits are set explicitly, while others are calculated. For example, the ingestion rate is derived from the limit on the number of series. This reduces the risk of inconsistent settings. For writes, benchmarking was conducted, and per-replica limits were set to manage load and scaling. For reads, the situation is more complex due to the variability of requests, so query sharding and separate limits on samples were applied. Critical queries (evaluation) were isolated from ad-hoc load.
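A sketch of how a control plane can derive dependent limits from a single explicitly set value, so settings cannot drift apart. The schema, the headroom factor, and the query cap here are assumptions for illustration, not the system's real values.

```go
package main

import "fmt"

// TenantLimits is a hypothetical per-tenant configuration as a control
// plane might compute it; field names and factors are illustrative.
type TenantLimits struct {
	MaxSeries        int     // explicitly set per tenant
	SamplesPerSecond float64 // derived from MaxSeries
	MaxQuerySamples  int     // cap on samples a single query may touch
}

// deriveLimits computes dependent limits from the explicitly set series
// limit, keeping the two settings consistent by construction.
func deriveLimits(maxSeries int, scrapeIntervalSec float64) TenantLimits {
	return TenantLimits{
		MaxSeries: maxSeries,
		// Each active series produces roughly one sample per scrape interval,
		// plus headroom for churn and bursts (the 1.5 factor is an assumption).
		SamplesPerSecond: float64(maxSeries) / scrapeIntervalSec * 1.5,
		MaxQuerySamples:  50_000_000, // illustrative cap for ad-hoc queries
	}
}

// perReplica splits a tenant-wide ingest limit across write replicas so each
// replica can enforce its share locally, without coordination.
func perReplica(limit TenantLimits, replicas int) float64 {
	return limit.SamplesPerSecond / float64(replicas)
}

func main() {
	l := deriveLimits(2_000_000, 15) // 2M series scraped every 15s
	fmt.Printf("tenant ingest limit: %.0f samples/s\n", l.SamplesPerSecond)
	fmt.Printf("per-replica limit (6 replicas): %.0f samples/s\n", perReplica(l, 6))
}
```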

Particular attention was given to fault tolerance. Stateful components were made zone-aware and distributed across three zones. This reduces the impact of zone-level failures and of operations such as rolling deploys. For large tenants, compaction is sharded to keep read latency under control. Autoscaling was added only on the read path, where the load is more variable.
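A simplified illustration of zone-aware placement, assuming three zones and one replica per zone; the node naming and the placement rule are hypothetical, the point is only that a zone outage or rolling deploy touches at most one of three replicas.

```go
package main

import "fmt"

// replicasAcrossZones places one replica of a series shard in each zone, so
// losing a single zone (or rolling it) leaves the other replicas healthy.
func replicasAcrossZones(shard int, zones map[string][]string) []string {
	var placement []string
	for zone, nodes := range zones {
		// Pick a node deterministically within the zone for this shard.
		node := nodes[shard%len(nodes)]
		placement = append(placement, zone+"/"+node)
	}
	return placement
}

func main() {
	zones := map[string][]string{
		"zone-a": {"ingester-a-0", "ingester-a-1"},
		"zone-b": {"ingester-b-0", "ingester-b-1"},
		"zone-c": {"ingester-c-0", "ingester-c-1"},
	}
	fmt.Println(replicasAcrossZones(7, zones))
}
```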

After the single-cluster architecture was stabilized, the blast radius of one large cluster became the limiting factor. The solution was a transition to a multi-cluster setup: tenants are distributed among clusters, and specialized workloads (e.g., infrastructure metrics) are isolated into their own clusters. This reduces the blast radius and provides flexibility across regions, but it introduces a new complexity: managing tenant placement and configuration consistency.

This problem was addressed through tooling that maintains the mapping of tenant → cluster and serves as a source of truth. The deployment of stateful services was automated through Kubernetes operators, which eliminated manual rollout processes and reduced configuration drift. For cross-cluster queries, Promxy was used with enhancements: fanout optimization and support for additional data types. This allowed for a unified query layer over a distributed architecture.
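A rough sketch of what such a tenant directory can look like: a single source of truth for tenant placement that also tells the unified query layer which clusters a cross-cluster query must fan out to. The cluster names and the in-memory map are assumptions for illustration; a real directory would be backed by versioned configuration rather than process memory.

```go
package main

import "fmt"

// TenantDirectory is a hypothetical source of truth for tenant placement.
type TenantDirectory struct {
	clusterByTenant map[string]string
}

// ClusterFor returns the cluster that owns a tenant's data, falling back to
// a default cluster for tenants that have not been placed explicitly.
func (d *TenantDirectory) ClusterFor(tenant string) string {
	if c, ok := d.clusterByTenant[tenant]; ok {
		return c
	}
	return "metrics-cluster-default"
}

// FanoutTargets lists every cluster a cross-cluster query must reach,
// which is what a Promxy-style unified query layer needs to know.
func (d *TenantDirectory) FanoutTargets() []string {
	seen := map[string]bool{"metrics-cluster-default": true}
	targets := []string{"metrics-cluster-default"}
	for _, c := range d.clusterByTenant {
		if !seen[c] {
			seen[c] = true
			targets = append(targets, c)
		}
	}
	return targets
}

func main() {
	dir := &TenantDirectory{clusterByTenant: map[string]string{
		"checkout-service": "metrics-cluster-1",
		"infra-node-stats": "metrics-cluster-infra", // specialized load kept separate
	}}
	fmt.Println(dir.ClusterFor("checkout-service"))
	fmt.Println(dir.FanoutTargets())
}
```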

The result is a system that scales through isolation rather than just horizontal expansion. The impact of "noisy" tenants is reduced, load becomes more predictable, and fault tolerance improves. Exact improvement figures are not given, but the architectural changes are clearly aimed at controlling the blast radius and keeping the system stable under high load.

Ultimately, the key principles remain simple engineering ones: isolation, automation, limit control, and separation of critical paths. At this scale, observability is no longer about collecting metrics but about managing a system that can itself easily become a source of incidents.
