Tracing in the actor model without degradation through Envelope
In actor systems, there is no built-in channel for trace context. Discord solved this without changing the architecture and without stopping production.
Highload on ThecoreGrid focuses on designing and operating systems that handle massive scale, traffic, and data under strict reliability requirements.
We explore architectures and patterns for horizontal scaling, load distribution, fault tolerance, and performance optimization in distributed environments. Topics include sharding, replication, caching strategies, queueing systems, backpressure handling, and latency reduction under peak load. We analyze real-world trade-offs between consistency, availability, and cost, along with failure scenarios and recovery strategies. Content is grounded in BigTech practices, including incident post-mortems and lessons from operating systems at global scale. You’ll find deep dives into infrastructure behavior, traffic management, autoscaling, and resilience engineering. Instead of simplified guides, the Highload tag delivers practical engineering insights for backend engineers, architects, platform teams, and SREs responsible for building and maintaining systems that must perform reliably under extreme demand.
DNS round-robin stops working under load when clients start caching responses. Agoda faced this issue at the object storage level and moved the balancing to a separate layer. The problem manifested during the increase in data workloads. S3-compatible endpoints used DNS round-robin to distribute traffic. In practice, clients cached DNS responses and continued to hit …
Request timeouts do not always indicate a problem in the database. Often, degradation is hidden in the path between the application and the DB. The problem manifests when database metrics appear stable, but clients experience timeouts. At the observation level, this looks like a contradiction: latency increases while database time remains the same. The reason …
A long restart of a stateful service rarely looks like a security configuration issue. Yet this is how a safe default in Kubernetes turned into 30 minutes of downtime on every restart. The problem manifested at scale. Atlantis, which manages Terraform through GitLab merge requests, operates as a singleton StatefulSet and stores state in a …
The profiler in kernel space only sees addresses. Useful insights emerge only after symbolization, and in Go this stage is structured differently than in other languages. The problem arises when the profile has already been collected, but it cannot be interpreted. The eBPF profiler captures stack traces at the kernel level and obtains a set of …
In live streaming, an error is not a degradation but an instant user-facing incident. Netflix addresses this by moving quality control and prioritization directly into the origin layer. The main limitation arises where VOD approaches stop working. In live, there is no time buffer: a segment must be encoded, delivered, and cached within seconds. Any …
Agent-based systems are not limited by prompts, but rather by the economics and infrastructure of inference. Cloudflare is attempting to bridge this gap by integrating large open-source models directly into its edge platform.