In actor systems, there is no built-in channel for trace context. Discord solved this without changing the architecture and without stopping production.
The problem manifests at the boundary of the programming model. In HTTP, trace context travels in headers; in the Elixir actor system, messages are arbitrary terms with no metadata slot. Standard OpenTelemetry works within a service but loses continuity between processes. At Discord's scale this means blind spots: millions of concurrent users, fanout to thousands of recipients, and no end-to-end traces. A naive attempt to add tracing runs into CPU and data-volume limits.
The solution is not to break the model but to wrap it. The team introduced an Envelope primitive: the message plus serialized trace context. The Transport library intercepts GenServer calls (call/cast) and attaches context automatically. On the receiving side, a single normalization point handles both "old" bare messages and new Envelope-wrapped ones. This is the key trade-off: minimal intrusion into application code in exchange for supporting a dual format during the migration. In return, changes can be rolled out without restarts or a synchronized cluster-wide update.
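The envelope-plus-normalization idea can be sketched language-agnostically. The real implementation is in Elixir; the names `Envelope` and `normalize` and the context encoding below are illustrative assumptions, not Discord's actual API:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Envelope:
    """Hypothetical wrapper: the original message plus serialized trace context."""
    payload: Any
    trace_context: Optional[bytes]  # e.g. a serialized traceparent-style string

def normalize(message: Any) -> Envelope:
    """Single normalization point at the receiver. Accepts both legacy bare
    messages and new Envelope-wrapped ones, so both formats can coexist
    during a rolling migration with no synchronized cutover."""
    if isinstance(message, Envelope):
        return message
    # Legacy sender: no trace context attached.
    return Envelope(payload=message, trace_context=None)

# A legacy sender and a migrated sender can coexist on the same receiver:
legacy = normalize({"op": "MESSAGE_CREATE"})
wrapped = normalize(Envelope({"op": "MESSAGE_CREATE"}, b"00-abc-01"))
```

Because every receiver funnels through one `normalize`, the dual format is contained in a single place rather than scattered across handlers.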
The implementation hit two bottlenecks: fanout and the cost of (de)serialization. A broadcast in a guild with thousands of recipients turns a single request into an avalanche of spans. The team introduced dynamic sampling keyed to fanout size: 100% for single messages, 10% at ~100 recipients, and 0.1% at 10k+. On top of that, they cut CPU cost aggressively:
– Context is passed only for sampled operations. Unsampled messages are sent without trace context.
– The session service does not initiate new traces on fanout but continues existing ones. This reduced CPU load from approximately 55% to 45%.
– At the gRPC boundary with Python services, 75% of the time went to unpacking context. A fast filter was added: read the sampling flag without fully deserializing the context, and skip sending context altogether when it is not needed.
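Taken together, the tiered sampling and the cheap sampling-flag filter can be sketched as follows. This is a Python illustration of the technique only: the tier boundaries between the stated points, the function names, and the traceparent-style wire format are assumptions, not Discord's code:

```python
import random
from typing import Any, Optional, Tuple

def fanout_sample_rate(recipients: int) -> float:
    """Tiers stated in the article: 100% for single messages, 10% at ~100
    recipients, 0.1% at 10k+. Exact boundaries between those points are
    an assumption here."""
    if recipients <= 1:
        return 1.0
    if recipients < 10_000:
        return 0.10
    return 0.001

def is_sampled(ctx: bytes) -> bool:
    """Fast filter: inspect only the trailing trace-flags byte of a
    W3C-traceparent-style context ("00-<trace_id>-<span_id>-<flags>")
    instead of deserializing the whole structure."""
    return ctx[-2:] == b"01"

def maybe_attach_context(payload: Any, ctx: Optional[bytes],
                         recipients: int) -> Tuple[Any, Optional[bytes]]:
    """Attach context only when the incoming trace is sampled AND this
    fanout is sampled; unsampled messages go out bare, so downstream
    receivers skip the deserialization cost entirely."""
    if ctx and is_sampled(ctx) and random.random() < fanout_sample_rate(recipients):
        return (payload, ctx)
    return (payload, None)
```

The flag check costs a byte comparison instead of a full decode, which is the same shape of saving the article describes at the Python gRPC boundary.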
The result is observability that scales with load. The traces proved useful in real incidents: for example, they surfaced user connection delays of up to 16 minutes and a cascading effect on guild availability, problems that were invisible in metrics and logs. No quantitative SLA improvements are given, but qualitatively the system gained diagnostic signal where there was previously silence. The approach is pragmatic: keep the advantages of the actor model and add tracing through a wrapper with strict cost control.