B2B Engineering Insights & Architectural Teardowns

LLM Load Without Blind Spots: How to Bring Observability to the Routing Layer with OpenRouter and Grafana

When LLMs become part of production infrastructure, traditional monitoring is no longer sufficient. The bottleneck is no longer the application code but the routing and model-selection layer — and that is exactly where observability is needed.

In LLM systems, degradation doesn’t start with HTTP endpoint failures, but with the accumulation of subtle effects: increased latency on specific models, cost spikes due to routing, timeouts for particular prompts, or rate limits from the provider. If an application uses multiple models and providers, the traditional approach of logs and metrics inside the service quickly loses coherence. An engineer sees the symptoms but lacks the context for decision-making — which model handled the request, why a fallback occurred, and how much it cost.

OpenRouter has focused on moving observability to the infrastructure layer — where routing and load balancing between models occur. The key idea is not to force teams to manually instrument every call, but to generate tracing automatically at the API level. This reduces operational overhead and eliminates discrepancies between the system’s actual behavior and what is instrumented in the code. The trade-off is clear: less flexibility in custom telemetry, but significantly higher consistency and completeness of data.

The implementation is built around OpenTelemetry. The Broadcast feature in OpenRouter automatically creates a trace for each request and sends it to external systems, such as Grafana Cloud, via an OTLP endpoint. An important point: no SDK and no changes to application code are required; tracing is enabled at the account level. The trace carries attributes specific to LLM workloads — model, provider, token counts, cost, latency, as well as the prompt and completion (unless privacy mode is enabled). On the Grafana Cloud side, the data is ingested by Tempo and becomes queryable with TraceQL, so you can work with LLM traces just as you would with traces from any distributed system.
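To make the attribute set concrete, here is a minimal sketch of filtering exported spans the way a TraceQL query over Tempo might — say, "show me slow or failed completions." The attribute names, models, and values are illustrative assumptions, not OpenRouter's documented span schema:

```python
# Illustrative LLM trace spans, shaped roughly like OTLP data landing in Tempo.
# Attribute keys (llm.model, llm.cost_usd, ...) are assumptions for this sketch.
SPANS = [
    {"name": "chat.completion",
     "duration_ms": 1840,
     "attributes": {"llm.model": "gpt-4o", "llm.provider": "openai",
                    "llm.tokens.total": 912, "llm.cost_usd": 0.0137,
                    "http.status_code": 200}},
    {"name": "chat.completion",
     "duration_ms": 310,
     "attributes": {"llm.model": "claude-3-haiku", "llm.provider": "anthropic",
                    "llm.tokens.total": 240, "llm.cost_usd": 0.0004,
                    "http.status_code": 429}},  # rate-limited by the provider
]

def slow_or_failed(spans, latency_ms=1000):
    """Return spans that were slow or did not complete with HTTP 200."""
    return [s for s in spans
            if s["duration_ms"] > latency_ms
            or s["attributes"]["http.status_code"] != 200]

for s in slow_or_failed(SPANS):
    a = s["attributes"]
    print(a["llm.model"], a["llm.provider"], s["duration_ms"], a["http.status_code"])
```

In production you would express the same filter directly in TraceQL inside Grafana rather than post-processing spans in code; the point is that latency, status, cost, and model live on the same trace, so one query answers "what happened and why."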

The practical value is evident in several scenarios. First, cost transparency: you can break down expenses by model, API key, or user attributes without separate billing analytics. Second, latency management — comparing p50/p95/p99 across models provides a basis for SLA-based routing instead of guesswork. Third, debugging: the trace immediately shows whether it was a rate limit, provider error, or an issue in the request itself. Finally, planning — aggregated data on tokens and call frequency allows you to forecast load and revise your model selection strategy.
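The first two scenarios — cost breakdown by model and latency percentiles as a basis for routing — reduce to simple aggregations over the exported trace data. A sketch, where the records, field names, and numbers are illustrative stand-ins for data pulled from Tempo:

```python
from collections import defaultdict

# Illustrative per-request records derived from trace attributes.
RECORDS = [
    {"model": "gpt-4o",     "cost_usd": 0.0120, "latency_ms": 1400},
    {"model": "gpt-4o",     "cost_usd": 0.0150, "latency_ms": 1900},
    {"model": "mistral-7b", "cost_usd": 0.0004, "latency_ms": 300},
    {"model": "mistral-7b", "cost_usd": 0.0005, "latency_ms": 450},
]

def cost_by_model(records):
    """Total spend per model -- the cost-transparency view."""
    totals = defaultdict(float)
    for r in records:
        totals[r["model"]] += r["cost_usd"]
    return dict(totals)

def latency_percentiles(records, model):
    """p50/p95/p99 for one model, via nearest-rank (fine for a sketch;
    in practice Grafana computes this over Tempo data)."""
    lat = sorted(r["latency_ms"] for r in records if r["model"] == model)
    def pct(p):
        return lat[min(len(lat) - 1, int(p / 100 * len(lat)))]
    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

print(cost_by_model(RECORDS))
print(latency_percentiles(RECORDS, "gpt-4o"))
```

Comparing these percentiles across models is what turns routing from guesswork into an SLA-backed decision: if one model's p95 consistently violates the target, the router can prefer another.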

Metrics in the classical sense are secondary here — the value lies in the coherence of the data. Even without formalized KPIs, teams gain a holistic view of LLM load behavior: where money is spent, where time is lost, and where the system behaves unstably. This shifts observability from the level of “viewing charts” to the level of making architectural decisions.
