B2B Engineering Insights & Architectural Teardowns

LLM evaluation at scale on Apache Spark

LLM evaluation at scale becomes a bottleneck as datasets grow. Spark-LLM-Eval offers a distributed architecture focused on statistical rigor and cost control.

The problem of large-scale LLM evaluation does not show up immediately: it emerges once the set of examples exceeds tens of thousands. Standard LLM evaluation tools run on a single machine and assume a limited data volume. In production scenarios this assumption breaks down: rare cases surface, users must be segmented, and regression testing has to cover hundreds of thousands or millions of requests. The requirement for statistical rigor (confidence intervals, significance tests) adds further computational load and exacerbates the bottleneck.

In Spark-LLM-Eval, evaluation is treated as a data-parallel task. Each example is processed independently, and metric aggregation happens at the cluster level. The architecture is built around Apache Spark: a DataFrame of examples passes through stages of prompt preparation, distributed inference, and metric computation. A key element is the Pandas UDF, which processes batches and reduces overhead compared to row-wise UDFs. API rate limits are handled by a token bucket at the executor level, where each worker is allocated a share of the global limit (requests and tokens per minute, RPM/TPM). This prevents provider overload but introduces a trade-off: under uneven load, some executors may sit idle while their share of the budget goes unused.
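A minimal sketch of this batched evaluation path, under stated assumptions: `call_model` and `EXECUTOR_RPM` are hypothetical names, and the batch function is plain pandas here; in a real job it would be wrapped with `pyspark.sql.functions.pandas_udf(StringType())` so Spark feeds it Arrow batches.

```python
import time

import pandas as pd

# Hypothetical per-executor share of the global rate limit.
EXECUTOR_RPM = 100  # requests per minute allotted to this worker


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self, n: float = 1.0) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((n - self.tokens) / self.rate)


# One bucket per executor process; capacity > 1 allows small bursts.
bucket = TokenBucket(rate=EXECUTOR_RPM / 60.0, capacity=10.0)


def call_model(prompt: str) -> str:
    """Placeholder for the real provider API call (an assumption)."""
    return f"response-to:{prompt[:20]}"


def evaluate_batch(prompts: pd.Series) -> pd.Series:
    """Process a whole batch at once; with pandas_udf this amortizes
    serialization overhead compared to row-wise UDFs."""
    out = []
    for p in prompts:
        bucket.acquire()  # throttle against this executor's share of the limit
        out.append(call_model(p))
    return pd.Series(out)
```

Because the bucket lives in the executor process, each worker independently respects its allotted fraction of the global RPM, which is exactly why an idle executor cannot lend its unused budget to a busy one.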

The system demonstrates linear scaling up to the API limits. As the number of executors grows, throughput rises to ~9,800 examples per minute, after which the external API becomes the constraint. This yields up to a 21× speedup over sequential processing. Performance, however, depends not only on compute but also on API latency, making the system I/O-bound. For large datasets (10k+ examples), Spark's scheduling overhead becomes negligible, confirming the approach's applicability to high-load evaluation.

An additional layer is response caching via Delta Lake. The cache key is computed as a SHA-256 hash of the request parameters (prompt, model, temperature, etc.), which makes lookups deterministic. This simplifies reproducibility and decouples inference from metric calculation: in replay mode, API calls can be eliminated entirely and metrics recomputed on already-collected responses. The practical effect is a cost reduction of up to 75% and a time reduction of up to 69% in metric-iteration scenarios. The trade-off is clear: there is no semantic cache, so any change to the prompt produces a cache miss.
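The key derivation might look like the following sketch; the exact set of fields hashed by the framework is an assumption, but canonical JSON over sorted keys is the usual way to make such a hash deterministic:

```python
import hashlib
import json


def cache_key(prompt: str, model: str, temperature: float, **extra) -> str:
    """Deterministic cache key over request parameters (a sketch; the
    framework's exact field list is an assumption)."""
    payload = {"prompt": prompt, "model": model, "temperature": temperature, **extra}
    # sort_keys guarantees the same parameters always serialize identically.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The hash is exact, not semantic: even a one-character prompt edit yields an entirely different key, which is precisely the cache-miss behavior noted above.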

An interesting layer is the built-in statistics. Instead of point estimates, the system immediately computes bootstrap confidence intervals and selects significance tests based on the metric type: McNemar's test for binary metrics, paired t-test or Wilcoxon signed-rank for continuous ones. BCa bootstrap shows more accurate coverage on small samples (close to the nominal 95%) than percentile bootstrap or analytical methods. This matters because, at large data volumes, even tiny differences can be statistically significant yet practically meaningless, so effect size (e.g., Cohen's d) is computed as well.
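To make the idea concrete, here is a stdlib-only sketch of a percentile bootstrap CI and a paired Cohen's d. The paper's BCa variant additionally corrects for bias and skew (available off the shelf as `scipy.stats.bootstrap(..., method="BCa")`); the percentile form is shown for brevity.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean. BCa (used in the paper)
    would further adjust these quantiles for bias and acceleration."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def cohens_d_paired(a, b):
    """Effect size for paired scores: mean difference over SD of differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)
```

A narrow CI that excludes zero establishes statistical significance; a small Cohen's d on top of it is the signal that the difference, while real, may not be worth acting on.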

Metric support covers several classes: from simple lexical metrics (Exact Match, F1, BLEU) to semantic ones (BERTScore, embedding similarity) and LLM-as-judge. The latter is particularly sensitive to systematic biases: positional bias, a preference for long answers, and self-preference. The framework does not correct for these but makes the behavior observable. For RAG scenarios, metrics such as faithfulness and context relevance are included, allowing evaluation of the retrieval layer as well as the answer.
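The lexical end of that spectrum is simple enough to sketch directly. The snippet below implements SQuAD-style Exact Match and token-level F1; the framework's exact normalization rules are an assumption, but lowercasing plus punctuation stripping is the conventional choice.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (SQuAD-style)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))


def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, r = normalize(pred).split(), normalize(ref).split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Both functions are pure per-example scorers, which is what makes them trivially data-parallel: each row can be scored independently inside a batch UDF and aggregated afterward.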

For the industry, this is a pragmatic evolution of existing tools. The key idea is not new metrics but making existing ones applicable to real data volumes; Spark acts here as an infrastructure layer rather than an ML innovation. The main trade-offs remain: dependence on API limits, lack of adaptive rate limiting, and limited cache efficiency. Nevertheless, the approach fits well into existing data pipelines and can be integrated into CI/CD or regression testing for LLMs.

In practice, the system is most useful where quality trends matter more than a one-off benchmark: comparing models, tracking degradations, or testing prompt changes. The main takeaway is simple: LLM evaluation at scale is already an infrastructure and statistics problem, not just a model-quality question.

Information source

arXiv is the largest open preprint repository (operated under the auspices of Cornell since 1991), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org

View the original research PDF
