Agent-based systems are limited less by prompts than by the economics and infrastructure of inference. Cloudflare is attempting to close this gap by integrating large open-source models directly into its edge platform.
The problem becomes apparent when agent-based scenarios scale. A single agent can process hundreds of thousands of tokens per hour, and as the number of agents grows, inference cost becomes the primary bottleneck. The serverless model adds another factor: unpredictable resource availability and competition for GPUs. Technical complexity grows at the same time, since large models require optimizations (parallelism, memory layout, scheduling) without which throughput and latency degrade. A specific bottleneck is the prefill stage: with long contexts (up to 256k tokens), generation cannot begin until input processing completes, which inflates Time to First Token (TTFT).
The chosen solution is to integrate a large open-source model (Kimi K2.5) directly into Workers AI. This is a pragmatic compromise: reduce costs relative to proprietary models while keeping quality sufficient for production tasks. The key idea is not simply to “host the model,” but to embed it into the platform’s existing primitives (Durable Objects, Workflows, sandbox execution) so that the agent’s full lifecycle is covered. The trade-off is explicit: serverless offers flexibility and pay-per-token pricing, but requires complex orchestration and does not guarantee instant processing under load.
The implementation relies on optimizing the entire inference stack. Custom kernels run on top of Cloudflare’s proprietary Infire engine to improve GPU utilization, alongside industry-standard techniques: data, tensor, and expert parallelism, plus disaggregated prefill (splitting the prefill and generation stages across machines). A separate optimization layer is prefix caching, which reuses previously processed context between requests; this cuts prefill computation and directly improves TTFT and throughput (tokens per second). To raise the cache hit rate, a session affinity mechanism (the x-session-affinity header) directs related requests to the same model instance.
Additionally, the asynchronous execution model has been redesigned. Serverless inference is capacity-constrained, so synchronous requests may fail under overload. The new async API uses a pull-based queue: tasks are accepted immediately and executed when capacity becomes available, based on GPU utilization monitoring. This shifts the system toward an “eventual execution” model rather than strict latency SLAs, which suits non-interactive tasks (e.g., code analysis) but does not replace real-time scenarios.
Practical results are described through internal use cases: the model is used in development and in automated code review, and in one scenario an agent processes over 7 billion tokens per day while identifying vulnerabilities in code. The key benefit is cost: a 77% reduction is claimed compared to a proprietary model. However, exact metrics for latency, SLAs, or stability are not disclosed, leaving the question of predictability under load open.
Ultimately, Workers AI is moving toward a managed compromise: shifting the complexity of these optimizations into the platform so that the user works with an API rather than with GPU infrastructure. This lowers the barrier to entry but does not eliminate the fundamental limitations of serverless inference: competition for resources and latency variability.