NetAgentBench offers a state-centric approach to evaluating LLMs on network configuration, bridging the gap between static tests and the real behavior of live systems.
The core difficulty in evaluating AI agents for network configuration is not immediately apparent: it only surfaces when the agent confronts system state rather than configuration text. Most existing benchmarks rely on static verification: comparison against a "golden config" or plain text matching. This ignores a defining feature of networks: state and temporal dynamics. As a result, configurations that are correct on paper can fail on reapplication or during protocol convergence, while the models themselves exhibit behavior that one-shot tests cannot capture.
NetAgentBench addresses this by formalizing the process as an interaction of finite state machines (FSMs). The architecture has three layers: an infrastructure FSM for deterministic topology deployment, a SUT (System Under Test) FSM modeling a live network with state transitions, and a benchmark controller that orchestrates scenarios. The agent acts iteratively: it reads the state, applies commands, gathers observations, and repeats the cycle until it reaches the goal or hits a turn limit. A key design decision is modeling protocol convergence as event-driven transitions, which accounts for delays and intermediate unstable states.
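The iterative loop can be sketched as follows. This is a minimal illustration of the read-apply-observe cycle, not the benchmark's actual API: `ToyEnv` and the one-liner agent are toy stand-ins invented for this example.

```python
# Minimal sketch of the iterative agent loop: read state, apply commands,
# observe, repeat until the goal or the turn limit. Names are hypothetical.

class ToyEnv:
    """Tiny stand-in for the SUT FSM: state is a set of enabled interfaces."""
    def __init__(self, goal):
        self.state = set()
        self.goal = goal

    def read_state(self):
        return frozenset(self.state)

    def apply(self, commands):
        for cmd in commands:                 # e.g. "enable eth0"
            action, iface = cmd.split()
            if action == "enable":
                self.state.add(iface)

    def goal_reached(self, state):
        return self.goal <= state


def run_episode(agent_decide, env, max_turns=10):
    """Drive the agent against the environment; return (success, turns used)."""
    state = env.read_state()
    for turn in range(max_turns):
        commands = agent_decide(state)       # agent proposes commands
        env.apply(commands)                  # SUT FSM transitions
        state = env.read_state()             # fresh observation
        if env.goal_reached(state):
            return True, turn + 1
    return False, max_turns


# A trivial "agent" that enables one missing interface per turn.
goal = {"eth0", "eth1"}
env = ToyEnv(goal)
agent = lambda state: ["enable " + sorted(goal - state)[0]]
print(run_episode(agent, env))               # (True, 2)
```

The important structural point is that success is a property of a trajectory (a sequence of observations and transitions), not of a single emitted configuration.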
An interesting detail is the use of bounded convergence and a finite state space via abstraction. This tames the unbounded parameter space of real networks (timers, counters) and makes the benchmark deterministic. The metrics go beyond pass/fail: completeness, robustness via idempotency, and syntactic correctness are all evaluated. For example, a configuration may be judged successful yet fail the reapplication test, a typical scenario for automation with drift correction.
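The reapplication test can be illustrated in a few lines. This is a hedged sketch under invented names (`normalize`, `apply_append`, `apply_set`, the toy `device` dict), not the benchmark's implementation: apply a config, snapshot the abstracted state, apply it again, and compare.

```python
# Sketch of an idempotency ("reapplication") check. Unbounded fields such as
# timers and counters are projected away before comparison, mirroring the
# finite-state-space abstraction described above. All names are hypothetical.

def normalize(state):
    """Drop unbounded fields so states can be compared in a finite space."""
    return {k: v for k, v in state.items() if k not in {"uptime", "pkt_count"}}

def is_idempotent(apply_config, read_state, config):
    apply_config(config)
    first = normalize(read_state())
    apply_config(config)                     # reapply, as drift correction would
    second = normalize(read_state())
    return first == second

# Toy device: a flat dict of state; applying a config mutates it.
device = {"routes": [], "uptime": 0}

def read_state():
    device["uptime"] += 1                    # unbounded field, abstracted away
    return dict(device, routes=list(device["routes"]))

def apply_append(config):
    device["routes"].append(config)          # imperative "append": NOT idempotent

def apply_set(config):
    if config not in device["routes"]:
        device["routes"].append(config)      # declarative "ensure present": idempotent

print(is_idempotent(apply_append, read_state, "10.0.0.0/8 via r1"))  # False
print(is_idempotent(apply_set, read_state, "10.0.0.0/8 via r1"))     # True
```

The append-style config passes a one-shot check but fails on reapplication, which is exactly the class of error a static "golden config" comparison cannot see.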
Experiments show that it is precisely the dynamic model that exposes systematic failures. Even the best of the tested models completes only about 24% of tasks. As tasks grow more complex (from basic RIP to OSPF and BGP), success rates drop sharply, approaching zero. The main failure patterns are "exploration meltdown" (looping on the same commands), "coherence collapse" (destroying an already achieved state), and diagnostic stagnation. Crucially, these effects arise only in multi-turn scenarios and are invisible to static tests.
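To make the "exploration meltdown" pattern concrete, here is one simple way such looping could be flagged from a command log. The window size, threshold, and function name are arbitrary choices for illustration, not taken from the paper.

```python
# Illustrative detector for command looping ("exploration meltdown"): the agent
# re-issues the same fixed sequence of commands without making progress.
# Parameters are arbitrary; this is not the benchmark's detection logic.

def detect_command_loop(command_log, window=3, repeats=2):
    """True if the last `window` commands repeat `repeats` times in a row."""
    n = window * repeats
    if len(command_log) < n:
        return False
    tail = command_log[-n:]
    return all(tail[i] == tail[i % window] for i in range(n))

print(detect_command_loop(
    ["show run", "ping r2", "show run", "ping r2"], window=2))   # True
print(detect_command_loop(
    ["show run", "ping r2", "show ip route", "ping r2"], window=2))  # False
```

A multi-turn harness can observe a signal like this directly from the trajectory; a one-shot test never sees the repeated commands at all.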
For practitioners, this implies a shift in how AI is validated for infrastructure work. Configuration verification must consider not only the final state but also the trajectory by which it is reached. The FSM approach provides reproducibility and control over the environment, which is critical for SRE and platform engineering. Limitations remain: idealized observability and sensitivity to convergence timings. Still, the approach looks like a pragmatic step toward engineering-grade evaluation of agent systems rather than mere demonstration.
Information source
arXiv is the largest open preprint repository (run under the auspices of Cornell since 1991), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org