AutoB2G — a framework for automatic building-grid co-simulation driven by LLMs. We explore how a DAG-structured codebase and multi-agent orchestration reduce complexity and improve the accuracy of generated pipelines.
The problem arises at the intersection of two models: building simulation and energy network analysis. Most environments for RL control optimize metrics on the building side — cost, consumption peaks, comfort — while the impact on the grid remains poorly formalized. The second point of tension is the experimental process itself: it demands manual configuration, knowledge of the APIs, and the assembly of a pipeline from disparate modules. As scenario complexity grows, the engineer runs into dependency errors and misconfigurations.
AutoB2G addresses both issues through a combination of architectural solutions. The foundation is the extended CityLearn V2, supplemented with a network model on Pandapower. At each step, the aggregated load of buildings is sent to the power flow calculation, while network states (e.g., node voltages) are returned to the agent’s observations. On top of this, an LLM layer is added: the user specifies a task in natural language, and the system constructs an executable pipeline. A key element is the codebase organized as a DAG (directed acyclic graph), where dependencies and execution order of modules are explicitly defined. This limits the error space during generation.
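The coupling loop above can be sketched in a few lines. This is a minimal stdlib mock, not the real CityLearn or Pandapower APIs: `ToyBuildingEnv`, `run_power_flow`, and the linear voltage model are all illustrative stand-ins showing only the data flow (building loads out, grid state back into observations).

```python
# Minimal sketch of the building-grid co-simulation loop. CityLearn and
# Pandapower are replaced by toy stand-ins; all names here are
# illustrative, not part of either library's API.

def run_power_flow(total_load_kw: float) -> dict:
    """Toy stand-in for a power-flow call: voltage sags linearly
    with aggregate load (purely illustrative physics)."""
    return {"vm_pu": 1.0 - 0.00005 * total_load_kw}

class ToyBuildingEnv:
    """Stand-in for the extended CityLearn environment."""
    def __init__(self, base_loads_kw):
        self.base_loads_kw = base_loads_kw

    def step(self, actions):
        # Each action scales a building's load; return per-building loads.
        return [load * a for load, a in zip(self.base_loads_kw, actions)]

env = ToyBuildingEnv(base_loads_kw=[120.0, 80.0, 200.0])
obs_history = []
for t in range(3):
    loads = env.step(actions=[1.0, 1.0, 1.0])
    grid_state = run_power_flow(sum(loads))                  # buildings -> grid
    obs = {"loads_kw": loads, "vm_pu": grid_state["vm_pu"]}  # grid -> agent
    obs_history.append(obs)

print(obs_history[0]["vm_pu"])  # 0.98: node voltage now visible to the agent
```

In the real system the grid state would come from a Pandapower AC power flow over the network model, but the exchange pattern per timestep is the same.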
The generator itself is not a single LLM but a multi-agent system called SOCIA. Roles are divided: code generation, execution, validation, error analysis, and feedback. Iterations are built around the Textual Gradient Descent mechanism. Instead of numerical gradients, the system forms textual “gradients” — structured instructions on which constraints are violated and how to fix them. The code is then patched and re-validated. This brings the process closer to constraint-based optimization, where the goal is to bring the program into a feasible set without manual debugging.
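The Textual Gradient Descent cycle can be illustrated with a toy rule-based loop. The validator and patcher below stand in for SOCIA's LLM agents; the constraint list and step names are hypothetical, but the structure — validate, emit textual "gradients", patch, re-validate until the feasible set is reached — mirrors the mechanism described above.

```python
# Sketch of the Textual Gradient Descent loop: validate, turn violations
# into textual "gradients", patch, repeat. Rule-based stand-ins replace
# the LLM agents; all names are illustrative.

REQUIRED_CALLS = ["load_data", "train_agent", "run_power_flow"]

def validate(pipeline: list) -> list:
    """Return textual gradients: one instruction per violated constraint."""
    return [f"missing step: add '{c}' to the pipeline"
            for c in REQUIRED_CALLS if c not in pipeline]

def patch(pipeline: list, gradients: list) -> list:
    """Toy patcher: apply each textual gradient literally."""
    fixed = list(pipeline)
    for g in gradients:
        fixed.append(g.split("'")[1])  # extract the named step
    return fixed

pipeline = ["load_data"]               # initial (incomplete) generation
for iteration in range(5):
    gradients = validate(pipeline)
    if not gradients:                  # feasible set reached
        break
    pipeline = patch(pipeline, gradients)

print(pipeline)  # ['load_data', 'train_agent', 'run_power_flow']
```

The point of the structure is that each iteration consumes a concrete, checkable instruction rather than a scalar loss, which is what makes the loop auditable.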
An additional layer is agentic retrieval. Instead of passing the entire codebase into context, the agent selects relevant modules through the DAG. If the selected chain is incomplete, the validator returns dependency errors, and the agent refines the selection. This reduces noise and minimizes the risk of including unnecessary components. Experimentally, this shows up in the code score metric: at high task complexity, a baseline LLM drops to 0.44, while SOCIA with retrieval reaches 0.88. The success rate follows the same dynamic: for complex scenarios it rises from 0.53 to 0.83.
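DAG-guided retrieval amounts to taking the transitive dependency closure of the modules a task needs. A minimal sketch, with a hypothetical module graph (the names below are illustrative, not AutoB2G's actual modules):

```python
# Sketch of DAG-guided retrieval: instead of stuffing the whole codebase
# into the LLM context, select only the transitive dependency closure of
# the modules the task requires. The module graph is illustrative.

DAG = {  # module -> modules it depends on
    "grid_metrics": ["power_flow"],
    "power_flow":   ["net_model"],
    "train_agent":  ["env", "reward"],
    "reward":       ["grid_metrics"],
    "env":          ["net_model"],
    "net_model":    [],
    "plotting":     [],               # irrelevant to this task
}

def retrieve(targets: set) -> set:
    """Transitive closure over the DAG: the minimal complete context."""
    selected, stack = set(), list(targets)
    while stack:
        m = stack.pop()
        if m not in selected:
            selected.add(m)
            stack.extend(DAG[m])
    return selected

context = retrieve({"train_agent"})
print(sorted(context))
# ['env', 'grid_metrics', 'net_model', 'power_flow', 'reward', 'train_agent']
# 'plotting' is excluded -> less noise in the context
```

Because the closure is computed over explicit edges, an incomplete selection is detectable mechanically: any module whose dependency is outside the selected set triggers the validator's dependency error.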
Importantly, the improvement is achieved not by “smarter models” but through execution structure. The DAG sets strict dependencies. The multi-agent cycle breaks the task into verifiable steps. Retrieval limits the context. Together, this reduces the likelihood of hidden errors — for example, when all modules are present, but the order of calls is violated or interfaces are incompatible.
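The "order of calls is violated" failure mode is exactly what an explicit DAG lets a validator catch. A stdlib sketch, with hypothetical step names, using `graphlib.TopologicalSorter` to derive a valid order and a positional check to reject an invalid one:

```python
# Sketch of the call-order check a DAG enables: even when all modules
# are present, an execution order that violates a dependency edge is
# rejected. Standard library only; step names are illustrative.

from graphlib import TopologicalSorter

DEPS = {  # step -> steps that must run before it
    "generate_data": [],
    "train_agent":   ["generate_data"],
    "power_flow":    ["train_agent"],
    "aggregate":     ["power_flow"],
}

def order_is_valid(order: list) -> bool:
    """Every dependency must appear before the step that needs it."""
    pos = {step: i for i, step in enumerate(order)}
    return all(pos[dep] < pos[step]
               for step, deps in DEPS.items() for dep in deps)

valid = list(TopologicalSorter(DEPS).static_order())
print(order_is_valid(valid))                                # True
print(order_is_valid(["train_agent", "generate_data",
                      "power_flow", "aggregate"]))          # False
```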
From a practical standpoint, this appears as an evolutionary improvement in DevEx for simulations. The engineer sets a goal: to train an RL agent, add grid-aware rewards, run an N–1 analysis. The system assembles the pipeline itself: data generation through EnergyPlus, training in CityLearn, calculations in Pandapower, aggregation of results. Network metrics are also included — voltage feasibility, thermal limits of lines, fault tolerance, short-circuit currents. This eliminates the bias towards building-only optimization.
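Two of the grid-side metrics mentioned above, voltage feasibility and thermal line limits, reduce to simple checks over power-flow results. The sketch below computes them from a mock result whose shape loosely mimics Pandapower's `res_bus` / `res_line` tables; the values and limits are illustrative:

```python
# Sketch of grid-side metric checks over a mock power-flow result.
# The dicts mimic the shape of Pandapower's res_bus / res_line tables;
# values and limits are illustrative.

res_bus = {"vm_pu": [1.01, 0.97, 0.93]}        # per-bus voltage, p.u.
res_line = {"loading_percent": [64.0, 101.5]}  # per-line thermal loading

def voltage_feasible(vm_pu, lo=0.95, hi=1.05):
    """All bus voltages within the allowed band."""
    return all(lo <= v <= hi for v in vm_pu)

def thermal_violations(loading_percent, limit=100.0):
    """Indices of lines loaded above their thermal limit."""
    return [i for i, l in enumerate(loading_percent) if l > limit]

print(voltage_feasible(res_bus["vm_pu"]))               # False: bus 2 at 0.93 p.u.
print(thermal_violations(res_line["loading_percent"]))  # [1]
```

Folding checks like these into the reward or the evaluation report is what makes the pipeline grid-aware rather than building-only.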
Limitations are also evident. Strong coupling of modules remains a source of failures: even with a correct set of components, minor discrepancies can break the entire pipeline. The second issue is the ambiguity of natural language. If requirements are stated implicitly, the agent may add unnecessary steps or misinterpret the goal. These errors are not always caught in early iterations.
For the industry, this is a signal: LLM automation of complex engineering pipelines requires not only RAG but also an explicit model of dependencies and a validation mechanism. DAG + multi-agent orchestration is a pragmatic pattern that can be transferred to other domains: data engineering, simulation platforms, CI/CD for scientific calculations. Without this, LLM remains a code generator. With it, it becomes part of an executable system.
Information source
arXiv is the largest open preprint repository (operating since 1991 under the auspices of Cornell University), where researchers quickly post working versions of papers. The materials are publicly accessible but do not undergo full peer review, so results should be treated as preliminary and, where possible, checked against updated versions or peer-reviewed journals. arxiv.org