AI code review integrated into CI/CD can eliminate the review bottleneck. This article discusses how orchestrating multiple agents reduces latency and noise.
The problem does not manifest immediately; it appears once the queue of merge requests begins to grow faster than the team can handle. Traditional code review catches bugs effectively but does not scale well: the reviewer switches context, leaves comments, the author responds, and the cycle repeats. As a result, the latency of the first response is measured in hours. The initial attempt to speed up the process with AI code review took a naive approach: one LLM, a large prompt, analyzing the diff. This produced the expected effect: noise, false positives, and context-free advice. In complex codebases, this approach degrades because the model cannot maintain task boundaries and does not understand priorities.
The solution shifted from “one smart model” to the orchestration of specialized agents within the CI/CD pipeline. Instead of a universal review, a set of narrow checks is used: security, performance, code quality, documentation, compliance. The key element is a coordinator that aggregates output, eliminates duplicates, and assesses severity. This is a compromise: more components and orchestration complexity, but less noise and higher accuracy. Additionally, cost control emerges — different tasks are assigned to models of varying levels, reducing overall token consumption without sacrificing quality at critical stages.
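The coordinator's core job, deduplicating overlapping findings from narrow agents and ranking them by severity, can be sketched as follows. This is a minimal illustration, not the actual implementation; the `Finding` shape, the agent-to-model mapping, and the severity names mirror the categories described above but are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    category: str   # "security", "performance", "quality", "docs", "compliance"
    severity: str   # "critical" | "warning" | "suggestion"
    file: str
    line: int
    message: str

# Cost control: cheaper models for low-stakes checks, stronger models where
# accuracy matters. Model names here are placeholders.
AGENT_MODELS = {
    "security": "large-model",
    "performance": "large-model",
    "quality": "small-model",
    "docs": "small-model",
}

SEVERITY_RANK = {"critical": 0, "warning": 1, "suggestion": 2}

def coordinate(raw_findings):
    """Collapse findings that point at the same location, keeping the most
    severe one, then order the result by severity."""
    best = {}
    for f in raw_findings:
        key = (f.file, f.line, f.message)
        if key not in best or SEVERITY_RANK[f.severity] < SEVERITY_RANK[best[key].severity]:
            best[key] = f
    return sorted(best.values(), key=lambda f: SEVERITY_RANK[f.severity])
```

The point of the sketch is the trade-off named above: each agent stays narrow, and the cross-cutting logic (dedupe, ranking) lives in exactly one place.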
The implementation relies on a plugin architecture. The entry point delegates configuration to plugins, each of which adds part of the behavior through a common ConfigureContext. This eliminates tight coupling: VCS, AI provider, and internal rules are isolated. The lifecycle is divided into three stages: bootstrap (non-fatal parallel steps), configure (critical sequential dependencies), and postConfigure (asynchronous fine-tuning). This design reduces the likelihood of a complete pipeline failure due to secondary errors and simplifies component replacement.
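The three-stage lifecycle can be sketched roughly as below. The stage names come from the article; everything else (a plain dict standing in for ConfigureContext, the plugin method names) is an illustrative assumption, not the real API.

```python
import concurrent.futures
import threading

def run_lifecycle(plugins, ctx):
    """Drive plugins through bootstrap -> configure -> postConfigure (sketch)."""
    # 1. bootstrap: parallel and non-fatal. A failing secondary step is
    #    recorded as a warning instead of taking the whole pipeline down.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futs = {pool.submit(p.bootstrap, ctx): p
                for p in plugins if hasattr(p, "bootstrap")}
        for fut in concurrent.futures.as_completed(futs):
            try:
                fut.result()
            except Exception as exc:
                ctx.setdefault("warnings", []).append(str(exc))
    # 2. configure: sequential, with hard dependencies; any failure propagates
    #    and aborts the pipeline.
    for p in plugins:
        p.configure(ctx)
    # 3. postConfigure: asynchronous fine-tuning, fire-and-forget.
    for p in plugins:
        if hasattr(p, "post_configure"):
            threading.Thread(target=p.post_configure, args=(ctx,), daemon=True).start()
```

The split matters for component replacement: swapping the VCS or AI provider plugin only touches what that plugin contributes to the shared context.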
Orchestration is performed in two layers. On the outside, a process coordinator launches OpenCode and passes data through stdin, bypassing ARG_MAX limitations for large merge requests. On the inside, a runtime plugin spins up parallel agent sessions. Each agent operates in isolation, reads only the relevant patch files, and returns structured results. This is crucial for throughput: there is no need to pass the full diff in each prompt, which significantly reduces token consumption and latency.
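The stdin trick is worth making concrete: a command line is capped by the kernel's ARG_MAX, while stdin is an unbounded stream. A minimal sketch (the command and payload are placeholders; the real coordinator launches OpenCode):

```python
import subprocess

def run_agent(cmd, payload):
    """Launch an agent process, streaming the potentially multi-megabyte
    payload (e.g. a large diff) via stdin rather than argv."""
    proc = subprocess.run(
        cmd,
        input=payload,       # goes to the child's stdin; no ARG_MAX limit
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout
```

Passing the same payload as an argv element would fail with `E2BIG` once it exceeds the platform limit; via stdin it simply streams through.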
A separate engineering detail is the use of JSONL instead of regular JSON for logging. In long CI tasks, the stream can break, and unclosed JSON becomes useless. JSONL solves this: each line is a valid object. This allows for real-time event reading, tracking usage, errors, and retry triggers without buffering the entire output. For example, if the model truncates a response due to max_tokens, the system automatically restarts the step.
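The JSONL consumer can be sketched in a few lines. The event field names (`tokens`, `stop_reason`) are illustrative assumptions; the essential property is that a truncated trailing line is skipped while every complete line remains usable.

```python
import json

def scan_events(lines):
    """Read a JSONL stream line by line, accumulating usage and flagging
    whether a truncated model response should trigger a retry."""
    usage, needs_retry = 0, False
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial/broken line: drop it, earlier events survive
        usage += event.get("tokens", 0)
        if event.get("stop_reason") == "max_tokens":
            needs_retry = True
    return usage, needs_retry
```

With a single monolithic JSON document, the same broken stream would make the entire log unparseable.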
A significant issue is user perception. Powerful models can “think” for a long time, which appears as a stalled job. A simple heartbeat message every 30 seconds almost completely eliminates false cancellations. This is an example of how UX impacts the reliability of the pipeline as much as architecture.
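A heartbeat wrapper is small enough to show in full. This is a generic sketch, not the project's actual code; the 30-second interval comes from the article.

```python
import threading

def with_heartbeat(work, interval=30, emit=print):
    """Run `work()` while emitting a periodic keep-alive line so a long
    model call does not look like a stalled CI job."""
    stop = threading.Event()

    def beat():
        while not stop.wait(interval):  # wakes early if stop is set
            emit("still working: waiting for model response...")

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        return work()
    finally:
        stop.set()
        t.join()
```

In CI, `emit` would write to the job log; anything that keeps bytes flowing is enough to prevent an operator (or a timeout watchdog) from cancelling the job.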
The quality of the review is ensured not so much by the models as by the constraints. Each agent receives a strict prompt indicating what to ignore. For example, the security check only marks vulnerabilities that are actually exploitable. Without this, the system turns into a generator of hypothetical risks that the team quickly begins to ignore. The output is standardized in XML with severity levels: critical, warning, suggestion. This allows for a direct link between the result and CI actions — from approval to blocking the merge.
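The link from standardized XML output to a CI action can be sketched directly. The tag and attribute names below are assumptions about the format, not the project's actual schema; only the three severity levels come from the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical agent output in the standardized XML shape.
SAMPLE = """<findings>
  <finding severity="critical" file="auth.py" line="42">SQL injection via raw query</finding>
  <finding severity="suggestion" file="utils.py" line="7">Rename ambiguous helper</finding>
</findings>"""

def decide(xml_text):
    """Map the highest severity present to a CI action."""
    root = ET.fromstring(xml_text)
    severities = {f.get("severity") for f in root.iter("finding")}
    if "critical" in severities:
        return "block"     # fail the pipeline, merge is not allowed
    if "warning" in severities:
        return "comment"   # surface findings, do not block
    return "approve"       # clean or suggestions only
```

Because the output is machine-readable rather than free-form prose, this mapping is a pure function of the findings, which is what makes "from approval to blocking the merge" automatable.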
The final decision is made by the coordinator. It performs deduplication, redistributes categories, and discards false positives. When uncertain, it re-reads the code. The policy leans towards passing: individual warnings do not block merges. At the same time, there remains an escape hatch: a break-glass comment that forcibly allows the changes through. This is important for production incidents, where latency matters more than formal correctness.
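The gate policy described above, pass-leaning with a break-glass override, fits in a few lines. The marker string is a hypothetical placeholder; the real comment syntax is not specified in the article.

```python
BREAK_GLASS_MARKER = "[break-glass]"  # assumed marker; actual syntax may differ

def merge_allowed(findings, mr_comments):
    """Only critical findings block the merge, and a break-glass comment on
    the merge request overrides even those."""
    if any(BREAK_GLASS_MARKER in c for c in mr_comments):
        return True  # production incident: latency beats formal correctness
    return not any(f["severity"] == "critical" for f in findings)
```

Warnings and suggestions never flip the outcome, which is exactly the pass-leaning policy: the gate only bites when the coordinator is confident the finding is critical.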
The results are described without specific metrics, but behaviorally the system does three things: it automatically approves clean changes, accurately identifies real bugs, and blocks dangerous merges. In addition, review latency drops and the cognitive load on engineers decreases thanks to noise filtering. The approach reads as an evolutionary improvement to CI/CD: the LLM does not replace review but becomes a filter and prioritizer.