Stripe has advanced its LLM agents to the point where they generate production-ready pull requests containing no human-written code. The key question is how to maintain reliability as autonomy increases.

The problem sits at the intersection of scale and responsibility. The system generates code changes for payment infrastructure, where the demands for correctness and compliance are high. As the proportion of automatically generated code increases, the risk of hidden defects grows, especially in an environment with numerous dependencies and integrations. Simple interactive assistants are not suitable here: they require constant engineer involvement and do not cover end-to-end tasks.

Stripe chose a model of autonomous agents that carry out an entire task from a single input description. This is a compromise between speed and control: it minimizes manual labor, but it requires mandatory human review and strict pipeline checks. Unlike tools such as Copilot, the agents act as executors rather than as suggestion engines. This shifts the boundary of responsibility: the system must account not only for code generation, but also for its verification, structure, and completeness (including tests and documentation).
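The control side of this trade-off can be pictured as a merge gate. The following is only a minimal sketch, not Stripe's implementation: the `PullRequest` fields, the `can_merge` function, and the specific checks are hypothetical names chosen for illustration. It shows the idea that an agent-authored pull request is mergeable only when the automated checks pass and at least one human has approved it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical summary of an agent-authored pull request."""
    ci_passed: bool
    static_analysis_passed: bool
    human_approvals: int


def can_merge(pr: PullRequest) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). Every gate must pass; none is optional."""
    reasons: list[str] = []
    if not pr.ci_passed:
        reasons.append("CI failed")
    if not pr.static_analysis_passed:
        reasons.append("static analysis failed")
    if pr.human_approvals < 1:
        # Human review is mandatory even when every automated check is green.
        reasons.append("missing human approval")
    return (not reasons, reasons)


# Usage: green checks alone are not enough without a reviewer.
allowed, reasons = can_merge(
    PullRequest(ci_passed=True, static_analysis_passed=True, human_approvals=0)
)
print(allowed, reasons)  # False ['missing human approval']
```

Returning the list of reasons rather than a bare boolean is a design choice for the sketch: it keeps a rejected pull request explainable to the reviewer or to the agent that produced it.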

The implementation is built around blueprints: workflows defined in code. A blueprint breaks a task into subtasks and decides where to use deterministic logic and where to use agentic loops. This confines the LLM's unpredictability to specific steps and keeps the system within predictable bounds. Task sources can include Slack, bug reports, or feature requests. The agent generates the code, tests, and documentation, after which a pull request is created. Reliability is ensured by the standard stack: CI/CD, automated tests, and static analysis. Additionally, the system is constrained by task type: it performs best on well-defined changes such as configuration, dependency updates, and minor refactoring. This is an explicit limitation of its scope.

As a result, the system has reached over 1,300 pull requests per week (up from 1,000). All code undergoes human review but contains no manual changes. Metrics on quality or defect rates are not disclosed, so the direct impact on reliability cannot be assessed. However, the architecture itself demonstrates a pragmatic approach: autonomy is increasing, but control remains at the pipeline and review level. This aligns with a broader industry trend of integrating LLM agents directly into CI/CD with an emphasis on verifiability rather than “magical” code generation.
