
Securing AI Agents in Kubernetes with Jobs

Securing AI agents in Kubernetes requires reevaluating fundamental assumptions: their dynamic behavior breaks established models of RBAC, network policy, and resource limits.

The problem does not manifest immediately; it appears once the autonomous agent starts behaving like a system without fixed boundaries. The classic Kubernetes model assumes a predictable set of dependencies, a stable load profile, and limited access to external APIs. An AI agent offers none of this: it decides for itself which data sources to query, which hypotheses to test, and which steps to take. The result is blurred trust boundaries, no way to write correct network policies, and no baseline for anomaly detection. The risk is compounded because a single container may simultaneously hold access to logs, metrics, network monitoring, and external APIs.

The solution turned out to be pragmatic: shift the unit of isolation from the service level to the level of individual investigations. Instead of a long-running Deployment, each agent task runs as its own Kubernetes Job. This is a compromise between startup speed and control: the startup overhead of a few seconds is negligible next to an execution that can take minutes, and in return each task gets strict isolation of resources, failures, and state. In parallel, Vault manages short-lived secrets to limit the blast radius in case of compromise.
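
The article does not show the manifests, but the pattern is easy to sketch with the official Kubernetes Python client. The image name, namespace, labels, and resource figures below are illustrative assumptions, not values from the source:

```python
# pip install kubernetes
import uuid
from kubernetes import client, config

def launch_investigation_job(task_id: str, namespace: str = "ai-agents") -> str:
    """Run one agent investigation as an isolated, self-cleaning Kubernetes Job."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    name = f"agent-investigation-{uuid.uuid4().hex[:8]}"

    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            name=name,
            labels={"app": "ai-agent", "task-id": task_id},  # ties logs/metrics to the task
        ),
        spec=client.V1JobSpec(
            backoff_limit=0,                 # a failed investigation is not retried blindly
            ttl_seconds_after_finished=600,  # finished Job objects garbage-collect themselves
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    service_account_name="ai-agent",  # single agent identity (see Vault below)
                    containers=[client.V1Container(
                        name="agent",
                        image="registry.example.com/ai-agent:latest",  # hypothetical image
                        args=["--task-id", task_id],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "500m", "memory": "512Mi"},
                            limits={"cpu": "2", "memory": "2Gi"},  # per-Job caps
                        ),
                    )],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
    return name
```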

Implementation hinges on the details of orchestration and access management. Each Job receives its own CPU and memory limits, eliminating contention between “heavy” and “light” investigations. Failure isolation operates at the Kubernetes level: one failed Job does not affect the others. A clean container state rules out context leaks and artifact accumulation. Logs and metrics are tied to a specific Job, simplifying tracing and auditing, as sketched below.
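
Because the Job controller stamps every pod it creates with a `job-name` label, audit tooling can scope log collection to a single investigation. A minimal sketch, assuming the same hypothetical namespace as above:

```python
from kubernetes import client, config

def collect_job_logs(job_name: str, namespace: str = "ai-agents") -> dict:
    """Fetch logs scoped to one investigation via the label the Job controller sets."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace=namespace,
        label_selector=f"job-name={job_name}",  # added automatically by Kubernetes
    )
    return {
        pod.metadata.name: core.read_namespaced_pod_log(
            name=pod.metadata.name, namespace=namespace
        )
        for pod in pods.items
    }
```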

Secrets become a separate architectural challenge. The agent needs keys to several domains: logging, metrics, network, LLM API. Storing these statically is risky, so Vault is used: authentication via service account, issuance of temporary credentials for the duration of the Job, automatic revocation on completion. This shrinks the attack window and removes the need to rotate secrets through redeployment. Here too a compromise is chosen: a single agent identity with per-domain policies instead of a unique identity for each Job, which simplifies operations but reduces attribution accuracy.
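
The flow maps onto Vault's Kubernetes auth method and a dynamic secrets engine. A sketch with the hvac client; the Vault address, role name, and secrets path are assumptions for illustration:

```python
# pip install hvac
import hvac

def fetch_job_credentials(role: str = "ai-agent") -> dict:
    """Authenticate as the Job's service account and pull short-lived credentials."""
    # Service-account token mounted into every pod by Kubernetes
    with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
        jwt = f.read()

    vault = hvac.Client(url="https://vault.example.com:8200")  # hypothetical address
    vault.auth.kubernetes.login(role=role, jwt=jwt)

    # A dynamic secrets engine issues leased credentials; the path is illustrative.
    lease = vault.read("database/creds/ai-agent-readonly")
    return {
        "username": lease["data"]["username"],
        "password": lease["data"]["password"],
        "lease_id": lease["lease_id"],        # revoked automatically when the TTL expires
        "ttl_seconds": lease["lease_duration"],
    }
```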

A separate layer is the trust model. The transition to autonomy is divided into phases, from read-only analysis to fully automatic remediation. The key point is that decisions are driven not by a roadmap but by operational signals. The trust metric is not accuracy but operator behavior: how often operators override the agent’s output. This reduces the risk of premature automation and allows trust to evolve in a controlled way.
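
The article gives no formula, but the override-rate signal it describes is straightforward to track. A hypothetical sketch of gating phase promotion on observed operator behavior:

```python
from dataclasses import dataclass

@dataclass
class TrustSignal:
    """Track how often operators override agent output; gate autonomy phases on it."""
    total: int = 0
    overridden: int = 0

    def record(self, operator_changed_output: bool) -> None:
        self.total += 1
        self.overridden += operator_changed_output  # bool counts as 0 or 1

    @property
    def override_rate(self) -> float:
        # With no data yet, assume no trust rather than full trust.
        return self.overridden / self.total if self.total else 1.0

    def ready_for_next_phase(self, threshold: float = 0.05, min_samples: int = 100) -> bool:
        # Promote (read-only analysis -> suggested fixes -> auto-remediation)
        # only on operational evidence, never on a roadmap date.
        return self.total >= min_samples and self.override_rate <= threshold
```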

The results are qualitative. The system becomes more predictable under unpredictable workloads. Isolation through Jobs eliminates a whole class of problems around resource contention and hung processes. Vault reduces the blast radius and simplifies secret rotation. The trade-offs: increased orchestration complexity, more objects in the cluster, and the need for more mature observability. No performance numbers are given, but the architectural effects match expectations for such patterns.

The main conclusion: AI agents are a new class of workload. They require a reevaluation of fundamental principles of Kubernetes security and management. Attempting to fit them into the microservices model leads to hidden failures that manifest in production.

Read more – InfoQ
