AI self-healing networks reduce MTTR from hours to minutes. We analyze how Telstra implemented autonomous recovery in a production environment.

The problem manifests at the moment of infrastructure failure, when telco systems rely on manual response. When a key component degrades, an engineer spends hours on triage: collecting signals, correlating events, and selecting a recovery strategy. This increases recovery latency and directly impacts user experience. In such systems, the limit is reached not at the hardware level, but at the operational process level.

Telstra has opted for an AI self-healing network, where incident resolution becomes part of an automated loop. The choice favors a multivendor architecture with an AI-native layer. This is a compromise: integration complexity increases, but independence from specific vendors and the ability to scale automation emerge. The key idea is to combine AI (decision layer) and automation (execution layer) to eliminate the manual step from the critical path.

The architecture is divided into three layers. OpenShift serves as the execution platform for cloud-native network functions (CNFs). OpenShift AI plays the role of the intelligence layer, where the AI agent analyzes anomalies and selects a strategy. The Ansible Automation Platform executes actions, ensuring deterministic application of changes. Interaction is built through cross-platform integrations: the AI agent accesses external systems and the Knowledge DB via MCP servers. After selecting a scenario, it initiates automatic remediation—such as migrating workloads to healthy infrastructure. Control is ensured through Policy as Code (PaC), RBAC, and audit trails, reducing the risk of uncontrolled changes.

A separate phase is the transition from assistive AI to a fully autonomous cycle. In the first phase, generative AI aggregates data from various vendors and assists the operator through a unified interface. In the second phase, the system closes the loop: detection → analysis → decision → execution without human involvement. This is a key shift because it eliminates the delay between detection and action.

The result manifests not in the architecture, but in the system’s behavior under load. In Telstra’s demonstration, hardware failure affecting CNFs did not lead to noticeable service degradation. The AI agent detected the issue early and initiated traffic migration to healthy nodes within minutes. Metrics are not disclosed, but a reduction in recovery time from hours to minutes and an increase in platform stability are claimed. For the end user, this means the incident remains virtually unnoticed.

Importantly, this architecture changes not only MTTR but also the operational model itself. The engineer is no longer at the center of the incident. Their role shifts to defining policies, monitoring models, and validating automation. This reduces operational burden but increases the requirements for the quality of rules and training data.

From an engineering perspective, this is an evolutionary improvement rather than a radical shift. All components—AI, automation, policy control—are already known in the industry. The novelty lies in their integration and in transferring decision-making inside the system. The main risk is trust in automated actions in production. Therefore, the presence of an audit trail and strict policies is not an option but a fundamental requirement.

Ultimately, Telstra demonstrates a practical model where AI self-healing networks become part of real operations rather than a laboratory experiment. Limitations and metrics remain outside the scope, but the architectural approach already sets the direction for telco and other high-load systems.

Read

AI self-healing networks reduce MTTR from hours to minutes. We analyze how Telstra implemented autonomous recovery in a production environment.

🚀 Deploy the Blocks