Root cause analysis (RCA) hinges on scale and the human factor. Meta’s approach with DrP demonstrates how to turn debugging into a reproducible engineering process.
The problem does not manifest while a system is small; it emerges once the organization reaches scale. Incidents begin to recur, but each time they are investigated from scratch. Knowledge of where to look for the cause resides in the heads of specific engineers. Runbooks go stale faster than they are updated. Ad-hoc scripts help locally but do not stand the test of time: they are untested, do not cross service boundaries, and become yet another form of "closed knowledge." As a result, root cause analysis turns into a manual and unpredictable process with a high MTTR (mean time to resolve).
Meta approached this problem from an engineering perspective. Instead of improving documentation or coordinating people, they focused on the investigation process itself. The key idea is to formalize debugging as code. The DrP platform's core entity is the analyzer, a programmable investigation workflow. This is not just a script: an analyzer undergoes code review, is validated through backtesting, and is delivered via CI/CD. The approach creates a clear trade-off: more engineering work at development time, but less chaotic load during an incident.
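To make the idea concrete, here is a minimal sketch of what an analyzer might look like as reviewable, testable code. DrP's internal SDK is not public, so the names `Analyzer`, `Finding`, and the step-based structure are illustrative assumptions, not Meta's actual API.

```python
# Hypothetical sketch of an "analyzer": an investigation workflow expressed
# as ordered, individually testable diagnostic steps. Not DrP's real SDK.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """A structured diagnosis returned by an investigation step."""
    cause: str
    confidence: float
    evidence: dict = field(default_factory=dict)


@dataclass
class Analyzer:
    """A debugging workflow as code: reviewable, versioned, CI-deployable."""
    name: str
    steps: list  # each step: callable(alert_context: dict) -> Optional[Finding]

    def run(self, alert_context: dict) -> Optional[Finding]:
        # Execute diagnostic steps in order; stop at the first root cause found.
        for step in self.steps:
            finding = step(alert_context)
            if finding is not None:
                return finding
        return None
```

Because each step is a plain function, individual checks can be unit-tested and reused across analyzers, which is exactly what an ad-hoc shell script cannot offer.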
The implementation is built around an SDK and a set of standard primitives. The engineer describes what data to collect, what anomalies to look for, and what dependencies to check. Built-in libraries cover basic patterns: anomaly detection, time series correlation, dimension analysis. This reduces duplication and makes the behavior of analyzers predictable. An important point is backtesting. Before deployment, it is possible to check whether the analyzer would have detected past incidents. This turns debugging into a verifiable hypothesis rather than intuition.
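The backtesting idea can be sketched in a few lines: replay recorded incidents with known root causes through the analyzer and measure how many it would have diagnosed correctly. The incident record format below is an assumption for illustration, not DrP's actual schema.

```python
# Hypothetical backtesting harness: verify an analyzer against history
# before deploying it. The incident schema here is an illustrative guess.

def backtest(analyzer, incidents):
    """Return the fraction of past incidents the analyzer diagnoses correctly.

    analyzer: callable(context: dict) -> cause string or None
    incidents: list of {"context": dict, "known_cause": str}
    """
    hits = 0
    for incident in incidents:
        finding = analyzer(incident["context"])
        if finding is not None and finding == incident["known_cause"]:
            hits += 1
    return hits / len(incidents)


# Toy analyzer: blames a config change if one is present in the context.
def toy_analyzer(context):
    return "config change" if context.get("config_changed") else None
```

A score below some agreed threshold would block the deployment, turning "would this have caught last month's outage?" from a guess into a gate in CI.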
The key difference from ordinary scripts appears at the architectural level. DrP supports chaining between services. In a microservices system, the symptom almost never coincides with the cause. The analyzer of an API service can pass context to the analyzer of the storage layer and receive confirmation of the root cause. This removes team boundaries at the diagnostic level. The system also integrates into the alert lifecycle: the investigation is triggered automatically, and the result is attached to the alert before an engineer ever looks at it.
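Chaining can be sketched as an upstream analyzer that, failing to explain a symptom locally, forwards its context to a registered downstream analyzer. The registry, the `api_analyzer`/`storage_analyzer` names, and the context shape are all hypothetical; they only illustrate the hand-off pattern.

```python
# Hypothetical sketch of cross-service chaining: when the API-layer analyzer
# cannot explain the symptom locally, it delegates to the storage-layer
# analyzer, forwarding shared context (time window, affected region).

DOWNSTREAM = {}  # service name -> analyzer callable


def register(service, analyzer):
    DOWNSTREAM[service] = analyzer


def api_analyzer(context):
    if context.get("api_errors_local"):
        # Symptom explained within this service's own boundary.
        return {"cause": "bad API deploy", "service": "api"}
    # Not explained locally: chain into the storage layer with shared context.
    downstream = DOWNSTREAM.get("storage")
    if downstream is not None:
        return downstream({"window": context["window"],
                           "region": context["region"]})
    return None
```

The important property is that the boundary crossing is explicit and typed by the shared context, so neither team needs to know the other's internals, only the contract between their analyzers.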
A typical scenario demonstrates the behavior of the system. As the error rate increases, an alert fires and the analyzer is launched. It segments metrics by region and isolates the problem. Then it correlates the spike with change events such as deployments or configuration changes. On finding a match, it calls a downstream analyzer and passes the context. As a result, the system returns a structured output: cause, timestamp, affected region, and a link to the change. Post-processing can create a rollback task. The engineer no longer hunts for the problem; they validate a proposed diagnosis.
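The scenario above can be compressed into one hypothetical function: segment the error spike by region, correlate it with recent changes in that region, and emit a structured result an engineer can validate. The 30-minute correlation window, field names, and record shapes are illustrative assumptions.

```python
# Hypothetical end-to-end investigation for the scenario described above.
# Thresholds, field names, and record shapes are illustrative, not DrP's.
from datetime import datetime, timedelta


def investigate(error_rates, deployments, spike_time):
    """error_rates: {region: rate}; deployments: [{"id", "region", "time"}]."""
    # 1. Dimension analysis: which region carries the spike?
    region = max(error_rates, key=error_rates.get)
    # 2. Correlate the spike with changes shortly before it (assumed window).
    for deploy in deployments:
        lag = spike_time - deploy["time"]
        if deploy["region"] == region and timedelta(0) <= lag <= timedelta(minutes=30):
            # 3. Structured output: cause, timestamp, region, link to change.
            return {
                "cause": "deployment",
                "change": deploy["id"],
                "region": region,
                "timestamp": deploy["time"].isoformat(),
            }
    return {"cause": "unknown", "region": region}
```

The structured dictionary is the point: it is machine-readable enough for post-processing (e.g. filing a rollback task) yet specific enough for a human to confirm or reject quickly.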
The results show a pragmatic effect. According to Meta, DrP reduces MTTR by 20–80% and performs tens of thousands of automatic analyses daily. Over 2000 analyzers are used in production. While metrics are important, they are secondary compared to the architectural shift. Investigation becomes part of the system rather than a side process.
There are also limitations. An analyzer is code that must be maintained: when the system it diagnoses changes, the analyzer must be updated too. Full automation is deliberately not the goal; the engineer remains in the decision-making loop, a conscious compromise between speed and control. Moreover, adopting the model requires process maturity: without code review and CI/CD, it loses its meaning.
In a broader context, this reflects the evolution of approaches. The industry has moved from tribal knowledge to runbooks, then to scripts, and now to composable analysis systems. Many teams are still at the level of documentation or local automation. The DrP approach shows the next step — to make debugging part of the architecture.
The practical takeaway is simple. If knowledge about failures is not formalized in the system, it degrades. Root cause analysis as code allows for the preservation, verification, and scaling of this knowledge. The question is not about the tool but about the model: does your debugging process remain in people’s heads or become part of the infrastructure?