B2B Engineering Insights & Architectural Teardowns

Multi-region HA and sovereign fault domains

Multi-region architecture changes the high availability model. The concern is no longer only the failure of an availability zone (AZ), but the failure of an entire region as a single fault domain.

The classical high availability model is built around multi-AZ deployments: failures of hardware, network, or data centers are isolated within a region, and the region itself is treated as a reliable failure boundary. This model assumes that a region fails only for technical reasons and that recovery is managed by the provider. Geopolitics breaks this assumption. Internet outages, sanctions, data transfer restrictions, or physical damage to infrastructure create correlated failures, where the entire region becomes unavailable at once and without a predictable recovery time.

In this context, a more precise model emerges: the sovereign fault domain (SFD). This is a failure boundary defined not by architecture, but by jurisdiction, physical infrastructure, and political context. Unlike AZs, SFDs cannot be “designed” or controlled; they exist independently of the system. This changes the framing of the problem: instead of “what if an AZ goes down?”, the question becomes “what if a region becomes legally or physically inaccessible?”. Many systems simply have no answer to this question, because their tooling and runbooks were not designed for it.

The solution is to adopt multi-region architecture as the baseline level of fault tolerance for systems sensitive to such risks. Here, there is a choice between active-passive and active-active. Active-passive provides a simpler model and tighter control over consistency, but it increases the recovery time objective (RTO) because failover involves several steps (DNS changes, health checks, promotion of replicas). Active-active reduces RTO to nearly zero but requires dealing with eventual consistency and complicates the operational model. This is a classic trade-off: latency and simplicity versus availability and recovery speed.

Implementation hinges on details that are often underestimated. Failover is not a single action but a chain of delays. First, detection through health checks, which can take tens of seconds. Then, DNS propagation, dependent on TTL and resolver behavior. Finally, database promotion, where the delay depends on replication lag. In real-world conditions, it is this last stage that becomes a source of surprises. Tests with zero lag rarely reflect behavior under load.
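The chain of delays described above can be sketched as a simple additive budget. This is a minimal illustration, not a measurement method: the class name `FailoverBudget`, its fields, and the sample numbers are all hypothetical, and it assumes the stages run strictly one after another, that resolvers honor the TTL, and that the replica must replay its outstanding lag before it can be promoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverBudget:
    """Hypothetical inputs for a pessimistic failover estimate (seconds)."""
    health_check_interval_s: float  # time between health probes
    failure_threshold: int          # consecutive failed probes before failover
    dns_ttl_s: float                # TTL of the DNS record being switched
    replication_lag_s: float        # replica lag at the moment of failure
    promotion_overhead_s: float     # fixed cost of promoting the replica

    def detection_s(self) -> float:
        # The region is declared down only after N consecutive failed probes.
        return self.health_check_interval_s * self.failure_threshold

    def promotion_s(self) -> float:
        # Assumption: the replica replays its backlog before accepting writes.
        return self.replication_lag_s + self.promotion_overhead_s

    def estimated_rto_s(self) -> float:
        # Assumption: stages are sequential and resolvers respect the TTL.
        return self.detection_s() + self.dns_ttl_s + self.promotion_s()


# Same configuration, tested idle (zero lag) and under load (45 s of lag).
idle = FailoverBudget(10, 3, 60, 0, 30)
loaded = FailoverBudget(10, 3, 60, 45, 30)
print(idle.estimated_rto_s(), loaded.estimated_rto_s())
```

With these made-up numbers, the only term that changes between the idle test and the loaded case is replication lag, which is why a drill run with zero lag can understate the real recovery time.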

An additional layer of complexity is data. Geo-distributed systems make the CAP trade-off explicit at the regional level. Strong consistency requires synchronous replication, which increases latency proportionally to distance. Therefore, the typical compromise is strong consistency within a region and eventual consistency between regions. But this works only if the system explicitly considers jurisdictional boundaries. Otherwise, replication, intended as a reliability mechanism, becomes a source of compliance risks.
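To make the distance cost concrete, here is a rough lower bound on the latency a synchronous cross-region commit adds. It assumes light travels through optical fiber at roughly 200 km per millisecond (about two thirds of its speed in vacuum) and that one commit needs at least one round trip; the function name and the sample distances are illustrative, and real networks add routing, queuing, and processing time on top.

```python
FIBER_KM_PER_MS = 200.0  # approximate signal speed in optical fiber


def min_sync_commit_ms(distance_km: float, round_trips: int = 1) -> float:
    """Lower bound on added commit latency from propagation delay alone."""
    # Each round trip covers the distance twice: request out, acknowledgment back.
    return round_trips * 2 * distance_km / FIBER_KM_PER_MS


# Illustrative distances: nearby zones versus regions on different continents.
for km in (100, 1500, 6000):
    print(f"{km} km -> at least {min_sync_commit_ms(km):.1f} ms per commit")
```

The bound grows linearly with distance, which is the reason synchronous replication is usually kept inside a region while replication between regions is asynchronous.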

Practical implementation requires “sovereignty awareness” in the data layer. This can be achieved through database capabilities (e.g., placement policies or locality constraints) or at the application level by routing writes according to jurisdiction tags. The key idea is that a write must be acknowledged within its own jurisdiction, not in an abstract “global” system. Systems that do not distinguish these levels typically discover the problem during an incident rather than at the design stage.
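An application-level version of this idea might look like the sketch below. Everything in it is hypothetical (the `Region` type, the `route_write` function, the `SovereigntyError` exception, and the in-memory `store`); it only illustrates the rule that a write is acknowledged inside its jurisdiction and is rejected rather than silently redirected across a border.

```python
from dataclasses import dataclass, field


class SovereigntyError(Exception):
    """Raised when no healthy region exists inside the required jurisdiction."""


@dataclass
class Region:
    name: str
    jurisdiction: str  # tag such as "EU" or "US" (illustrative values)
    healthy: bool = True
    store: dict = field(default_factory=dict)  # stand-in for a real database


def route_write(regions: list[Region], key: str, value: str, jurisdiction: str) -> str:
    """Commit the write in a healthy region of the given jurisdiction."""
    for region in regions:
        if region.jurisdiction == jurisdiction and region.healthy:
            region.store[key] = value
            return region.name  # acknowledged inside the jurisdiction
    # Deliberately no fallback to another jurisdiction: fail loudly instead.
    raise SovereigntyError(f"no healthy region in jurisdiction {jurisdiction!r}")


regions = [Region("eu-1", "EU"), Region("eu-2", "EU"), Region("us-1", "US")]
print(route_write(regions, "user:42", "profile", "EU"))  # handled by eu-1
regions[0].healthy = False
print(route_write(regions, "user:43", "profile", "EU"))  # fails over to eu-2
```

Failing the write when the whole jurisdiction is down is a design choice in this sketch: it trades availability for compliance, so the decision is made at design time instead of during an incident.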

The result of such a shift is a more realistic failure model. Multi-AZ is no longer a sufficient answer to the question of high availability in global systems. Multi-region becomes necessary, but it brings higher cost, more operational complexity, and stricter consistency requirements. The original material does not quantify the gains, which depend on implementation. Qualitatively, the main change is that the system begins to treat the failure of a region as a normal scenario rather than an exception.

Read more – InfoQ
