B2B Engineering Insights & Architectural Teardowns

Cross-site replication in PXC without loss of resilience

Cross-site replication for Percona XtraDB Cluster addresses the DR challenge without the risk of overloading the primary cluster. We analyze where limitations arise and how they are circumvented in Kubernetes.

The problem does not manifest immediately; it appears once fault-tolerance requirements outgrow a single data center. This is particularly sensitive for Percona XtraDB Cluster (PXC), because the cluster relies on virtually synchronous (Galera) replication. Adding a remote DR site directly to the same cluster increases commit latency and can trigger flow control. In Kubernetes this is further complicated by network abstraction and dynamic endpoints. The architecture therefore faces a trade-off: either consistency and synchrony, or geo-distributed resilience.

The chosen approach splits the topology into two clusters: a primary (DC) and a backup (DR), with asynchronous replication between them. This relieves pressure on the primary cluster and isolates network delays. The key tool is the Percona Operator for MySQL, which simplifies cross-site replication setup through a Kubernetes custom resource (CR). The trade-off is explicit: replication lag is introduced, but the system gains stability and manageability. In addition, an automatic failover mechanism for the asynchronous replication connection lets DR switch between DC nodes during failures.
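As a sketch of what this looks like in practice, the source-side (DC) cluster can declare a replication channel in its CR via the operator's `replicationChannels` field. The cluster name `cluster-dc` and the channel name `dc_to_dr` are illustrative assumptions, not values from the article:

```yaml
# Hedged sketch of the source-side (DC) custom resource, assuming the
# Percona Operator's replicationChannels API; names are placeholders.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: cluster-dc
spec:
  pxc:
    size: 3
    expose:                 # give each DC node an external endpoint
      enabled: true
      type: LoadBalancer
    replicationChannels:
      - name: dc_to_dr      # the async channel toward the DR site
        isSource: true      # this side produces the replication stream
```

The `expose` section is what provides the per-node external IPs that the DR side will later connect to.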

Implementation begins with the DC cluster: three PXC nodes, each exposed via an external endpoint (external IP) reachable from DR. The CR includes the parameters for inter-cluster replication. Next, a backup is taken and used as the starting point for DR. On the DR side, a similar PXC cluster is deployed in a separate Kubernetes environment. After the data is restored, a replication channel is added that lists all external IPs of the source.
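The DR-side CR mirrors the channel but consumes it, listing every external IP of the DC nodes in `sourcesList`. This is a hypothetical fragment: the IPs are documentation placeholders and the names repeat the assumptions above:

```yaml
# Hypothetical DR-side CR fragment; IPs and names are placeholders.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: cluster-dr
spec:
  pxc:
    size: 3
    replicationChannels:
      - name: dc_to_dr
        isSource: false          # this side is the replica
        sourcesList:             # all EXTERNAL IPs of the DC nodes
          - host: 203.0.113.10
            port: 3306
            weight: 100
          - host: 203.0.113.11
            port: 3306
            weight: 100
          - host: 203.0.113.12
            port: 3306
            weight: 100
```

Listing all three sources is what later enables the channel to fail over between DC nodes.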

An important detail surfaces in practice: authentication. With caching_sha2_password, a secure connection (SSL/TLS) is required; without it, replication does not start. The workaround is RSA key-pair-based password exchange, enabled via GET_SOURCE_PUBLIC_KEY or SOURCE_PUBLIC_KEY_PATH. After that, DR connects to DC correctly. Inside the DR cluster, nodes still synchronize via Galera, so replication there remains synchronous.
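The equivalent manual configuration can be expressed in plain MySQL 8.0 syntax. This is illustrative only; the host, user, and channel name are placeholders, not values from the article:

```sql
-- Illustrative sketch: setting up the channel with RSA key-pair
-- password exchange for a caching_sha2_password replication user.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '203.0.113.10',   -- placeholder external IP of a DC node
  SOURCE_PORT = 3306,
  SOURCE_USER = 'replication',
  SOURCE_PASSWORD = '***',
  SOURCE_AUTO_POSITION = 1,
  GET_SOURCE_PUBLIC_KEY = 1       -- fetch the RSA public key from the source
  FOR CHANNEL 'dc_to_dr';

START REPLICA FOR CHANNEL 'dc_to_dr';
```

GET_SOURCE_PUBLIC_KEY lets the replica obtain the source's RSA public key during the handshake, so the password is never sent in cleartext even without TLS.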

The failover mechanism operates at the level of asynchronous replication. If one DC node becomes unavailable, DR switches to another based on priority (weight) and node order. This reduces the risk of losing connectivity entirely. It is important to understand, however, that this does not eliminate replication lag; it only improves the availability of the channel.
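The priority mechanism can be sketched with unequal weights in `sourcesList`: the replica prefers the heavier entry and falls back to the others only when it becomes unreachable. The values below are assumptions for illustration:

```yaml
# Hedged sketch: unequal weights steer the DR replica toward a
# preferred DC node; IPs and weights are placeholders.
replicationChannels:
  - name: dc_to_dr
    isSource: false
    sourcesList:
      - host: 203.0.113.10   # preferred source
        port: 3306
        weight: 100
      - host: 203.0.113.11   # fallback, used only on failure
        port: 3306
        weight: 50
```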

The result is a workable DR architecture for Kubernetes with manageable complexity. The primary cluster is unaffected by remote nodes, while DR receives data with an acceptable delay. The source material provides no metrics, so lag and throughput cannot be quantified. Nevertheless, the scheme reflects a typical industry compromise: full synchronization is traded for resilience and scalability.

It is also worth noting the choice of the initial data-loading method. The example uses a physical backup via XtraBackup, but logical tools such as mysqldump or mydumper are also acceptable. This affects recovery time and consistency, but does not change the architectural principle.
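With the operator, the seed backup itself can be requested declaratively. A minimal sketch, assuming a backup storage named `s3-backup-storage` has already been configured in the cluster CR:

```yaml
# Illustrative backup request via the operator's backup CR;
# the cluster and storage names are assumptions.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: dc-seed-backup
spec:
  pxcCluster: cluster-dc
  storageName: s3-backup-storage
```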

In conclusion, the architecture appears as an evolutionary solution: PXC remains synchronous within the DC, while an asynchronous layer is introduced between DC and DR. This reduces degradation risks and provides a controlled fault tolerance model for Kubernetes.
