Configure Apache Cassandra for disaster recovery

How do you configure Apache Cassandra for disaster recovery, so that the cluster keeps working when one of two data centers fails?

The DataStax documentation talks about using a replication strategy that ensures at least one replica is written to each of your two data centers. But I don't see how this helps when a catastrophe actually happens. If you switch to the remaining data center, all your writes will fail, because they can no longer be replicated to the other data center.

I think you would want your software to operate in two modes: a normal mode, in which writes must be replicated to both data centers, and a distress mode, in which they need not be. But changing the replication strategy on the fly is not possible.

What I really want is two data centers that back each other up: during normal operation, use the resources of both data centers, and when only one data center is left, use the resources of that one alone (with reduced performance).

1 answer

The trick is to change the consistency level passed to the API for writes, instead of changing the replication factor. Use LOCAL_QUORUM for writes during a disaster, when only one data center is available. During normal operation, use EACH_QUORUM so that both data centers have a copy of the data. Reads can use LOCAL_QUORUM all the time.
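As a minimal sketch of that switch (plain Python, no driver involved): the function name `write_consistency` and the `remote_dc_available` flag are hypothetical, and how your application detects that the other data center is down is left open here. With the DataStax Python driver, the returned names correspond to members of `cassandra.ConsistencyLevel`.

```python
# Reads stay local in both modes.
READ_CONSISTENCY = "LOCAL_QUORUM"


def write_consistency(remote_dc_available: bool) -> str:
    """Return the consistency level name to use for writes.

    remote_dc_available is a hypothetical flag supplied by the
    application (for example from monitoring or an operator switch).
    """
    if remote_dc_available:
        # Normal mode: a quorum in every data center must acknowledge.
        return "EACH_QUORUM"
    # Distress mode: only a quorum in the local data center must acknowledge.
    return "LOCAL_QUORUM"
```

The replication settings of the keyspace stay the same in both modes; only the per-request consistency level changes.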

The following is a brief summary drawn from the DataStax documentation on multiple data centers and from the older, but still conceptually relevant, disaster recovery documentation (0.7).

Build a recipe that meets your needs from the two consistency levels LOCAL_QUORUM and EACH_QUORUM.

Here, "local" means local to a single data center, and "each" means that consistency is strictly maintained at the same level in each data center.

Suppose you have two data centers, one of which is used exclusively for disaster recovery. You can then set the replication factor to ...

3 for the primary read/write data center and 2 for the failover data center
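To make those numbers concrete, here is a small sketch that works out the quorum size per data center (quorum is floor(RF / 2) + 1) and builds the matching keyspace statement. The keyspace name `app` and the data center names `DC1` and `DC2` are assumptions for illustration; in a real cluster the names must match what your snitch reports.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def keyspace_cql(name: str, replication: dict) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replication.items())
    return (f"CREATE KEYSPACE {name} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {options}}};")


# Assumed layout: 3 replicas in the primary DC, 2 in the failover DC.
replication = {"DC1": 3, "DC2": 2}

# Replicas that must acknowledge a quorum operation in each DC.
quorums = {dc: quorum(rf) for dc, rf in replication.items()}

# Replicas that may be down in each DC while a quorum is still possible.
tolerated = {dc: rf - quorums[dc] for dc, rf in replication.items()}

statement = keyspace_cql("app", replication)
```

Note what falls out of the arithmetic: with 2 replicas in the failover data center, a quorum there is also 2, so a quorum operation in DC2 cannot tolerate any replica being down.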

Now, depending on how important it is that your data is actually written to the failover nodes, you can use EACH_QUORUM or LOCAL_QUORUM. Assuming you are using the NetworkTopologyStrategy (NTS) replication strategy:

LOCAL_QUORUM on writes makes the client wait only for the local write to DC1; the write to your recovery node(s) in DC2 happens asynchronously.

EACH_QUORUM guarantees that all data is replicated to both data centers, but delays each write until both DCs have acknowledged it.

For reads, it is best to use LOCAL_QUORUM to avoid inter-data center latency.

There are caveats to this approach! If you decide to use EACH_QUORUM for your writes, you increase the number of potential points of failure (DC2 is down, the DC1-DC2 link is down, or a quorum cannot be reached in DC1).
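Those failure points can be illustrated with a simplified model (not how Cassandra is implemented internally, just the acknowledgement rule). The function `write_succeeds`, its parameters, and the replication factors of 3 and 2 are assumptions for the sake of the example.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def write_succeeds(level, local_dc, replication, live, reachable):
    """Simplified check of whether a write can be acknowledged.

    replication: data center name -> replication factor
    live:        data center name -> number of live replicas
    reachable:   set of data centers the coordinator can reach
    """
    def dc_has_quorum(dc):
        return dc in reachable and live[dc] >= quorum(replication[dc])

    if level == "LOCAL_QUORUM":
        return dc_has_quorum(local_dc)
    if level == "EACH_QUORUM":
        return all(dc_has_quorum(dc) for dc in replication)
    raise ValueError(f"unsupported consistency level: {level}")


replication = {"DC1": 3, "DC2": 2}
both = {"DC1", "DC2"}

# DC2 is down: EACH_QUORUM writes fail, LOCAL_QUORUM writes in DC1 still work.
dc2_down = {"DC1": 3, "DC2": 0}
each_when_dc2_down = write_succeeds("EACH_QUORUM", "DC1", replication, dc2_down, both)
local_when_dc2_down = write_succeeds("LOCAL_QUORUM", "DC1", replication, dc2_down, both)

# The DC1-DC2 link is down: DC2 is healthy but unreachable from DC1.
all_live = {"DC1": 3, "DC2": 2}
each_when_link_down = write_succeeds("EACH_QUORUM", "DC1", replication, all_live, {"DC1"})
```

In this model a write at EACH_QUORUM fails in every one of the listed scenarios, while LOCAL_QUORUM only fails when the local data center itself cannot form a quorum.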

Bonus: once DC1 goes down, you have a valid DC2 to recover from. Also note that the second link covers custom snitch settings for routing your IP addresses correctly.
