Configuring Redis HA on AWS with sentinels - Redis nodes seen by different clusters' sentinels end up in an endless loop

Our setup

  • 3x Redis sentinels, one per AWS Sydney AZ
  • 2 to 500 Redis nodes (one master plus several slaves) that scale horizontally and automatically using AWS Auto Scaling group policies
  • 1x Write ELB that directs traffic to the master
  • 1x Read ELB that directs traffic to the slaves
  • 1x Sentinel ELB that directs traffic to the sentinels
  • 1x Broker (more on this below)

This setup is replicated across two clusters for what we call metadata and cache, and we want to scale out to more clusters.
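
For context, each cluster's sentinels are configured along these lines (a minimal sketch; the IP, quorum and timings are illustrative, and in reality each cluster monitors its own master name):

```
# sentinel.conf (illustrative values)
port 26379
sentinel monitor mymaster 172.24.249.152 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
```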

Broker

The Broker is a Python daemon we wrote that subscribes to the sentinels' pub/sub and listens for +switch-master messages. These are the actions the Broker takes (a simplified sketch follows the list):

  • Detects a failover from the +switch-master event
  • Asks the sentinels for the new master using SENTINEL get-master-addr-by-name mymaster
  • Tags the old master with RedisClusterNodeRole = slave
  • Tags the new master with RedisClusterNodeRole = master
  • Adds the new master to our Write ELB
  • Removes the new master from our Read ELB
  • Removes the old master from our Write ELB
  • Tries to add the old master to our Read ELB (this will fail if the server is down, which is fine)
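
In essence the Broker is just glue between the sentinel pub/sub channel and the AWS APIs. A simplified sketch of the idea, not our actual code (the sentinel endpoint, ELB names and instance-ID lookup are illustrative):

```python
import boto3
import redis

# Illustrative names/endpoints, not our exact production values.
SENTINEL_HOST = "sentinel-elb.internal"
SERVICE_NAME = "mymaster"
WRITE_ELB = "redis-write-elb"
READ_ELB = "redis-read-elb"

elb = boto3.client("elb")
ec2 = boto3.client("ec2")
sentinel = redis.Redis(host=SENTINEL_HOST, port=26379, decode_responses=True)


def instance_id_for_ip(private_ip):
    """Resolve an EC2 instance id from its private IP address."""
    reply = ec2.describe_instances(
        Filters=[{"Name": "private-ip-address", "Values": [private_ip]}]
    )
    return reply["Reservations"][0]["Instances"][0]["InstanceId"]


def tag_role(instance_id, role):
    """Tag the instance with RedisClusterNodeRole = master|slave."""
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "RedisClusterNodeRole", "Value": role}],
    )


def move_between_elbs(instance_id, add_to, remove_from):
    """Register the instance with one classic ELB and deregister it from the other."""
    elb.register_instances_with_load_balancer(
        LoadBalancerName=add_to, Instances=[{"InstanceId": instance_id}]
    )
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=remove_from, Instances=[{"InstanceId": instance_id}]
    )


def handle_switch_master(message):
    # +switch-master payload: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
    name, old_ip, _old_port, _new_ip, _new_port = message["data"].split()
    if name != SERVICE_NAME:
        return

    # Ask the sentinels who the new master is (second bullet above).
    new_ip, _port = sentinel.sentinel_get_master_addr_by_name(SERVICE_NAME)
    new_id = instance_id_for_ip(new_ip)
    tag_role(new_id, "master")
    move_between_elbs(new_id, add_to=WRITE_ELB, remove_from=READ_ELB)

    # The old master may be gone entirely; if it still exists, demote it.
    try:
        old_id = instance_id_for_ip(old_ip)
        tag_role(old_id, "slave")
        move_between_elbs(old_id, add_to=READ_ELB, remove_from=WRITE_ELB)
    except Exception:
        pass


pubsub = sentinel.pubsub()
pubsub.subscribe("+switch-master")
for msg in pubsub.listen():
    if msg["type"] == "message":
        handle_switch_master(msg)
```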

Problem

Since slaves can come and go several times a day depending on traffic, we sometimes end up with the sentinels of both clusters fighting over the same slave. This is because the IP pool is shared between the clusters and, as far as we know, sentinels identify slaves by their IP addresses.

Here's how to reproduce it:

  1. The cache cluster has a master with IP 172.24.249.152.
  2. The cache cluster fails over to a new master with IP 172.24.246.142. The node with IP 172.24.249.152 is now down.
  3. The metadata cluster scales up and DHCP assigns IP 172.24.249.152 (the previous cache cluster master) to the new node.
  4. The cache cluster sees that its previous master is back up and tries to reconfigure it as slaveof 172.24.246.142 (the new cache cluster master).
  5. The metadata cluster fires a +sdown on 172.24.246.142, and after a while a -sdown followed by a +slave-reconf-sent to it, trying to reconfigure it as a slave of the metadata cluster.
  6. The cache cluster tries to do the same as the metadata cluster did in step 5, and the two clusters keep fighting over the node (illustrated in the sketch below).
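
In effect, the two sets of sentinels keep sending the same node conflicting SLAVEOF commands, roughly equivalent to the following (the metadata master IP is made up for illustration):

```python
import redis

# The contested node: the new metadata slave that inherited the cache cluster's old master IP.
contested = redis.Redis(host="172.24.249.152", port=6379)

# What the cache cluster's sentinels keep telling it (step 4):
contested.slaveof("172.24.246.142", 6379)

# What the metadata cluster's sentinels tell it shortly afterwards
# (the metadata master IP below is purely illustrative):
contested.slaveof("172.24.250.10", 6379)
```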

The sentinels get stuck in this endless loop, fighting over the node forever. This happens even when we have a single sentinel group managing both Redis clusters under different master names. This leads us to believe that sentinels do not share knowledge of nodes across clusters; they simply do what seems logical for each cluster in isolation.

The solutions we tried

  • Running a SENTINEL reset mymaster after the +sdown event, to make the sentinels forget about the node. The problem is that this can create a race condition if the cluster happens to be failing over to another node at the time. We reproduced exactly this scenario and it left the sentinels out of sync, with one pointing at one master and the other two pointing at another (a sketch of this workaround follows the list).
  • Splitting the network into separate IP pools, one per cluster. This works because IP addresses are then never reused across clusters, but it makes things much less agile and more complicated whenever we need a new cluster. This is the solution we ended up with, but we would like to avoid it if possible.
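
For reference, the first workaround was essentially the following (simplified sketch; the sentinel endpoint is illustrative, and note that SENTINEL reset only affects the sentinel that receives it, so it really has to be sent to each sentinel):

```python
import redis

# Illustrative endpoint.
sentinel = redis.Redis(host="sentinel-elb.internal", port=26379, decode_responses=True)

pubsub = sentinel.pubsub()
pubsub.subscribe("+sdown")

for msg in pubsub.listen():
    if msg["type"] != "message":
        continue
    # +sdown payload for a slave looks like:
    # "slave <ip>:<port> <ip> <port> @ <master-name> <master-ip> <master-port>"
    if msg["data"].startswith("slave "):
        # Wipe the sentinel's state for this service so the dead slave is forgotten.
        # Racy: if a failover is in flight the sentinels can end up out of sync,
        # which is exactly what we observed.
        sentinel.sentinel_reset("mymaster")
```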

Ideal solution(s)

  • Redis Sentinel providing a SENTINEL removeslave 172.24.246.142 mymaster command, which we could fire every time there is a +sdown event on a slave. This would make the cluster forget that the slave ever existed, without the side effects of SENTINEL reset mymaster.

  • Stop identifying slaves solely by their IP address. Perhaps add the Redis server's start timestamp, or some other token, so that a slave that went down and a new node that came back up with the same IP are not treated as the same instance.

Question

Can you guys think of any other solution that does not involve changing the Redis Sentinel code and does not require splitting IP pools between clusters?
