We had an issue recently where one of the replica nodes in our Redis cluster became unresponsive. Radix seemed to do a good job of detecting that the connections it had open were failing and cleaning them out of the pool, but since Redis was still reporting the node in the output of "CLUSTER SLOTS", it understandably kept the pool around.
This led to an issue where all the commands we were running via DoSecondary on that shard slowed down considerably. We have two replicas for each shard in our cluster, and because the pool for the unresponsive node was still around, Radix had a 50% chance of sending commands to it (which would inevitably fail after some timeout). We retry those failures, so we'd then have another 50% chance of getting the bad node again, and so on.
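As a rough back-of-the-envelope on the impact (an assumption, not something we measured): with two replicas, each attempt independently lands on the bad node with probability 1/2, so the number of timeouts a command absorbs before reaching the good node is geometric:

$$
E[\text{timeouts}] \;=\; \sum_{k=0}^{\infty} k \left(\tfrac{1}{2}\right)^{k} \tfrac{1}{2} \;=\; 1
$$

i.e. on average every DoSecondary call on that shard eats one full connection timeout, and the tail is unbounded.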
The improvement we're suggesting is updating the replica-selection code to prefer replicas with non-empty pools. I've created an implementation of this that, in some limited local testing, removed all the errors we were seeing by making Radix consistently choose the good node in this kind of scenario. If all nodes are bad (or their pools just happen to be empty at that moment), it effectively falls back to the previous behaviour of selecting one at random. A rough sketch of the idea is below.
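To make the selection rule concrete, here's a minimal, self-contained sketch in Go. The `replicaPool` type and `availConns` field are hypothetical stand-ins for radix's actual pool internals (the real change in the commit below works against those internals directly); the point is just the two-step selection: filter to replicas with non-empty pools, then fall back to a uniform random choice.

```go
package main

import (
	"fmt"
	"math/rand"
)

// replicaPool is a hypothetical stand-in for a per-node connection pool;
// radix's real internals differ. availConns approximates "non-empty pool":
// the number of healthy connections currently ready to use.
type replicaPool struct {
	addr       string
	availConns int
}

// pickReplica prefers replicas whose pools currently hold at least one
// available connection. If none do, it falls back to picking uniformly
// at random across all replicas, matching the previous behaviour.
// Assumes replicas is non-empty.
func pickReplica(replicas []replicaPool) replicaPool {
	var healthy []replicaPool
	for _, r := range replicas {
		if r.availConns > 0 {
			healthy = append(healthy, r)
		}
	}
	if len(healthy) == 0 {
		healthy = replicas
	}
	return healthy[rand.Intn(len(healthy))]
}

func main() {
	// One replica's pool has been drained by failed connections; the other is fine.
	replicas := []replicaPool{
		{addr: "10.0.0.1:6379", availConns: 0}, // unresponsive node, pool emptied
		{addr: "10.0.0.2:6379", availConns: 4}, // healthy node
	}
	fmt.Println("chosen:", pickReplica(replicas).addr)
}
```

A nice property of this shape is that under normal operation, when every pool is non-empty, the behaviour is unchanged apart from a single scan over the replica list.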
The commit with the changes can be found here: woodsbury@62c0519
I wanted to present it here before creating a pull request in case there are any issues you can see with the general concept. The current changes access the connections inside a pool directly, as that was the easiest way to implement it, but there may be ways to clean that up.