Connection manager failure after configuration of RT replication #474

@cmeiklejohn

Description

@cmeiklejohn

I've identified another issue using basho/riak_test#470:

If you remove the wait_for_ring_convergence call after enabling and starting realtime replication, you can run into what appears to be the following situation:

1. Node 1 is the leader and knows about cluster B.
2. Node 1's connection manager is killed, which also triggers a restart of the cluster manager because of the rest_for_all supervision configuration.
3. Node 1 comes back online and is re-elected as leader (but it appears as a new election, since the cluster manager has just started for the first time, triggering the notify fun to be called immediately after registration through riak_repl2_leader).
4. Node 1's REPL ring contains no information about the remote clusters it was previously connected to.
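The race described above can be sketched as a toy model. This is a hypothetical, simplified illustration only: the class and function names (Ring, ClusterManager, on_leader_elected, restart_and_elect) do not correspond to actual riak_repl modules, and it assumes the notify fun repopulates the manager's state from whatever ring it can see at the moment it fires.

```python
# Simplified model of the restart/re-election sequence described above.
# All names here are illustrative, not real riak_repl APIs.

class Ring:
    """Stands in for the REPL ring metadata, which may lag behind reality."""
    def __init__(self, remote_clusters):
        self.remote_clusters = list(remote_clusters)

class ClusterManager:
    def __init__(self, ring):
        # On (re)start the manager knows nothing until the notify fun fires.
        self.remotes = []
        self.ring = ring

    def on_leader_elected(self):
        # The notify fun repopulates state from the ring visible *right now*.
        self.remotes = list(self.ring.remote_clusters)

def restart_and_elect(ring_at_restart):
    mgr = ClusterManager(ring_at_restart)
    mgr.on_leader_elected()  # fires immediately after registration
    return mgr.remotes

# Converged ring at restart: knowledge of cluster B is recovered.
assert restart_and_elect(Ring(["cluster_B"])) == ["cluster_B"]
# Ring not yet converged at restart: knowledge of cluster B is lost.
assert restart_and_elect(Ring([])) == []
```

In this model, whether the remote-cluster list survives the restart depends entirely on what the ring contains when the notify fun fires, which is consistent with the observation that adding wait_for_ring_convergence masks the problem.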

Both @kellymclaughlin and I have been able to reproduce this with the test by removing the wait calls, but the actual root cause remains unclear. We've decided to hold this issue back and ship 1.4.4 without attempting to fix it.

cc: @Vagabond @metadave @jonmeredith @jaredmorrow
