Connection manager failure after configuration of RT replication #474

@cmeiklejohn

Description

@cmeiklejohn

I've identified another issue using basho/riak_test#470:

If you remove the wait_for_ring_convergence call after enabling and starting realtime replication, you can run into what appears to be the following situation:

1. Node 1 is the leader and knows about cluster B.
2. Node 1's connection manager is killed, which also triggers a restart of the cluster manager because of the rest_for_all supervision configuration.
3. Node 1 comes back online and is re-elected as leader (but it appears as a new election, since the cluster manager has just started for the first time, triggering the notify fun to be called immediately after registration through riak_repl2_leader).
4. Node 1's REPL ring contains no information about the remote clusters it was previously connected to.
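The race described above can be sketched as a toy model. This is a hypothetical, simplified illustration only: the class and function names (Ring, ClusterManager, on_leader_elected, restart_and_elect) do not correspond to actual riak_repl modules, and it assumes the notify fun repopulates the manager's state from whatever ring it can see at the moment it fires.

```python
# Simplified model of the restart/re-election sequence described above.
# All names here are illustrative, not real riak_repl APIs.

class Ring:
    """Stands in for the REPL ring metadata, which may lag behind reality."""
    def __init__(self, remote_clusters):
        self.remote_clusters = list(remote_clusters)

class ClusterManager:
    def __init__(self, ring):
        # On (re)start the manager knows nothing until the notify fun fires.
        self.remotes = []
        self.ring = ring

    def on_leader_elected(self):
        # The notify fun repopulates state from the ring visible *right now*.
        self.remotes = list(self.ring.remote_clusters)

def restart_and_elect(ring_at_restart):
    mgr = ClusterManager(ring_at_restart)
    mgr.on_leader_elected()  # fires immediately after registration
    return mgr.remotes

# Converged ring at restart: knowledge of cluster B is recovered.
assert restart_and_elect(Ring(["cluster_B"])) == ["cluster_B"]
# Ring not yet converged at restart: knowledge of cluster B is lost.
assert restart_and_elect(Ring([])) == []
```

In this model, whether the remote-cluster list survives the restart depends entirely on what the ring contains when the notify fun fires, which is consistent with the observation that adding wait_for_ring_convergence masks the problem.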

Both @kellymclaughlin and I have been able to reproduce this with the test by removing the wait calls, but the actual root cause remains unclear. We've decided to hold this issue back and ship 1.4.4 without attempting to fix it.

cc: @Vagabond @metadave @jonmeredith @jaredmorrow
