Node unable to persist log, but keeps being elected #614

@k-jingyang

Hello,

Recently, we encountered a case where the leader node had a bad persistent store:

This cycle caused unstable leadership for its duration. For us, it persisted for 10 minutes until another node was finally elected leader.

Are there recommendations or good practices for handling such cases, given that HashiCorp runs its own cloud offerings too?

Also, is there an optimisation that could be made here in the library? I understand there are some nuances to this.

  • Based on my understanding, the current defence against this is that heartbeat timeouts are randomised.
  • Given a cluster: node A (leader), node B, node C:
    • When node A demotes itself, because of the randomness, node C has a chance to time out earlier and become a candidate before node A does.
    • However, if node B hasn't timed out, it will still think that node A is the leader, and will always reject node C's vote request.
    • In such cases, node A has a natural advantage in winning elections. This is undesirable when node A has a persistent store issue.
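
To make the reasoning in the list above concrete, here is a minimal sketch in Go of the vote rule as I understand it. The `follower` type and `grantsVote` method are hypothetical names for illustration only, not the library's internals, and term and log up-to-date checks are deliberately left out.

```go
package main

import "fmt"

// follower is a hypothetical, simplified model of one node's view of the
// cluster. It is not the library's internal state.
type follower struct {
	name        string
	knownLeader string // "" once the heartbeat timeout has elapsed
}

// grantsVote applies the rule described above: a follower rejects a vote
// request while it still believes a different node is the leader.
// Assumption: term and log comparisons are ignored in this sketch.
func (f follower) grantsVote(candidate string) bool {
	return f.knownLeader == "" || f.knownLeader == candidate
}

func main() {
	// Node B has not timed out yet, so it still believes A is the leader.
	b := follower{name: "B", knownLeader: "A"}

	fmt.Println("B votes for C:", b.grantsVote("C")) // false
	fmt.Println("B votes for A:", b.grantsVote("A")) // true

	// Once B's heartbeat timeout elapses, it forgets the leader.
	b.knownLeader = ""
	fmt.Println("B votes for C after timeout:", b.grantsVote("C")) // true
}
```

Under this model, node C can only win if node B has also timed out, whereas node A can collect node B's vote at any point, which is the asymmetry I'm describing.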
