Hello,
Recently, we hit a case where the leader node had a bad persistent store:
- The leader node is unable to persist logs -> demotes itself -> starts an election -> becomes leader again -> is unable to persist logs -> ... (the cycle repeats)
- Demotion is caused by https://github.com/hashicorp/raft/blob/main/raft.go#L1271
This cycle caused unstable leadership; in our case it persisted for about 10 minutes until another node was finally elected leader.
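For reference, here is a minimal sketch of the kind of store failure that triggers this cycle. `FlakyLogStore` and `errDiskBroken` are hypothetical names used only for illustration; the wrapper simply makes log appends fail, which drives the leader down the demotion path linked above:

```go
package repro

import (
	"errors"

	"github.com/hashicorp/raft"
)

// errDiskBroken simulates a persistent store that can no longer accept writes.
var errDiskBroken = errors.New("simulated persistent store failure")

// FlakyLogStore wraps any raft.LogStore and fails every append while
// `failing` is set, which is enough to push a leader into the
// demote -> re-elect -> demote cycle described above.
type FlakyLogStore struct {
	raft.LogStore // reads (FirstIndex, LastIndex, GetLog) pass through
	failing       bool
}

func (s *FlakyLogStore) StoreLog(l *raft.Log) error {
	if s.failing {
		return errDiskBroken
	}
	return s.LogStore.StoreLog(l)
}

func (s *FlakyLogStore) StoreLogs(logs []*raft.Log) error {
	if s.failing {
		return errDiskBroken
	}
	return s.LogStore.StoreLogs(logs)
}
```

Wrapping the leader's real store with this and flipping `failing` to true should reproduce the leadership flapping we observed.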
Are there any recommendations or good practices for handling such cases, given that HashiCorp runs its own cloud offerings too?
Also, is there an optimisation we could make here in the library? I understand there are some nuances to this.
- Based on my understanding, the current defence against this is that heartbeat timeouts have a degree of randomness (see the sketch after this list).
- Given a cluster: node A (leader), node B, node C:
- When node A demotes itself, because of this randomness, node C has a chance to time out earlier and become a candidate before node A does.
- However, if node B doesn't time out, it still thinks node A is the leader and will reject node C's vote request.
- In such cases, node A has a natural advantage in winning elections, which is undesirable when node A has a persistent store issue.
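To make the randomness point concrete, below is a simplified sketch of how the library derives election timeouts (modelled on the `randomTimeout` helper in util.go; treat the exact details as an approximation):

```go
package repro

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a value in [minVal, 2*minVal), assuming minVal > 0.
// Every node draws from the same window, so after node A steps down it is just
// as likely as node C to reach candidacy first; and while node B's timer has
// not expired, B keeps rejecting node C's vote request.
func randomElectionTimeout(minVal time.Duration) time.Duration {
	extra := time.Duration(rand.Int63()) % minVal
	return minVal + extra
}
```

In other words, the jitter only gives node C a chance; it does nothing to bias the election away from a node whose persistent store is failing.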