@KyrinCode
Conductor Failover Monitor

Overview

conductor-failover-monitor.py is a monitoring daemon that watches the health of voting nodes in an op-conductor cluster and automatically promotes a non-voter to voter when all voters become unhealthy. This provides an automated disaster recovery mechanism for sequencer clusters.

Problem Statement

In an op-conductor HA cluster, if all voting nodes become unhealthy simultaneously (e.g., due to a regional outage or infrastructure failure), the cluster loses quorum and cannot elect a new leader. An operator must then promote a non-voter by hand, which prolongs downtime.

Solution

This monitor continuously checks the health of all voter nodes. When it detects that all voters are unhealthy, it automatically:

  1. Floods conductor_addServerAsVoter requests to all voter nodes (in case one recovers and can process the request)
  2. Promotes the first configured non-voter to become a voter
  3. Verifies the promoted node becomes the leader
  4. Exits successfully after failover completion
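
The steps above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function names and the injected callables (`is_healthy`, `promote`, `is_leader`) are hypothetical stand-ins for the real RPC calls, and `promote` is assumed to cover both the flooding of voters and the promotion request. The default values mirror the command line options listed further down.

```python
import time
from typing import Callable, Optional, Sequence


def all_voters_unhealthy(voters: Sequence[str],
                         is_healthy: Callable[[str], bool]) -> bool:
    """True only when there is at least one voter and none is healthy."""
    return bool(voters) and all(not is_healthy(v) for v in voters)


def promote_with_retries(promote: Callable[[], None],
                         is_leader: Callable[[], bool],
                         max_retries: int = 30,
                         retry_interval: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Retry promotion until the promoted node leads.

    Returns the attempt number that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        promote()          # hypothetical: flood voters + promote the non-voter
        if is_leader():    # hypothetical: check the promoted node's leadership
            return attempt
        sleep(retry_interval)
    return None


def monitor(voters: Sequence[str],
            is_healthy: Callable[[str], bool],
            promote: Callable[[], None],
            is_leader: Callable[[], bool],
            interval: float = 10.0,
            max_retries: int = 30,
            retry_interval: float = 2.0,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll voter health, fail over once all voters are down, then return."""
    while not all_voters_unhealthy(voters, is_healthy):
        sleep(interval)
    attempt = promote_with_retries(promote, is_leader,
                                   max_retries, retry_interval, sleep)
    return attempt is not None
```

Passing `sleep` and the health and promotion callables as parameters keeps the loop testable without a live cluster; a real daemon would bind them to the conductor RPC endpoints from config.toml.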

Configuration

The monitor uses the same config.toml format as op-conductor-ops.

Usage

poetry install
poetry run python conductor-failover-monitor.py -v

Command Line Options

Option                      Default         Description
-c, --config                ./config.toml   Path to configuration file
-i, --interval              10              Health check interval in seconds
--promote-retry-interval    2               Retry interval during promotion in seconds
--max-retries               30              Maximum promotion retry attempts
-v, --verbose               false           Enable debug logging
--cert                      -               SSL certificate file path
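
For reference, the options above could be declared with `argparse` along these lines. This is a sketch under assumptions, not the script's actual parser: the function name `build_parser` is hypothetical, and the argument types (float for the intervals, int for the retry count) are guesses based on the defaults shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (types are assumed)."""
    parser = argparse.ArgumentParser(
        prog="conductor-failover-monitor.py",
        description="Promote a non-voter when all op-conductor voters are unhealthy.",
    )
    parser.add_argument("-c", "--config", default="./config.toml",
                        help="Path to configuration file")
    parser.add_argument("-i", "--interval", type=float, default=10.0,
                        help="Health check interval in seconds")
    parser.add_argument("--promote-retry-interval", type=float, default=2.0,
                        help="Retry interval during promotion in seconds")
    parser.add_argument("--max-retries", type=int, default=30,
                        help="Maximum promotion retry attempts")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable debug logging")
    parser.add_argument("--cert", default=None,
                        help="SSL certificate file path")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, `--promote-retry-interval` is exposed as `args.promote_retry_interval`, since `argparse` converts dashes in long option names to underscores.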
