docs: Add comprehensive guides for cluster scale operations and maint…#812

Draft
ksaur wants to merge 1 commit into NVIDIA:main from ksaur:docs/cluster-scale-operations-guide

ksaur commented Feb 6, 2026

Summary

This PR adds comprehensive documentation to address circuit breaker trips during legitimate cluster operations such as node scale-up, initial GPU bringup, autoscaling events, and maintenance scenarios.

Problem Statement

Currently, NVSentinel's circuit breaker trips during legitimate operations when GPU nodes appear "unhealthy" during their initialization phase. This occurs because:

  1. Services inside the cluster cannot distinguish between scale-up and failure: A controller running in-cluster sees nodes coming online but doesn't know if they're new nodes or existing nodes recovering from failure.

  2. Circuit breaker trips during common scenarios:

    • Initial cluster bringup (CPU-only → add GPU nodes)
    • Autoscaling events (Karpenter, Cluster Autoscaler)
    • Manual capacity expansion
    • Multi-zone node additions
    • Driver/GPU Operator upgrades
    • Node maintenance operations (OSMO, hardware work)
  3. Current workaround is manual and error-prone: Operators must remember to label nodes with k8saas.nvidia.com/ManagedByNVSentinel=false before an operation and to remove the label afterward. The removal step is often forgotten under time pressure.
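
Because a forgotten label leaves a node excluded indefinitely, a periodic check for leftover labels can help. The following is a minimal sketch, not part of this PR: the function name `list_unmanaged_nodes` and the `KUBECTL` override variable are hypothetical, and only the label key comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: names below are illustrative, not part of NVSentinel.
set -eu

LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"
# KUBECTL may be overridden (for example KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Print every node that is still excluded from NVSentinel management.
list_unmanaged_nodes() {
  $KUBECTL get nodes -l "${LABEL_KEY}=false" -o name
}
```

Calling `list_unmanaged_nodes` at the end of a maintenance window (or from a scheduled job) prints the nodes that still carry the label, so an operator can decide whether each exclusion is still intended.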

Related: HIPPO-5214 (Circuit breaker tripped during EKS cluster bringup)

Solution

This PR provides comprehensive documentation following a phased approach:

Phase 1: Enhanced Documentation (This PR)

Create detailed runbooks and guides that operators can use immediately:

  1. Cluster Scale Operations Runbook - Step-by-step procedures for preventing circuit breaker trips during node scale-up and bringup
  2. Maintenance Operations Guide - Comprehensive guide covering all maintenance scenarios with best practices
  3. ADR-027 - Architecture decision record proposing future automation paths

Phase 2 & 3: Future Automation (Documented in ADR)

  • Phase 2: Orchestration layer integration (runtime controllers, GitOps)
  • Phase 3: Optional admission controller for user-managed autoscaling
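
As a rough illustration of what an orchestration step could do before automation exists, the label can be applied to a whole group of nodes by selector. This is a sketch under assumptions: the function names, the `KUBECTL` override, and the example selector `pool=gpu-b` are hypothetical and do not come from the ADR.

```shell
#!/usr/bin/env bash
# Sketch only: names and the example selector are illustrative.
set -eu

LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"
# KUBECTL may be overridden (for example KUBECTL=echo) to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Exclude every node matching a label selector from NVSentinel management.
exclude_nodes() {
  $KUBECTL label nodes -l "$1" "${LABEL_KEY}=false" --overwrite
}

# Re-enable management for every node matching the selector.
include_nodes() {
  $KUBECTL label nodes -l "$1" "${LABEL_KEY}-"
}
```

For example, `exclude_nodes 'pool=gpu-b'` before an expansion and `include_nodes 'pool=gpu-b'` after the nodes are verified. A selector only matches nodes that already exist when the command runs, so nodes that register later are not covered; that gap is the kind of case the later phases are meant to address.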

Changes

New Documentation

  • docs/runbooks/cluster-scale-operations.md: Detailed guide for scale operations

    • Why circuit breaker trips during scale-up
    • Step-by-step procedures with ManagedByNVSentinel label pattern
    • Common scenarios (initial bringup, autoscaling, multi-zone)
    • Integration examples (Terraform, Ansible, GitOps)
    • Troubleshooting and quick reference
  • docs/runbooks/maintenance-operations.md: Comprehensive maintenance guide

    • 7 operation types covered (scale-up, driver upgrades, OSMO, hardware, infrastructure, GPU Operator updates, testing)
    • Best practices (before, during, after maintenance)
    • Automation strategies
    • Decision tree for when labeling is needed
    • Quick reference commands
  • docs/designs/027-automated-node-labeling.md: ADR for automation

    • Root cause analysis (information asymmetry problem)
    • Phased approach with rationale
    • Implementation details with code examples
    • Alternatives considered and rejected

Updated Documentation

  • docs/runbooks/circuit-breaker.md: Added common scenarios section and cross-references
  • docs/runbooks/README.md: Reorganized with Maintenance Operations section at top
  • README.md: Added comprehensive Documentation section with quick reference

Key Pattern Documented

The core pattern for all maintenance operations:

# 1. Before operation: Disable NVSentinel management
kubectl label node <NODE_NAME> k8saas.nvidia.com/ManagedByNVSentinel=false

# 2. Perform operation (scale-up, upgrade, maintenance, etc.)

# 3. Verify node health

# 4. After operation: Re-enable NVSentinel management
kubectl label node <NODE_NAME> k8saas.nvidia.com/ManagedByNVSentinel-
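
The four steps above can be wrapped in a small helper so that the re-enable step is not left to memory. This is a minimal sketch, not the runbook's actual script: `run_with_nvsentinel_disabled`, `KUBECTL`, and `READY_TIMEOUT` are hypothetical names, and waiting for the node's `Ready` condition is only an assumed stand-in for "verify node health" (GPU-level checks may also be needed).

```shell
#!/usr/bin/env bash
# Sketch only: names below are illustrative, not part of NVSentinel.
set -eu

LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"
# KUBECTL may be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
READY_TIMEOUT="${READY_TIMEOUT:-600s}"

# Usage: run_with_nvsentinel_disabled <node> <command> [args...]
run_with_nvsentinel_disabled() {
  local node="$1"
  shift

  # 1. Exclude the node; --overwrite makes this safe if the label is already set.
  $KUBECTL label node "$node" "${LABEL_KEY}=false" --overwrite

  # 2. Run the operation, then 3. wait for the node to report Ready.
  if "$@" && $KUBECTL wait --for=condition=Ready "node/${node}" --timeout="$READY_TIMEOUT"; then
    # 4. Re-enable management by removing the label.
    $KUBECTL label node "$node" "${LABEL_KEY}-"
  else
    # Assumption: an unverified node stays excluded and the operator is told.
    echo "WARNING: ${node} not verified healthy; ${LABEL_KEY}=false left in place" >&2
    return 1
  fi
}
```

A call might look like `run_with_nvsentinel_disabled gpu-node-1 ./upgrade-driver.sh`, where the script name is a placeholder. Leaving the label in place on failure is a design choice in this sketch: it avoids re-enabling management on a node still in a transitional state, at the cost of requiring a follow-up to remove the label.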

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: bringup and scaling

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

…enance

Add new documentation to address circuit breaker trips during legitimate
cluster operations such as node scale-up, initial GPU bringup, and
maintenance scenarios.

New Documentation:
- docs/runbooks/cluster-scale-operations.md: Detailed guide for preventing
  circuit breaker trips during node scale-up and initial cluster bringup
  (CPU → GPU transitions). Covers autoscaling, multi-zone expansion, and
  integration with IaC tools (Terraform, Ansible, GitOps).

- docs/runbooks/maintenance-operations.md: Comprehensive guide covering all
  maintenance scenarios (scale operations, driver upgrades, OSMO, hardware
  maintenance, infrastructure work, etc.). Includes best practices, automation
  strategies, troubleshooting, and decision tree for when labeling is needed.

- docs/designs/027-automated-node-labeling.md: ADR proposing phased approach
  for automating node labeling during scale operations. Phase 1: enhanced docs,
  Phase 2: orchestration layer integration, Phase 3: optional admission
  controller.

Updated Documentation:
- docs/runbooks/circuit-breaker.md: Added common scenarios section covering
  scale operations, driver upgrades, and maintenance. Added cross-references
  to new guides.

- docs/runbooks/README.md: Reorganized with Maintenance Operations section,
  highlighting new guides as essential reading.

- README.md: Added comprehensive Documentation section with links to all
  operational guides and quick reference for the ManagedByNVSentinel label
  pattern.

Addresses Issue:
Circuit breaker trips during legitimate operations when GPU nodes appear
unhealthy during initialization. Solution uses k8saas.nvidia.com/ManagedByNVSentinel=false
label to temporarily exclude nodes from NVSentinel management during transitional
states.

Related: HIPPO-5214

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Feb 6, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

ksaur commented Feb 6, 2026

This is a brainstorm of ideas to start a discussion. It needs to be pared waaaaay back and edited heavily.
