docs: Add comprehensive guides for cluster scale operations and maint… by ksaur · Pull Request #812 · NVIDIA/NVSentinel

ksaur · 2026-02-06T22:17:42Z

Summary

This PR adds comprehensive documentation to address circuit breaker trips during legitimate cluster operations such as node scale-up, initial GPU bringup, autoscaling events, and maintenance scenarios.

Problem Statement

Currently, NVSentinel's circuit breaker trips during legitimate operations when GPU nodes appear "unhealthy" during their initialization phase. This occurs because:

Services inside the cluster cannot distinguish between scale-up and failure: A controller running in-cluster sees nodes coming online but doesn't know if they're new nodes or existing nodes recovering from failure.
Circuit breaker trips during common scenarios:
- Initial cluster bringup (CPU-only → add GPU nodes)
- Autoscaling events (Karpenter, Cluster Autoscaler)
- Manual capacity expansion
- Multi-zone node additions
- Driver/GPU Operator upgrades
- Node maintenance operations (OSMO, hardware work)
Current workaround is manual and error-prone: Operators must remember to manually label nodes with k8saas.nvidia.com/ManagedByNVSentinel=false before operations, then remove the label afterward. This step is often forgotten in time-pressured situations.
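Since a forgotten label leaves a node outside NVSentinel management indefinitely, a periodic check for nodes that still carry it may help. The following is a minimal sketch, not part of this PR: the function name `list_unmanaged_nodes` is hypothetical, and it assumes `kubectl` is already configured for the target cluster.

```shell
#!/usr/bin/env bash
# Label key taken from the PR description.
LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"

# Hypothetical helper: print every node still labeled as excluded from
# NVSentinel management, one per line, in the form node/<name>.
list_unmanaged_nodes() {
  kubectl get nodes -l "${LABEL_KEY}=false" -o name
}
```

Running `list_unmanaged_nodes` after a maintenance window should print nothing if every label was removed.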

Related: HIPPO-5214 (Circuit breaker tripped during EKS cluster bringup)

Solution

This PR provides comprehensive documentation following a phased approach:

Phase 1: Enhanced Documentation (This PR)

Create detailed runbooks and guides that operators can use immediately:

Cluster Scale Operations Runbook - Step-by-step procedures for preventing circuit breaker trips during node scale-up and bringup
Maintenance Operations Guide - Comprehensive guide covering all maintenance scenarios with best practices
ADR-027 - Architecture decision record proposing future automation paths

Phase 2 & 3: Future Automation (Documented in ADR)

Phase 2: Orchestration layer integration (runtime controllers, GitOps)
Phase 3: Optional admission controller for user-managed autoscaling

Changes

New Documentation

docs/runbooks/cluster-scale-operations.md: Detailed guide for scale operations
- Why circuit breaker trips during scale-up
- Step-by-step procedures with ManagedByNVSentinel label pattern
- Common scenarios (initial bringup, autoscaling, multi-zone)
- Integration examples (Terraform, Ansible, GitOps)
- Troubleshooting and quick reference
docs/runbooks/maintenance-operations.md: Comprehensive maintenance guide
- 7 operation types covered (scale-up, driver upgrades, OSMO, hardware, infrastructure, GPU Operator updates, testing)
- Best practices (before, during, after maintenance)
- Automation strategies
- Decision tree for when labeling is needed
- Quick reference commands
docs/designs/027-automated-node-labeling.md: ADR for automation
- Root cause analysis (information asymmetry problem)
- Phased approach with rationale
- Implementation details with code examples
- Alternatives considered and rejected

Updated Documentation

docs/runbooks/circuit-breaker.md: Added common scenarios section and cross-references
docs/runbooks/README.md: Reorganized with Maintenance Operations section at top
README.md: Added comprehensive Documentation section with quick reference

Key Pattern Documented

The core pattern for all maintenance operations:

# 1. Before operation: Disable NVSentinel management
kubectl label node <NODE_NAME> k8saas.nvidia.com/ManagedByNVSentinel=false

# 2. Perform operation (scale-up, upgrade, maintenance, etc.)

# 3. Verify node health

# 4. After operation: Re-enable NVSentinel management
kubectl label node <NODE_NAME> k8saas.nvidia.com/ManagedByNVSentinel-
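The four steps above could be wrapped in one function so that the label is removed even when the operation fails. This is a sketch under stated assumptions rather than anything shipped in this PR: the function name `with_nvsentinel_paused`, the use of the node `Ready` condition as the health check, and the 10 minute timeout are all illustrative choices.

```shell
#!/usr/bin/env bash
# Label key taken from the PR description.
LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"

# Hypothetical wrapper: with_nvsentinel_paused <node> <command> [args...]
with_nvsentinel_paused() {
  local node="$1"; shift
  local status=0

  # 1. Before operation: disable NVSentinel management.
  #    --overwrite avoids an error if the label already has a value.
  kubectl label node "$node" "${LABEL_KEY}=false" --overwrite || return 1

  # 2. Perform the operation, remembering its exit status.
  "$@" || status=$?

  # 3. Verify node health (assumption: Ready is a sufficient signal).
  if [ "$status" -eq 0 ]; then
    kubectl wait --for=condition=Ready "node/${node}" --timeout=10m || status=$?
  fi

  # 4. After operation: re-enable NVSentinel management, even on failure.
  kubectl label node "$node" "${LABEL_KEY}-" || status=$?

  return "$status"
}
```

A call might look like `with_nvsentinel_paused gpu-node-1 ./upgrade-driver.sh`, where both the node name and the script are placeholders. Removing the label after a failed operation is a design choice: it hands the node back to NVSentinel in whatever state it is in. Leaving the label in place and alerting an operator is an equally reasonable alternative.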

Summary

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

…enance

Add new documentation to address circuit breaker trips during legitimate cluster operations such as node scale-up, initial GPU bringup, and maintenance scenarios.

New Documentation:
- docs/runbooks/cluster-scale-operations.md: Detailed guide for preventing circuit breaker trips during node scale-up and initial cluster bringup (CPU → GPU transitions). Covers autoscaling, multi-zone expansion, and integration with IaC tools (Terraform, Ansible, GitOps).
- docs/runbooks/maintenance-operations.md: Comprehensive guide covering all maintenance scenarios (scale operations, driver upgrades, OSMO, hardware maintenance, infrastructure work, etc.). Includes best practices, automation strategies, troubleshooting, and decision tree for when labeling is needed.
- docs/designs/027-automated-node-labeling.md: ADR proposing phased approach for automating node labeling during scale operations. Phase 1: enhanced docs, Phase 2: orchestration layer integration, Phase 3: optional admission controller.

Updated Documentation:
- docs/runbooks/circuit-breaker.md: Added common scenarios section covering scale operations, driver upgrades, and maintenance. Added cross-references to new guides.
- docs/runbooks/README.md: Reorganized with Maintenance Operations section, highlighting new guides as essential reading.
- README.md: Added comprehensive Documentation section with links to all operational guides and quick reference for the ManagedByNVSentinel label pattern.

Addresses Issue: Circuit breaker trips during legitimate operations when GPU nodes appear unhealthy during initialization. Solution uses k8saas.nvidia.com/ManagedByNVSentinel=false label to temporarily exclude nodes from NVSentinel management during transitional states.

Related: HIPPO-5214

copy-pr-bot · 2026-02-06T22:17:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-02-06T22:17:52Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


ksaur · 2026-02-06T22:20:57Z

This is a brainstorm of ideas to start a discussion. It needs to be pared waaaaay back and edited heavily.

ksaur wants to merge 1 commit into NVIDIA:main from ksaur:docs/cluster-scale-operations-guide
