docs: Add comprehensive guides for cluster scale operations and maint…#812
Draft
ksaur wants to merge 1 commit into NVIDIA:main
…enance

Add new documentation to address circuit breaker trips during legitimate cluster operations such as node scale-up, initial GPU bringup, and maintenance scenarios.

New Documentation:
- docs/runbooks/cluster-scale-operations.md: Detailed guide for preventing circuit breaker trips during node scale-up and initial cluster bringup (CPU → GPU transitions). Covers autoscaling, multi-zone expansion, and integration with IaC tools (Terraform, Ansible, GitOps).
- docs/runbooks/maintenance-operations.md: Comprehensive guide covering all maintenance scenarios (scale operations, driver upgrades, OSMO, hardware maintenance, infrastructure work, etc.). Includes best practices, automation strategies, troubleshooting, and a decision tree for when labeling is needed.
- docs/designs/027-automated-node-labeling.md: ADR proposing a phased approach for automating node labeling during scale operations. Phase 1: enhanced docs. Phase 2: orchestration layer integration. Phase 3: optional admission controller.

Updated Documentation:
- docs/runbooks/circuit-breaker.md: Added common scenarios section covering scale operations, driver upgrades, and maintenance. Added cross-references to new guides.
- docs/runbooks/README.md: Reorganized with Maintenance Operations section, highlighting new guides as essential reading.
- README.md: Added comprehensive Documentation section with links to all operational guides and quick reference for the ManagedByNVSentinel label pattern.

Addresses Issue: Circuit breaker trips during legitimate operations when GPU nodes appear unhealthy during initialization.

Solution uses the k8saas.nvidia.com/ManagedByNVSentinel=false label to temporarily exclude nodes from NVSentinel management during transitional states.

Related: HIPPO-5214
This is a brainstorm of ideas to start a discussion. It needs to be pared waaaaay back and edited heavily.
Summary
This PR adds comprehensive documentation to address circuit breaker trips during legitimate cluster operations such as node scale-up, initial GPU bringup, autoscaling events, and maintenance scenarios.
Problem Statement
Currently, NVSentinel's circuit breaker trips during legitimate operations when GPU nodes appear "unhealthy" during their initialization phase. This occurs because:
Services inside the cluster cannot distinguish between scale-up and failure: A controller running in-cluster sees nodes coming online but doesn't know if they're new nodes or existing nodes recovering from failure.
Circuit breaker trips during common scenarios:
Current workaround is manual and error-prone: Operators must remember to manually label nodes with k8saas.nvidia.com/ManagedByNVSentinel=false before operations, then remove the label afterward. This step is often forgotten in time-pressured situations.

Related: HIPPO-5214 (Circuit breaker tripped during EKS cluster bringup)
Solution
This PR provides comprehensive documentation following a phased approach:
Phase 1: Enhanced Documentation (This PR)
Create detailed runbooks and guides that operators can use immediately:
Phase 2 & 3: Future Automation (Documented in ADR)
Changes
New Documentation
- docs/runbooks/cluster-scale-operations.md: Detailed guide for scale operations
  - ManagedByNVSentinel label pattern
- docs/runbooks/maintenance-operations.md: Comprehensive maintenance guide
- docs/designs/027-automated-node-labeling.md: ADR for automation

Updated Documentation
- docs/runbooks/circuit-breaker.md: Added common scenarios section and cross-references
- docs/runbooks/README.md: Reorganized with Maintenance Operations section at top
- README.md: Added comprehensive Documentation section with quick reference

Key Pattern Documented
The core pattern for all maintenance operations:
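A minimal sketch of that pattern, assuming kubectl access to the cluster. The label key and the value `false` come from the description above. The helper names (`exclude_node`, `include_node`), the `KUBECTL` override, and the node name in the usage note are hypothetical and only illustrate the label-before, unlabel-after sequence.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the KUBECTL override are hypothetical,
# not part of NVSentinel or the runbooks in this PR.
set -euo pipefail

LABEL_KEY="k8saas.nvidia.com/ManagedByNVSentinel"
# Override for a dry run, e.g. KUBECTL="echo kubectl".
KUBECTL="${KUBECTL:-kubectl}"

exclude_node() {
  # Label the node before the operation so NVSentinel stops managing it.
  $KUBECTL label node "$1" "${LABEL_KEY}=false" --overwrite
}

include_node() {
  # A trailing '-' removes the label, returning the node to NVSentinel management.
  $KUBECTL label node "$1" "${LABEL_KEY}-"
}
```

Usage would look like `exclude_node gpu-node-1`, then the scale-up or maintenance work, then `include_node gpu-node-1` once the node is healthy. Wrapping both steps in the same automation that drives the operation is one way to avoid forgetting the second step.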
Summary
Type of Change
Component(s) Affected
Testing
Checklist