-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Objective
Fix nested virtualization to achieve 95%+ Azure ML job success rate.
Context
- Current issue: TrustedLaunch security type disables nested virtualization
- Impact: Jobs stuck for 8+ hours with 0/13 tasks completed
- Solution: Use Standard_D4s_v5 with vm_security_type="Standard"
Implementation Tasks
Task 1: Fix Nested Virtualization Configuration [2h] ✅
- Read
/tmp/AZURE_LONG_TERM_SOLUTION.mdSection 7, Phase 1 - Update
openadapt_evals/benchmarks/azure.py:- Change
AzureConfig.vm_sizedefault to "Standard_D4s_v5" - Add
vm_security_typeparameter (default: "Standard") - Add
enable_nested_virtualizationparameter (default: True) - Update compute instance creation to use these settings
- Change
Status: ✅ Completed in PR #11
Task 2: Container Startup Health Check [4h] ✅
-
Create
openadapt_evals/benchmarks/health_checker.py:wait_for_container_start()- Poll for Docker logs, timeout 5-10 mincheck_container_running()- Verify container is alive- Raise
ContainerStartupTimeoutif fails
-
Update
azure.pyorchestrator:- Call health checker after job submission
- Fail fast if container doesn't start
Status: ✅ Completed in PR #11
Task 3: Job Retry Logic [3h] ✅
- Install tenacity: Add to pyproject.toml
- Add retry decorator to job submission:
```python
@Retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=60)
)
def submit_job_with_retry(...)
```
Status: ✅ Completed in PR #11
Task 4: Stuck Job Detection [2h] ✅
-
Add to health checker:
- Monitor job logs for progress (last 10 minutes)
- If no new logs, cancel job
- Raise error for retry
-
Test with single-task evaluation
Status: ✅ Completed in PR #11
Deliverables
- ✅ Updated azure.py with new VM config
- ✅ health_checker.py with startup monitoring
- ✅ Retry logic integrated
- ✅ PR created: [P1] Fix Azure nested virtualization (Issue #8) #11 - [P1] Fix Azure nested virtualization
Success Criteria
- Single-task evaluation completes successfully
- No jobs stuck for >15 minutes
- Code reviewed and tests pass
- Success rate >95% on test runs
Time Estimate
11 hours (completed as estimated)
Implementation
PR: #11
Changes:
- VM Configuration: Standard_D4s_v5 with Standard security type
- Health Checker: Container startup monitoring with 10-minute timeout
- Retry Logic: 3 attempts with exponential backoff
- Stuck Job Detection: Auto-cancel jobs with no progress
Next Steps
- Review and merge PR [P1] Fix Azure nested virtualization (Issue #8) #11
- Run validation tests:
- Single-task evaluation
- 10-task evaluation
- Measure success rate improvement
- Update this issue with test results
Related
- PR [P1] Fix Azure nested virtualization (Issue #8) #11: [P1] Fix Azure nested virtualization
/tmp/AZURE_LONG_TERM_SOLUTION.mdSection 7, Phase 1- Failed job analysis:
waa-waa3718w0-1768743963-20a88242
Metadata
Metadata
Assignees
Labels
No labels