
fix(allocator): Fix VM startup race condition and organize logging#262

Merged
7174Andy merged 2 commits into main from fix/vm-startup-race-condition-and-logging
Feb 5, 2026

@7174Andy 7174Andy commented Feb 3, 2026

Summary

Fix a race condition in the /vm_startup endpoint where VMs would miss their CRD command assignment, and reorganize logging throughout both allocator and client packages for better admin visibility.

Changes

Fixed

  • Race condition in the /vm_startup endpoint (main.py:385-398): a client would miss its CRD command if the user submitted it before the client called the endpoint. The endpoint now checks whether a CRD command is already assigned before it starts waiting on PostgreSQL LISTEN/NOTIFY.

Changed (Allocator)

  • Reorganized logging in database.py:
    • Removed duplicate logging configuration at module level
    • Changed verbose DEBUG logs to INFO for important business events (VM assignment, CRD received, status changes)
    • Changed ERROR to WARNING for expected conditions (VM not found)
    • Removed unnecessary DEBUG logs for routine operations
    • Standardized error message format: "Failed to <action> for VM '<hostname>': {e}"
  • Methods now raise exceptions after rollback for proper error propagation (insert_vm, assign_vm, update_vm_in_use, etc.)
  • get_assigned_vms() now returns [] instead of None on error for consistency with get_unassigned_vms()
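
To illustrate the allocator changes listed above, here is a minimal sketch of the three patterns: roll back then re-raise, log expected conditions at WARNING, and return [] instead of None on a read error. It is not the code from database.py. The `Database` class, the `vms` table layout, and the use of `sqlite3` as a stand-in for the PostgreSQL connection are all assumptions made so the example runs on its own.

```python
import logging
import sqlite3

logger = logging.getLogger("allocator.database")


class Database:
    """Hypothetical stand-in for the allocator's database wrapper."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def assign_vm(self, hostname: str, username: str) -> None:
        try:
            cur = self.conn.execute(
                "UPDATE vms SET assigned_to = ? WHERE hostname = ?",
                (username, hostname),
            )
            self.conn.commit()
        except sqlite3.Error as e:
            # Roll back, log in the standardized format, then re-raise so
            # the caller sees the failure instead of a silent None.
            self.conn.rollback()
            logger.error("Failed to assign user for VM '%s': %s", hostname, e)
            raise
        if cur.rowcount == 0:
            # Expected condition, so WARNING rather than ERROR.
            logger.warning("VM '%s' not found", hostname)
        else:
            # Important business event, so INFO rather than DEBUG.
            logger.info("Assigned VM '%s' to user '%s'", hostname, username)

    def get_assigned_vms(self) -> list:
        try:
            return self.conn.execute(
                "SELECT hostname, assigned_to FROM vms "
                "WHERE assigned_to IS NOT NULL"
            ).fetchall()
        except sqlite3.Error as e:
            logger.error("Failed to fetch assigned VMs: %s", e)
            # Empty list rather than None, so callers can always iterate.
            return []
```

Returning an empty list keeps call sites such as `for vm in db.get_assigned_vms():` free of None checks, while write methods surface their errors to the caller.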

Changed (Client)

  • Reorganized logging in client service modules:
    • check_gpu.py: Removed verbose nvidia-smi output, streamlined status logs
    • connect_crd.py: Removed debug logs for args/command, added success log
    • subscribe.py: Cleaned up retry logs, removed config dump
    • update_inuse_status.py: Removed process iteration debug logs
  • Use INFO for service startup and important events
  • Use WARNING for retryable failures
  • Standardized log message format
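
As a rough illustration of the client-side log levels described above, the sketch below logs retryable failures at WARNING and success at INFO. The `call_with_retry` helper, the logger name, and the choice of ERROR for the final failed attempt are assumptions for this example, not code from subscribe.py.

```python
import logging
import time

logger = logging.getLogger("client.subscribe")


def call_with_retry(operation, attempts: int = 3, delay: float = 0.0):
    """Run `operation`, retrying on ConnectionError (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except ConnectionError as e:
            if attempt == attempts:
                # Out of retries: assumed to be an ERROR, then re-raised.
                logger.error("Failed to subscribe after %d attempts: %s", attempts, e)
                raise
            # Retryable failure: WARNING, not ERROR.
            logger.warning(
                "Subscribe attempt %d/%d failed, retrying: %s", attempt, attempts, e
            )
            time.sleep(delay)
        else:
            # Important event: INFO.
            logger.info("Subscribed successfully")
            return result
```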

Added

  • test_vm_startup_already_has_crd - tests race condition handling when CRD is pre-assigned
  • test_vm_startup_vm_not_found - tests 404 response when VM doesn't exist
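
The two added tests could take roughly the following shape. This is a self-contained sketch using `unittest` against a simplified, hypothetical `vm_startup` handler that returns a `(status_code, body)` tuple. The project's real tests, test framework, and response bodies may differ.

```python
import unittest


def vm_startup(vms: dict, hostname: str):
    """Hypothetical, simplified handler returning (status_code, body)."""
    vm = vms.get(hostname)
    if vm is None:
        return 404, {"detail": "VM not found"}
    if vm.get("crdcommand") and vm.get("pin"):
        # CRD command was assigned before the client called in.
        return 200, {"crdcommand": vm["crdcommand"], "pin": vm["pin"]}
    raise NotImplementedError("LISTEN/NOTIFY path omitted in this sketch")


class VMStartupTests(unittest.TestCase):
    def test_vm_startup_already_has_crd(self):
        vms = {"vm-1": {"crdcommand": "<crd command>", "pin": "<pin>"}}
        status, body = vm_startup(vms, "vm-1")
        self.assertEqual(status, 200)
        self.assertEqual(body, {"crdcommand": "<crd command>", "pin": "<pin>"})

    def test_vm_startup_vm_not_found(self):
        status, _ = vm_startup({}, "missing-vm")
        self.assertEqual(status, 404)
```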

Technical Details

The race condition occurred when:

  1. User submitted CRD command via /api/request_vm at time T
  2. PostgreSQL NOTIFY was sent immediately
  3. Client called /vm_startup at time T+N (after startup delay)
  4. Client started LISTEN but the NOTIFY had already been sent

The fix adds a check before listening: if the VM already has crdcommand and pin populated, the endpoint returns those values immediately instead of waiting for a NOTIFY that will never come.
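
The timeline and the fix can be reproduced with a small in-memory model. In this sketch the `Channel` class is a toy stand-in for PostgreSQL LISTEN/NOTIFY (a notification reaches only listeners that are already subscribed), and `vm_startup`, `row`, and the placeholder values are hypothetical names for illustration, not the code in main.py.

```python
import asyncio


class Channel:
    """Toy LISTEN/NOTIFY channel: nothing is queued for later listeners."""

    def __init__(self) -> None:
        self._waiters = []

    def listen(self) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return fut

    def notify(self, payload: dict) -> None:
        # Deliver only to listeners that are subscribed right now.
        for fut in self._waiters:
            if not fut.done():
                fut.set_result(payload)
        self._waiters.clear()


async def vm_startup(row: dict, channel: Channel, timeout: float) -> dict:
    # The fix: if the command was assigned before the client called in,
    # return it immediately instead of listening.
    if row.get("crdcommand") and row.get("pin"):
        return {"crdcommand": row["crdcommand"], "pin": row["pin"]}
    # Otherwise wait for the notification, as before.
    return await asyncio.wait_for(channel.listen(), timeout)


async def demo() -> dict:
    channel = Channel()
    row = {"hostname": "vm-1", "crdcommand": None, "pin": None}
    # Time T: the user submits the command; NOTIFY fires with no listener.
    row.update(crdcommand="<crd command>", pin="<pin>")
    channel.notify({"crdcommand": row["crdcommand"], "pin": row["pin"]})
    # Time T+N: the client calls in after its startup delay.
    return await vm_startup(row, channel, timeout=0.1)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the early check, the call at T+N would wait until the timeout, because the only notification was sent before anyone listened. A stricter variant would subscribe first and then check the row, which also covers a command that arrives between the check and the start of listening; the description here does not say whether the merged code does that.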

Testing

  • Allocator unit tests pass (119 tests in test_database.py + test_api_calls.py)
  • Client unit tests pass (50 tests)
  • Linting passes (ruff check)
  • Docker images build successfully
  • Tested locally with VM startup flow

Checklist

  • Code follows project conventions
  • Tests added/updated
  • No breaking changes
  • No secrets committed

🤖 Generated with Claude Code

7174Andy and others added 2 commits February 3, 2026 14:45
- Fix race condition in /vm_startup where client would miss CRD command
  if user submitted it before client called the endpoint. Now checks if
  CRD is already assigned before starting PostgreSQL LISTEN/NOTIFY.

- Reorganize logging in database.py:
  - Use INFO for important business events (VM assignment, CRD received)
  - Use WARNING for expected conditions (VM not found)
  - Remove verbose DEBUG logs for routine operations
  - Fix line-too-long warnings

- Update tests to match new behavior and add new test cases:
  - test_vm_startup_already_has_crd for race condition handling
  - test_vm_startup_vm_not_found for 404 case

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reorganize logging in client service modules:
  - Use INFO for service startup and important events
  - Use WARNING for retryable failures
  - Remove verbose DEBUG logs for routine operations
  - Standardize log message format

- Changes by file:
  - check_gpu.py: Remove verbose nvidia-smi output, streamline status logs
  - connect_crd.py: Remove debug logs for args and command, add success log
  - subscribe.py: Clean up retry logs, remove config dump
  - update_inuse_status.py: Remove process iteration debug logs

- Update tests to match new log message formats

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@7174Andy 7174Andy changed the title fix(allocator): Fix VM startup race condition and organize logging fix(allocator & client): Fix VM startup race condition and organize logging Feb 3, 2026
@7174Andy 7174Andy changed the title fix(allocator & client): Fix VM startup race condition and organize logging fix(allocator): Fix VM startup race condition and organize logging Feb 3, 2026
@7174Andy 7174Andy merged commit c1bf6b1 into main Feb 5, 2026
9 checks passed
@7174Andy 7174Andy deleted the fix/vm-startup-race-condition-and-logging branch February 5, 2026 22:25