@aparcar (Owner) commented Jan 25, 2026

Comprehensive research document analyzing how to integrate KernelCI
as a backend for OpenWrt testing infrastructure while preserving
the existing labgrid-based test framework.

Key findings:

  • KernelCI's new pull-mode architecture enables secure lab federation
  • Labgrid adapter approach (used by Pengutronix) is recommended
  • KCIDB-ng provides standardized results submission API
  • Phased implementation starting with results integration

Document includes:

  • Current infrastructure analysis (7 labs, 38+ devices)
  • KernelCI architecture overview (Maestro, KCIDB, Events)
  • Four integration options with trade-offs
  • Detailed 4-phase implementation plan
  • Technical specifications and code examples

claude added 13 commits January 24, 2026 20:14

Comprehensive research document analyzing how to integrate KernelCI
as a backend for OpenWrt testing infrastructure while preserving
the existing labgrid-based test framework.

Key findings:
- KernelCI's new pull-mode architecture enables secure lab federation
- Labgrid adapter approach (used by Pengutronix) is recommended
- KCIDB-ng provides standardized results submission API
- Phased implementation starting with results integration

Document includes:
- Current infrastructure analysis (7 labs, 38+ devices)
- KernelCI architecture overview (Maestro, KCIDB, Events)
- Four integration options with trade-offs
- Detailed 4-phase implementation plan
- Technical specifications and code examples

Major update to the KernelCI integration document focusing on
self-hosted deployment for OpenWrt firmware testing.

Key additions:
- Complete Docker Compose deployment stack
  - MongoDB, Redis, MinIO for storage
  - KernelCI API (Maestro) and Pipeline services
  - Dashboard with OpenWrt-specific views
  - Traefik reverse proxy with TLS

- Multi-source firmware management
  - Official OpenWrt releases (snapshot, stable, oldstable)
  - GitHub PR artifact integration
  - Custom developer upload API
  - Buildbot webhook integration

- Comprehensive health check system
  - Periodic device health monitoring
  - Automatic device disable on failures
  - GitHub issue creation/closure
  - Visual fleet status dashboard

- OpenWrt-specific adaptations
  - Custom firmware schema (replaces kernel builds)
  - Test plan definitions matching existing pytest suite
  - Feature-based job scheduling
  - Device capability mapping

- Labgrid adapter for pull-mode operation
  - Labs stay behind firewalls
  - Job polling from central KernelCI
  - Preserves existing 38+ device targets

- 5-phase implementation plan with clear deliverables

Implements the self-hosted KernelCI infrastructure for OpenWrt testing:

Docker Compose Stack:
- MongoDB 7.0 for data storage with initialization script
- Redis 7 for pub/sub messaging
- MinIO for S3-compatible artifact storage
- KernelCI API (Maestro) for job management
- Traefik reverse proxy with automatic TLS
- Pipeline services (trigger, scheduler, health, results)
- Dashboard for result visualization

Configuration:
- api-config.toml: KernelCI API settings with OpenWrt customizations
- pipeline.yaml: Firmware sources, test plans, scheduler settings
- mongo-init.js: Database collections and indexes
- .env.example: Environment variable template

Pipeline Core Modules:
- models.py: Pydantic models for firmware, jobs, results, devices
- config.py: Configuration loading from env and YAML
- api_client.py: Async HTTP client for KernelCI API

Key Features:
- Multi-source firmware support (official, PR, custom, buildbot)
- Test plan definitions matching existing pytest suite
- Device type mapping to OpenWrt targets
- Health check configuration
- JWT authentication
- S3 artifact storage

Implements firmware source modules for multi-source firmware ingestion:

Official Release Source (official.py):
- Scans downloads.openwrt.org for profiles.json files
- Supports snapshot, stable, and oldstable releases
- Extracts firmware metadata and artifact URLs
- Calculates checksums for verification
- Configurable target filtering for efficiency
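
The checksum step listed above could be sketched as follows. This is a minimal illustration, not the code in official.py: it assumes SHA-256 digests (the commit message does not name the algorithm), and the helper names `sha256_of_file` and `verify_firmware` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat for large firmware images.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_firmware(path: Path, expected_sha256: str) -> bool:
    """Compare a downloaded image against the digest published with it."""
    # Assumption: the published digest is a hex string, possibly padded
    # with whitespace or written in upper case.
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A caller would typically reject or re-download an image for which `verify_firmware` returns `False` before creating a firmware entry.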

GitHub PR Source (github_pr.py):
- Monitors PRs with trigger labels (ci-test-requested)
- Extracts firmware from workflow run artifacts
- Parses target info from artifact names
- Supports PR status updates and comments
- Automatic artifact download and extraction

Custom Upload Handler (custom.py):
- FastAPI router for firmware uploads
- Validates file size and extensions
- Stores firmware in MinIO
- Generates unique firmware IDs
- Auto-detects firmware type from filename
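
The validation and type-detection bullets above might look roughly like this. It is a sketch under stated assumptions: the size limit, the allowed extensions, and the filename markers are illustrative values, not taken from custom.py, and the function names are hypothetical.

```python
from pathlib import PurePath

# Assumed limits; the real values are not given in the commit message.
MAX_FIRMWARE_BYTES = 64 * 1024 * 1024
ALLOWED_SUFFIXES = (".bin", ".img", ".img.gz", ".itb", ".trx")

# Assumed markers, based on common OpenWrt image naming.
TYPE_MARKERS = ("sysupgrade", "factory", "initramfs")


def detect_firmware_type(filename: str) -> str:
    """Guess the image type from its filename, or return 'unknown'."""
    name = filename.lower()
    for marker in TYPE_MARKERS:
        if marker in name:
            return marker
    return "unknown"


def validate_upload(filename: str, size: int) -> str:
    """Validate an upload and return its detected firmware type.

    Raises ValueError if the extension or size is not acceptable.
    """
    # Drop any directory components a client may have sent.
    name = PurePath(filename).name.lower()
    if not name.endswith(ALLOWED_SUFFIXES):
        raise ValueError(f"unsupported file extension: {name!r}")
    if size <= 0 or size > MAX_FIRMWARE_BYTES:
        raise ValueError(f"file size out of range: {size} bytes")
    return detect_firmware_type(name)
```

In a FastAPI router, a `ValueError` raised here would normally be translated into an HTTP 4xx response before anything is written to MinIO.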

Firmware Trigger Service (firmware_trigger.py):
- Main orchestration service
- Initializes and manages all sources
- Periodic scanning with configurable intervals
- Creates firmware entries in KernelCI API
- Publishes events for job scheduling
- Includes health check endpoint
- FastAPI server for upload API

Base Classes:
- FirmwareSource abstract base class
- Consistent interface for all source types
- Async generator pattern for scanning
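
The base-class bullets above describe an abstract source with an async-generator scan method. A minimal sketch of that pattern follows. Only the `FirmwareSource` name comes from the commit message: the `Firmware` fields, `StaticSource`, and `collect` are hypothetical, and a plain dataclass stands in for the Pydantic models to keep the example dependency-free.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator, Iterable, List


@dataclass(frozen=True)
class Firmware:
    """Illustrative firmware record; the field names are assumptions."""
    source: str
    target: str
    version: str
    url: str


class FirmwareSource(ABC):
    """Common interface: every source exposes an async `scan()` generator."""

    name: str = "base"

    @abstractmethod
    def scan(self) -> AsyncIterator[Firmware]:
        """Yield firmware entries as they are discovered."""


class StaticSource(FirmwareSource):
    """Toy source that yields a fixed list, standing in for a real scanner."""

    name = "static"

    def __init__(self, entries: Iterable[Firmware]) -> None:
        self._entries = list(entries)

    async def scan(self) -> AsyncIterator[Firmware]:
        for entry in self._entries:
            await asyncio.sleep(0)  # give other tasks a chance to run
            yield entry


async def collect(sources: Iterable[FirmwareSource]) -> List[Firmware]:
    """Drain every source in turn, as a trigger service might do per scan."""
    found: List[Firmware] = []
    for source in sources:
        async for firmware in source.scan():
            found.append(firmware)
    return found
```

Because `scan()` is a generator, the consumer can start creating entries as soon as the first image is found instead of waiting for a full scan to finish.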

Implements the bridge between KernelCI and labgrid test labs using
pull-mode architecture where labs fetch jobs from the central API.

Labgrid Adapter (kernelci/labgrid-adapter/):
- Dockerfile with QEMU and serial tools
- Pull-mode job poller (poller.py)
  - Registers lab with KernelCI API
  - Sends periodic heartbeats
  - Polls for pending jobs matching device capabilities
  - Claims and dispatches jobs to executor
- Test executor (executor.py)
  - Downloads firmware artifacts with caching
  - Builds pytest command with labgrid integration
  - Captures console logs and test output
  - Parses pytest JSON results
  - Uploads logs to MinIO storage
- Main service (service.py)
  - Discovers devices from target YAML files
  - Extracts features from labgrid configs
  - Coordinates poller and executor
  - Handles graceful shutdown
- Configuration via environment variables
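
The poll, claim, and dispatch cycle described above can be sketched without any network code. Everything here is hypothetical: `JobApi` and `InMemoryApi` stand in for the KernelCI API client, whose real endpoints are not shown in the commit message, and the 30-second interval is an assumed default.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional, Protocol, Set


@dataclass
class Job:
    job_id: str
    device_type: str
    claimed_by: Optional[str] = None


class JobApi(Protocol):
    """What the poller needs from the central API (assumed interface)."""

    async def next_pending(self, device_types: Set[str]) -> Optional[Job]: ...

    async def claim(self, job: Job, lab: str) -> bool: ...


class InMemoryApi:
    """In-memory stand-in for the central API, for illustration only."""

    def __init__(self, jobs: List[Job]) -> None:
        self.jobs = list(jobs)

    async def next_pending(self, device_types: Set[str]) -> Optional[Job]:
        for job in self.jobs:
            if job.claimed_by is None and job.device_type in device_types:
                return job
        return None

    async def claim(self, job: Job, lab: str) -> bool:
        if job.claimed_by is not None:
            return False  # another lab claimed it first
        job.claimed_by = lab
        return True


Executor = Callable[[Job], Awaitable[None]]


async def poll_once(api: JobApi, lab: str, device_types: Set[str],
                    execute: Executor) -> bool:
    """Fetch, claim, and run at most one job; return True if one ran."""
    job = await api.next_pending(device_types)
    if job is None or not await api.claim(job, lab):
        return False
    await execute(job)
    return True


async def poll_forever(api: JobApi, lab: str, device_types: Set[str],
                       execute: Executor, interval: float = 30.0) -> None:
    """Poll continuously; sleep only when there was nothing to do."""
    while True:
        if not await poll_once(api, lab, device_types, execute):
            await asyncio.sleep(interval)
```

All connections originate from the lab, which is what lets labs stay behind firewalls: the central API never needs to reach a lab host.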

Test Scheduler (openwrt-pipeline/test_scheduler.py):
- Listens for new firmware events
- Finds compatible devices based on target/subtarget
- Creates test jobs with appropriate test plans
- Feature-based test plan assignment
- Priority-based scheduling (PR > snapshot > stable)
- Handles job monitoring and timeouts
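
The priority ordering above (PR > snapshot > stable) maps naturally onto a heap. A minimal sketch, assuming string source labels and a lowest-priority fallback for any other source; `JobQueue` and the numeric values are hypothetical, not taken from test_scheduler.py.

```python
import heapq
import itertools
from typing import List, Tuple

# Lower number means scheduled first. Only the PR > snapshot > stable
# ordering comes from the commit message; the numbers are assumptions.
PRIORITY = {"pr": 0, "snapshot": 1, "stable": 2}
DEFAULT_PRIORITY = 3  # assumed fallback for any other source


class JobQueue:
    """Priority queue that keeps FIFO order within one priority level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker for equal priority

    def push(self, source: str, job_id: str) -> None:
        priority = PRIORITY.get(source, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def pop(self) -> str:
        """Remove and return the most urgent job ID."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The monotonically increasing counter matters: without it, two jobs of equal priority would be ordered by comparing their IDs rather than by arrival time.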

Key Features:
- Labs stay behind firewalls (pull-mode)
- Automatic device discovery from target files
- Feature-based test filtering
- Firmware caching for efficiency
- Console log capture and upload
- pytest JSON result parsing

Implements comprehensive device health monitoring with automated
notifications and device management.

Device Registry (health/device_registry.py):
- Tracks health status for all devices
- Status levels: healthy, failing, disabled, unknown
- Configurable failure thresholds (warning, disable)
- Last check and consecutive failure tracking
- Bulk status queries and summary generation
- Automatic status transitions based on results
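
The threshold-driven status transitions above could be modelled as a small state machine. This is a sketch under assumptions: the four status names come from the commit message, but the threshold defaults, the `DeviceHealth` class, and the exact transition rules are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    UNKNOWN = "unknown"
    HEALTHY = "healthy"
    FAILING = "failing"
    DISABLED = "disabled"


@dataclass
class DeviceHealth:
    name: str
    warning_threshold: int = 1   # assumed default
    disable_threshold: int = 3   # assumed default
    consecutive_failures: int = 0
    status: Status = Status.UNKNOWN

    def record(self, passed: bool) -> Status:
        """Apply one health-check result and return the new status."""
        if passed:
            # Any success resets the counter and marks the device healthy.
            self.consecutive_failures = 0
            self.status = Status.HEALTHY
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.disable_threshold:
                self.status = Status.DISABLED
            elif self.consecutive_failures >= self.warning_threshold:
                self.status = Status.FAILING
            # Below the warning threshold the previous status is kept.
        return self.status
```

Counting consecutive failures, rather than total failures, means a single transient error does not accumulate toward disabling a device that otherwise passes.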

Notification Manager (health/notifications.py):
- GitHub issue creation for disabled devices
- Auto-close issues when devices recover
- Issue caching to prevent duplicates
- Formatted issue body with device details
- Console log links in issues
- Resolution steps documentation

Health Check Scheduler (health/scheduler.py):
- Periodic check scheduling based on interval
- High-priority health check job creation
- Job completion monitoring
- Result processing with status updates
- Recovery detection and notification
- Manual health check trigger API
- Status reporting endpoint

Key Features:
- Devices automatically disabled after threshold failures
- GitHub issues track device problems
- Automatic issue closure on recovery
- Minimal tests (shell + SSH) for quick checks
- Skip firmware flash for health checks
- Concurrent schedule and monitor loops

Add React TypeScript components for the KernelCI dashboard:

- DeviceFleetStatus: Visual overview of devices across all labs with
  health status indicators, feature tags, and quick actions

- FirmwareMatrix: Matrix view showing test results with devices as rows
  and firmware versions as columns, with drill-down to individual tests

- HealthCheckDashboard: Device health monitoring with summary stats,
  device status table, health check history timeline, and manual controls

- PRStatusView: GitHub PR testing status with PR list, test progress,
  job details, and direct links to GitHub and artifacts

Components are designed either to integrate with the KernelCI dashboard
or to be deployed as a custom dashboard extension.

Update labgrid adapter configuration to use the modern gRPC-based
coordinator instead of the legacy Crossbar/WAMP protocol:

- Rename lg_crossbar config to lg_coordinator (host:port format)
- Set LG_COORDINATOR environment variable for pytest execution
- Add grpcio dependencies to requirements.txt

- Remove unused imports across all modules
- Fix f-strings without placeholders (use plain strings for structlog)
- Rename ambiguous variable 'l' to 'lbl' in github_pr.py
- Remove unused local variables
- Sort imports with isort rules
- Apply consistent code formatting

- Add ruff and isort configuration to pyproject.toml
- Configure ruff to handle import sorting (I rules)

- Remove test_lan_interface_has_neighbor which fails inconsistently
  (IPv6 multicast ping doesn't always return DUP! responses)
- Update test plan configs to remove the flaky test

- Break long f-strings across multiple lines
- Extract long shell commands into variables
- Wrap long docstrings at 88 characters
- Fix commented code line lengths

Remove custom dashboard components - use the standard KernelCI dashboard
instead (ghcr.io/kernelci/dashboard). The dashboard connects to the
same API and provides all needed visualization.

Move health check from pipeline to labgrid-adapter:
- Health checks are a lab maintenance concern, not public-facing
- Lab maintainers run checks locally, not via KernelCI
- Add standalone health_check.py tool for lab maintainers

Removed:
- kernelci/dashboard/ (custom React components)
- kernelci/openwrt-pipeline/openwrt_pipeline/health/ (pipeline health)
- pipeline-health and pipeline-results services from docker-compose

Added:
- labgrid_kci_adapter/health_check.py (lab-side tool)

Add automatic health check functionality to the labgrid adapter:

- Health checks run every 24 hours by default (configurable via
  HEALTH_CHECK_INTERVAL environment variable)
- Devices that fail health checks are removed from the job pool
- Devices that recover are automatically re-added
- Initial health check runs at startup before accepting jobs

Configuration options:
- HEALTH_CHECK_INTERVAL: seconds between checks (default: 86400 = 24h)
- HEALTH_CHECK_ENABLED: set to false to disable (default: true)
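
Reading the two options above might look like the following. The variable names and defaults come from the commit message; the `HealthCheckConfig` class, the loader function, and the extra falsy spellings ("0", "no", "off") are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# "false" is the documented way to disable; the other spellings are an
# assumed convenience.
FALSY = {"false", "0", "no", "off"}


@dataclass(frozen=True)
class HealthCheckConfig:
    enabled: bool = True
    interval_seconds: int = 86400  # 24 hours


def load_health_check_config(
    env: Mapping[str, str] = os.environ,
) -> HealthCheckConfig:
    """Build the health check settings from environment variables."""
    raw_enabled = env.get("HEALTH_CHECK_ENABLED", "true").strip().lower()
    interval = int(env.get("HEALTH_CHECK_INTERVAL", "86400"))
    if interval <= 0:
        raise ValueError("HEALTH_CHECK_INTERVAL must be a positive integer")
    return HealthCheckConfig(
        enabled=raw_enabled not in FALSY,
        interval_seconds=interval,
    )
```

Passing the environment as a mapping keeps the loader easy to test: a plain dict can replace `os.environ`.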

This ensures only working devices receive test jobs from KernelCI,
and lab maintainers are informed via logs when devices fail.