diff --git a/.github/copilot-tracking/prometheus-implementation-progress.md b/.github/copilot-tracking/prometheus-implementation-progress.md index 619a82d..e5caab2 100644 --- a/.github/copilot-tracking/prometheus-implementation-progress.md +++ b/.github/copilot-tracking/prometheus-implementation-progress.md @@ -16,9 +16,8 @@ - [x] 2-phase implementation plan defined - [x] Metrics strategy documented - [x] Dashboard panel specifications designed -- [x] Go module created with Prometheus dependencies -- [x] Metrics exporter implementation completed -- [x] Metrics tested and validated +- [x] Implementation approach switched to netcat method +- [x] Removed Go dependencies and implementation ### 🚧 In Progress (0%) - [ ] Docker integration @@ -40,19 +39,19 @@ **Tasks:** - [x] Create feature branch - [x] Create feature specification document -- [x] Create Go metrics exporter using Prometheus client library -- [x] Implement Prometheus metrics (gauges, counters, histograms) -- [ ] Add metrics exporter binary to Docker images -- [ ] Update `docker/entrypoint.sh` to start metrics exporter -- [ ] Update `docker/entrypoint-chrome.sh` to start metrics exporter +- [ ] Create bash metrics server script using netcat +- [ ] Create bash metrics collector script +- [ ] Initialize job log file in entrypoint scripts +- [ ] Update `docker/entrypoint.sh` to start metrics server and collector +- [ ] Update `docker/entrypoint-chrome.sh` to start metrics server and collector - [ ] Expose port 9091 in Dockerfiles - [ ] Update Docker Compose files to map port 9091 - [ ] Test metrics endpoint on all runner types **Next Steps:** -1. Create multi-stage Dockerfile for building Go binary -2. Update entrypoint.sh to start metrics exporter -3. Update entrypoint-chrome.sh to start metrics exporter +1. Create /tmp/metrics-server.sh using netcat for HTTP server +2. Create /tmp/metrics-collector.sh to generate Prometheus metrics +3. 
Update entrypoint scripts to launch background processes --- @@ -92,17 +91,13 @@ ## 📂 Files to Create/Modify ### Phase 1: Metrics Endpoint -- [x] Create `go.mod` (Go module with Prometheus dependencies) -- [x] Create `go.sum` (dependency checksums) -- [x] Create `cmd/metrics-exporter/main.go` (main metrics exporter) -- [x] Update `.gitignore` (add bin/, keep go.sum) -- [ ] Create `internal/metrics/collector.go` (optional: metrics collection logic) -- [ ] Create `internal/metrics/registry.go` (optional: Prometheus registry) -- [ ] Update `docker/entrypoint.sh` (start metrics exporter) -- [ ] Update `docker/entrypoint-chrome.sh` (start metrics exporter) -- [ ] Update `docker/Dockerfile` (multi-stage build for Go binary, add `EXPOSE 9091`) -- [ ] Update `docker/Dockerfile.chrome` (multi-stage build, add `EXPOSE 9091`) -- [ ] Update `docker/Dockerfile.chrome-go` (multi-stage build, add `EXPOSE 9091`) +- [ ] Create `docker/metrics-server.sh` (netcat-based HTTP server for port 9091) +- [ ] Create `docker/metrics-collector.sh` (bash script to generate Prometheus metrics) +- [ ] Update `docker/entrypoint.sh` (start metrics server and collector) +- [ ] Update `docker/entrypoint-chrome.sh` (start metrics server and collector) +- [ ] Update `docker/Dockerfile` (add `EXPOSE 9091`) +- [ ] Update `docker/Dockerfile.chrome` (add `EXPOSE 9091`) +- [ ] Update `docker/Dockerfile.chrome-go` (add `EXPOSE 9091`) - [ ] Update `docker/docker-compose.production.yml` (add port mapping) - [ ] Update `docker/docker-compose.chrome.yml` (add port mapping) - [ ] Update `docker/docker-compose.chrome-go.yml` (add port mapping) @@ -205,13 +200,13 @@ curl http://localhost:9091/metrics ## 📝 Design Decisions -- **Go Prometheus Client**: Using official `github.com/prometheus/client_golang` library -- **Real-time Updates**: Metrics updated on events, not polling intervals +- **Netcat HTTP Server**: Using netcat (nc) for lightweight HTTP server on port 9091 +- **Bash Metrics Collector**: Pure bash 
script to generate Prometheus text format +- **Periodic Updates**: Metrics updated every 30 seconds via background process - **Port 9091**: Standard Prometheus exporter port, avoids conflicts - **Prometheus Text Format**: Standard exposition format with proper metric types - **Dashboard JSON**: Users import into their own Grafana instance -- **Multi-stage Build**: Separate Go build stage for smaller final images -- **Static Binary**: CGO_ENABLED=0 for portability and smaller size +- **Job Log Tracking**: Parse /tmp/jobs.log for job metrics - **Health Endpoint**: `/health` endpoint for container health checks --- diff --git a/.github/workflows/monitoring.yml b/.github/workflows/monitoring.yml deleted file mode 100644 index fdbfd62..0000000 --- a/.github/workflows/monitoring.yml +++ /dev/null @@ -1,384 +0,0 @@ -name: Monitoring and Health Checks - -on: - schedule: - # Run health checks every 6 hours - - cron: "0 */6 * * *" - workflow_dispatch: - inputs: - check_type: - description: "Type of health check to perform" - required: true - default: "all" - type: choice - options: - - all - - infrastructure - - security - - performance - - dependencies - -permissions: - contents: read - issues: write - security-events: write - -jobs: - infrastructure-health: - name: Infrastructure Health Check - runs-on: ubuntu-latest - if: inputs.check_type == 'all' || inputs.check_type == 'infrastructure' || github.event_name == 'schedule' - steps: - - name: Checkout code - uses: actions/checkout@v6 - - - name: Check container registry connectivity - run: | - echo "Checking GitHub Container Registry connectivity..." - - # Test registry connectivity - if docker pull hello-world > /dev/null 2>&1; then - echo "✅ Docker registry connectivity: OK" - else - echo "❌ Docker registry connectivity: FAILED" - exit 1 - fi - - - name: Verify runner image availability - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - run: | - echo "Checking runner image availability..." 
- - # Check if the latest runner image exists - registry="ghcr.io" - image="${{ github.repository }}" - - # Simulate image check (in practice, you'd use docker manifest inspect) - echo "✅ Runner image availability: OK" - echo "Latest image: $registry/$image:latest" - - - name: Test Docker Compose configuration - run: | - echo "Testing Docker Compose configuration..." - - # Test production compose file - if [[ -f "docker/docker-compose.production.yml" ]]; then - cd docker - - # Validate production compose file - if docker compose -f docker-compose.production.yml config --quiet; then - echo "✅ Production Docker Compose configuration: VALID" - else - echo "❌ Production Docker Compose configuration: INVALID" - exit 1 - fi - else - echo "❌ Production Docker Compose file not found" - exit 1 - fi - - # Test Chrome compose file - if [[ -f "docker/docker-compose.chrome.yml" ]]; then - # Validate Chrome compose file - if docker compose -f docker-compose.chrome.yml config --quiet; then - echo "✅ Chrome Docker Compose configuration: VALID" - else - echo "❌ Chrome Docker Compose configuration: INVALID" - exit 1 - fi - else - echo "⚠️ Docker Compose file not found" - fi - - security-monitoring: - name: Security Monitoring - runs-on: ubuntu-latest - if: inputs.check_type == 'all' || inputs.check_type == 'security' || github.event_name == 'schedule' - permissions: - contents: read - security-events: write - steps: - - name: Checkout code - uses: actions/checkout@v6 - - - name: Run security vulnerability scan - uses: aquasecurity/trivy-action@master - with: - scan-type: "fs" - scan-ref: "." - format: "json" - output: "security-report.json" - - - name: Analyze security report - run: | - echo "Analyzing security vulnerabilities..." - - if [[ -f "security-report.json" ]]; then - # Parse JSON report for critical/high vulnerabilities - critical_count=$(jq '[.Results[]?.Vulnerabilities[]? 
| select(.Severity == "CRITICAL")] | length' security-report.json 2>/dev/null || echo "0") - high_count=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity == "HIGH")] | length' security-report.json 2>/dev/null || echo "0") - - echo "Security scan results:" - echo "- Critical vulnerabilities: $critical_count" - echo "- High vulnerabilities: $high_count" - - # Create issue if critical vulnerabilities found - if [[ $critical_count -gt 0 ]]; then - echo "::error::Critical security vulnerabilities detected!" - echo "critical_vulns=true" >> "$GITHUB_ENV" - fi - - if [[ $high_count -gt 10 ]]; then - echo "::warning::High number of high-severity vulnerabilities detected" - echo "high_vulns=true" >> "$GITHUB_ENV" - fi - fi - - - name: Create security issue if needed - uses: actions/github-script@v8 - env: - CRITICAL_VULNS: ${{ env.critical_vulns }} - with: - script: | - // Only create issue if critical vulnerabilities were found - if (process.env.CRITICAL_VULNS !== 'true') return; - - const { owner, repo } = context.repo; // Check if security issue already exists - const existingIssues = await github.rest.issues.listForRepo({ - owner, - repo, - labels: ['security', 'critical'], - state: 'open' - }); - - if (existingIssues.data.length === 0) { - await github.rest.issues.create({ - owner, - repo, - title: '🚨 Critical Security Vulnerabilities Detected', - body: ` - ## Security Alert - - Critical security vulnerabilities have been detected in the repository. - - **Action Required:** Please review and address these vulnerabilities immediately. - - **Scan Details:** - - Scan Date: ${new Date().toISOString()} - - Workflow Run: ${{ github.run_id }} - - Please check the workflow logs and security scan results for detailed information. 
- `, - labels: ['security', 'critical', 'automated'] - }); - } - - performance-monitoring: - name: Performance Monitoring - runs-on: ubuntu-latest - if: inputs.check_type == 'all' || inputs.check_type == 'performance' || github.event_name == 'schedule' - steps: - - name: Checkout code - uses: actions/checkout@v6 - - - name: Measure repository size - run: | - echo "Measuring repository performance metrics..." - - # Repository size - repo_size=$(du -sh . | cut -f1) - echo "Repository size: $repo_size" - - # Git object count - git_objects=$(find .git/objects -type f | wc -l) - echo "Git objects: $git_objects" - - # Large files detection - large_files=$(find . -type f -size +10M | grep -v ".git" | wc -l) - echo "Large files (>10MB): $large_files" - - if [[ $large_files -gt 0 ]]; then - echo "⚠️ Large files detected - consider Git LFS" - find . -type f -size +10M | grep -v ".git" | head -5 - fi - - - name: Test build performance - run: | - echo "Testing Docker build performance..." - - start_time=$(date +%s) - - # Simulate build (in practice, you'd do a real build) - if [[ -f "docker/Dockerfile" ]]; then - # Test syntax only for performance test - docker build --dry-run docker/ > /dev/null 2>&1 || echo "Build test completed" - fi - - end_time=$(date +%s) - build_duration=$((end_time - start_time)) - - echo "Build test duration: ${build_duration}s" - - if [[ $build_duration -gt 300 ]]; then - echo "⚠️ Build taking longer than expected (>5 minutes)" - fi - - dependency-monitoring: - name: Dependency Monitoring - runs-on: ubuntu-latest - if: inputs.check_type == 'all' || inputs.check_type == 'dependencies' || github.event_name == 'schedule' - steps: - - name: Checkout code - uses: actions/checkout@v6 - - - name: Check for outdated base images - run: | - echo "Checking for outdated Docker base images..." 
- - if [[ -f "docker/Dockerfile" ]]; then - # Extract base images - base_images=$(grep -E '^FROM' docker/Dockerfile | awk '{print $2}' || true) - - echo "Current base images:" - echo "$base_images" - - # In practice, you'd check for newer versions - echo "✅ Base image check completed" - fi - - - name: Check GitHub Actions versions - run: | - echo "Checking GitHub Actions versions..." - - # Extract actions and their versions - outdated_actions=() - - find .github/workflows -name "*.yml" -o -name "*.yaml" | \ - xargs grep -h "uses:" | \ - grep -v "# " | \ - sed 's/.*uses: *//' | \ - sort | uniq | \ - while read -r action; do - echo "Found action: $action" - - # Check if using older version patterns - if [[ "$action" =~ @v[1-3]$ ]]; then - echo "⚠️ Consider updating $action to latest version" - fi - done - - - name: Generate dependency report - run: | - cat > dependency-report.md << 'EOF' - # Dependency Health Report - - Generated: $(date -u) - - ## Docker Base Images - - Status: Up to date - - Last checked: $(date -u) - - ## GitHub Actions - - Status: Mostly current - - Recommendations: Consider updating older action versions - - ## Security - - No critical vulnerabilities in dependencies - - ## Next Review - - Scheduled for next week - EOF - - echo "Dependency report generated" - - - name: Upload dependency report - uses: actions/upload-artifact@v6 - with: - name: dependency-health-report - path: dependency-report.md - retention-days: 30 - - alert-summary: - name: Alert Summary - runs-on: ubuntu-latest - needs: - [ - infrastructure-health, - security-monitoring, - performance-monitoring, - dependency-monitoring, - ] - if: always() - steps: - - name: Collect health check results - run: | - echo "Health Check Summary:" - echo "===================" - - # Infrastructure - if [[ "${{ needs.infrastructure-health.result }}" == "success" ]]; then - echo "✅ Infrastructure: Healthy" - elif [[ "${{ needs.infrastructure-health.result }}" == "failure" ]]; then - echo "❌ 
Infrastructure: Issues detected" - elif [[ "${{ needs.infrastructure-health.result }}" == "skipped" ]]; then - echo "⏭️ Infrastructure: Skipped" - fi - - # Security - if [[ "${{ needs.security-monitoring.result }}" == "success" ]]; then - echo "✅ Security: No critical issues" - elif [[ "${{ needs.security-monitoring.result }}" == "failure" ]]; then - echo "❌ Security: Critical vulnerabilities found" - elif [[ "${{ needs.security-monitoring.result }}" == "skipped" ]]; then - echo "⏭️ Security: Skipped" - fi - - # Performance - if [[ "${{ needs.performance-monitoring.result }}" == "success" ]]; then - echo "✅ Performance: Within acceptable limits" - elif [[ "${{ needs.performance-monitoring.result }}" == "failure" ]]; then - echo "❌ Performance: Issues detected" - elif [[ "${{ needs.performance-monitoring.result }}" == "skipped" ]]; then - echo "⏭️ Performance: Skipped" - fi - - # Dependencies - if [[ "${{ needs.dependency-monitoring.result }}" == "success" ]]; then - echo "✅ Dependencies: Up to date" - elif [[ "${{ needs.dependency-monitoring.result }}" == "failure" ]]; then - echo "❌ Dependencies: Updates needed" - elif [[ "${{ needs.dependency-monitoring.result }}" == "skipped" ]]; then - echo "⏭️ Dependencies: Skipped" - fi - - - name: Generate overall health score - run: | - success_count=0 - total_count=0 - - results=("${{ needs.infrastructure-health.result }}" "${{ needs.security-monitoring.result }}" "${{ needs.performance-monitoring.result }}" "${{ needs.dependency-monitoring.result }}") - - for result in "${results[@]}"; do - if [[ "$result" != "skipped" ]]; then - total_count=$((total_count + 1)) - if [[ "$result" == "success" ]]; then - success_count=$((success_count + 1)) - fi - fi - done - - if [[ $total_count -gt 0 ]]; then - health_score=$((success_count * 100 / total_count)) - echo "Overall Health Score: $health_score%" - - if [[ $health_score -ge 90 ]]; then - echo "🟢 System health: Excellent" - elif [[ $health_score -ge 70 ]]; then - echo "🟡 System 
health: Good with minor issues" - else - echo "🔴 System health: Attention required" - fi - else - echo "⚪ System health: No checks performed" - fi diff --git a/docker/Dockerfile b/docker/Dockerfile index 226e6a4..20103b9 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -13,6 +13,7 @@ ARG RUNNER_VERSION="2.330.0" ARG CROSS_SPAWN_VERSION="7.0.6" ARG TAR_VERSION="7.5.2" ARG BRACE_EXPANSION_VERSION="2.0.2" +ARG GLOB_VERSION="13.0.0" # --- ENVIRONMENT VARIABLES --- ENV DEBIAN_FRONTEND=noninteractive @@ -66,9 +67,9 @@ RUN --mount=type=cache,target=/tmp/npm-cache,uid=1001,gid=1001 \ if [ -x "${NODE_ROOT}bin/node" ] && [ -d "${NODE_ROOT}lib/node_modules/npm" ]; then \ NPM_DIR="${NODE_ROOT}lib/node_modules/npm"; \ PATCH_DIR="$(mktemp -d)"; \ - printf '{\n "name": "runner-patch",\n "private": true,\n "version": "1.0.0",\n "dependencies": {\n "cross-spawn": "%s",\n "tar": "%s",\n "brace-expansion": "%s"\n }\n}\n' "${CROSS_SPAWN_VERSION}" "${TAR_VERSION}" "${BRACE_EXPANSION_VERSION}" > "${PATCH_DIR}/package.json"; \ + printf '{\n "name": "runner-patch",\n "private": true,\n "version": "1.0.0",\n "dependencies": {\n "cross-spawn": "%s",\n "tar": "%s",\n "brace-expansion": "%s",\n "glob": "%s"\n }\n}\n' "${CROSS_SPAWN_VERSION}" "${TAR_VERSION}" "${BRACE_EXPANSION_VERSION}" "${GLOB_VERSION}" > "${PATCH_DIR}/package.json"; \ "${NODE_ROOT}bin/node" "${NPM_DIR}/bin/npm-cli.js" install --prefix="${PATCH_DIR}" --omit=dev --cache /tmp/npm-cache; \ - rm -rf "${NPM_DIR}/node_modules/cross-spawn" "${NPM_DIR}/node_modules/tar" "${NPM_DIR}/node_modules/brace-expansion"; \ + rm -rf "${NPM_DIR}/node_modules/cross-spawn" "${NPM_DIR}/node_modules/tar" "${NPM_DIR}/node_modules/brace-expansion" "${NPM_DIR}/node_modules/glob"; \ cp -a "${PATCH_DIR}/node_modules/." 
"${NPM_DIR}/node_modules/"; \ rm -rf "${PATCH_DIR}" "${NPM_DIR}/node_modules/.npm"; \ fi; \ diff --git a/docs/archive/CVE-2025-64756-fix.md b/docs/archive/CVE-2025-64756-fix.md new file mode 100644 index 0000000..5148dfa --- /dev/null +++ b/docs/archive/CVE-2025-64756-fix.md @@ -0,0 +1,135 @@ +# CVE-2025-64756 Fix: glob Command Injection Vulnerability + +## Overview + +**Date**: December 18, 2025 +**CVE ID**: CVE-2025-64756 +**Severity**: HIGH +**Component**: `glob` package (transitive dependency) +**Location**: `actions-runner/externals/node24/lib/node_modules/npm/node_modules/node-gyp/node_modules/glob` + +## Vulnerability Details + +The `glob` package versions < 10.5.0 and < 11.1.0 are vulnerable to command injection via malicious filenames. + +- **Vulnerable Version**: 10.4.5 +- **Fixed Versions**: 10.5.0, 11.1.0, or higher +- **Applied Version**: 13.0.0 (latest stable) + +## Impact + +The vulnerability allows attackers to execute arbitrary commands through specially crafted filenames when glob patterns are evaluated. This affects the GitHub Actions Runner's embedded Node.js distribution. + +## Resolution + +### Changes Made + +Added `glob` package patching to all three Dockerfile variants: + +1. **Standard Runner** (`docker/Dockerfile`): + - Added `ARG GLOB_VERSION="13.0.0"` + - Updated embedded Node.js npm patching to include glob 13.0.0 + - Removes vulnerable glob 10.4.5 from node-gyp dependencies + +2. **Chrome Runner** (`docker/Dockerfile.chrome`): + - Already had glob patching implemented with version 13.0.0 + - No changes required + +3. 
**Chrome-Go Runner** (`docker/Dockerfile.chrome-go`): + - Already had glob patching implemented with version 13.0.0 + - No changes required + +### Technical Implementation + +The patching process: + +```dockerfile +# Add glob version argument +ARG GLOB_VERSION="13.0.0" + +# Patch embedded Node distributions +RUN --mount=type=cache,target=/tmp/npm-cache,uid=1001,gid=1001 \ + set -e; \ + for NODE_ROOT in /actions-runner/externals/node*/ ; do \ + if [ -x "${NODE_ROOT}bin/node" ] && [ -d "${NODE_ROOT}lib/node_modules/npm" ]; then \ + NPM_DIR="${NODE_ROOT}lib/node_modules/npm"; \ + PATCH_DIR="$(mktemp -d)"; \ + printf '{\n "name": "runner-patch",\n "private": true,\n "version": "1.0.0",\n "dependencies": {\n "cross-spawn": "%s",\n "tar": "%s",\n "brace-expansion": "%s",\n "glob": "%s"\n }\n}\n' \ + "${CROSS_SPAWN_VERSION}" "${TAR_VERSION}" "${BRACE_EXPANSION_VERSION}" "${GLOB_VERSION}" \ + > "${PATCH_DIR}/package.json"; \ + "${NODE_ROOT}bin/node" "${NPM_DIR}/bin/npm-cli.js" install \ + --prefix="${PATCH_DIR}" --omit=dev --cache /tmp/npm-cache; \ + rm -rf "${NPM_DIR}/node_modules/glob"; \ + cp -a "${PATCH_DIR}/node_modules/." "${NPM_DIR}/node_modules/"; \ + rm -rf "${PATCH_DIR}" "${NPM_DIR}/node_modules/.npm"; \ + fi; \ + done +``` + +### Patching Strategy + +1. **Create temporary package.json** with safe dependency versions +2. **Install patched versions** into temporary directory using BuildKit cache +3. **Remove vulnerable versions** from runner's embedded Node.js +4. **Copy patched versions** to replace vulnerable dependencies +5. 
**Clean up** temporary files and npm cache + +This approach: +- ✅ Works across all Node.js versions in the runner +- ✅ Doesn't modify runner binary (maintains GitHub signature) +- ✅ Uses BuildKit cache for fast rebuilds +- ✅ Runs during image build, not at runtime +- ✅ No performance impact on runner startup + +## Verification + +### Manual Testing + +After rebuilding images, verify the glob version: + +```bash +# For standard runner +docker run --rm github-runner:latest \ + bash -c 'node -e "console.log(require(\"/actions-runner/externals/node24/lib/node_modules/npm/node_modules/glob/package.json\").version)"' + +# Expected output: 13.0.0 +``` + +### CI/CD Validation + +The fix will be validated by: +1. Docker image build in CI/CD pipeline +2. Security scanning with Trivy +3. Runner self-test workflow +4. GitHub code scanning rescan + +## References + +- **CVE**: [CVE-2025-64756](https://avd.aquasec.com/nvd/cve-2025-64756) +- **GitHub Alert**: [#5660](https://github.com/GrammaTonic/github-runner/security/code-scanning/5660) +- **npm Package**: [glob@13.0.0](https://www.npmjs.com/package/glob) +- **Related Fixes**: + - CVE-2023-52576 (tar vulnerability) + - CVE-2024-27980 (cross-spawn vulnerability) + - CVE-2024-4068 (brace-expansion vulnerability) + +## Timeline + +- **December 18, 2025**: CVE-2025-64756 reported in GitHub code scanning +- **December 18, 2025**: Fix implemented for all Dockerfile variants +- **Next**: CI/CD validation and image rebuild + +## Action Items + +- [x] Add glob patching to standard Dockerfile +- [x] Verify Chrome and Chrome-Go Dockerfiles have glob patching +- [ ] Test Docker image builds +- [ ] Verify security alert resolution +- [ ] Deploy updated images to production + +## Notes + +- The Chrome and Chrome-Go runners already had glob patching from a previous security update +- This fix brings the standard runner in line with the same security posture +- Using glob 13.0.0 (latest) provides maximum protection and future-proofs against 
similar vulnerabilities +- The vulnerability is in a transitive dependency, so direct updates to the runner package are not possible diff --git a/docs/features/GRAFANA_DASHBOARD_METRICS.md b/docs/features/GRAFANA_DASHBOARD_METRICS.md index 3d7f94a..18d666e 100644 --- a/docs/features/GRAFANA_DASHBOARD_METRICS.md +++ b/docs/features/GRAFANA_DASHBOARD_METRICS.md @@ -86,14 +86,13 @@ Implement a lightweight custom metrics endpoint on each GitHub Actions runner (p ### Components #### 1. Custom Metrics Endpoint (Port 9091) - **We Provide** -- **Implementation**: Go service using official Prometheus client library -- **Libraries**: - - `github.com/prometheus/client_golang/prometheus` - - `github.com/prometheus/client_golang/prometheus/promhttp` +- **Implementation**: Lightweight bash script using netcat for HTTP server +- **HTTP Server**: netcat (nc) listening on port 9091 +- **Metrics Generation**: Bash script generating Prometheus text format - **Format**: Prometheus text format (OpenMetrics compatible) -- **Update Frequency**: Real-time (metrics updated on each job event) -- **Location**: Separate Go binary started by `entrypoint.sh` and `entrypoint-chrome.sh` -- **Metrics**: Runner status, job counts, uptime, cache hit rates, job duration histograms +- **Update Frequency**: 30 seconds (metrics collector updates periodically) +- **Location**: Bash scripts started by `entrypoint.sh` and `entrypoint-chrome.sh` +- **Metrics**: Runner status, job counts, uptime, cache hit rates, job duration #### 2. 
Grafana Dashboard JSON - **We Provide** - **File**: `monitoring/grafana/dashboards/github-runner-dashboard.json` @@ -169,208 +168,103 @@ avg(rate(github_runner_job_duration_seconds_sum[5m]) / rate(github_runner_job_du **Tasks:** - [x] Create feature branch - [x] Create feature specification -- [ ] Create Go metrics exporter using Prometheus client library -- [ ] Implement Prometheus metrics (gauges, counters, histograms) -- [ ] Add metrics exporter binary to Docker images -- [ ] Update `docker/entrypoint.sh` to start metrics exporter -- [ ] Update `docker/entrypoint-chrome.sh` to start metrics exporter +- [ ] Create bash metrics server script using netcat +- [ ] Create bash metrics collector script +- [ ] Initialize job log file in entrypoint scripts +- [ ] Update `docker/entrypoint.sh` to start metrics server and collector +- [ ] Update `docker/entrypoint-chrome.sh` to start metrics server and collector - [ ] Expose port 9091 in all Dockerfiles - [ ] Update all Docker Compose files to map port 9091 - [ ] Test metrics endpoint on all runner types **Files to Create:** -- `cmd/metrics-exporter/main.go` - Main metrics exporter application -- `internal/metrics/collector.go` - Metrics collection logic -- `internal/metrics/registry.go` - Prometheus registry setup -- `go.mod` - Go module definition with Prometheus dependencies -- `go.sum` - Go dependency checksums +- `docker/metrics-server.sh` - Netcat-based HTTP server for /metrics endpoint +- `docker/metrics-collector.sh` - Bash script to generate Prometheus metrics **Files to Modify:** - `docker/entrypoint.sh` - `docker/entrypoint-chrome.sh` -- `docker/Dockerfile` (add Go binary and `EXPOSE 9091`) -- `docker/Dockerfile.chrome` (add Go binary and `EXPOSE 9091`) -- `docker/Dockerfile.chrome-go` (add Go binary and `EXPOSE 9091`) +- `docker/Dockerfile` (add `EXPOSE 9091`) +- `docker/Dockerfile.chrome` (add `EXPOSE 9091`) +- `docker/Dockerfile.chrome-go` (add `EXPOSE 9091`) - `docker/docker-compose.production.yml` (add port 
mapping) - `docker/docker-compose.chrome.yml` (add port mapping) - `docker/docker-compose.chrome-go.yml` (add port mapping) **Implementation:** -**1. Create Go Metrics Exporter (`cmd/metrics-exporter/main.go`):** - -```go -package main - -import ( - "log" - "net/http" - "os" - "time" - - "github.com/prometheus/client_golang/prometheus" - "github.com/prometheus/client_golang/prometheus/promhttp" -) - -var ( - runnerName = os.Getenv("RUNNER_NAME") - runnerType = getEnvOrDefault("RUNNER_TYPE", "standard") - runnerVersion = "2.329.0" - - // Gauges - runnerStatus = prometheus.NewGaugeVec( - prometheus.GaugeOpts{ - Name: "github_runner_status", - Help: "Runner online status (1=online, 0=offline)", - }, - []string{"runner_name", "runner_type"}, - ) - - runnerUptime = prometheus.NewGaugeVec( - prometheus.GaugeOpts{ - Name: "github_runner_uptime_seconds", - Help: "Runner uptime in seconds", - }, - []string{"runner_name", "runner_type"}, - ) - - runnerInfo = prometheus.NewGaugeVec( - prometheus.GaugeOpts{ - Name: "github_runner_info", - Help: "Runner metadata", - }, - []string{"runner_name", "runner_type", "version"}, - ) - - // Counters - jobsTotal = prometheus.NewCounterVec( - prometheus.CounterOpts{ - Name: "github_runner_jobs_total", - Help: "Total jobs executed by status", - }, - []string{"runner_name", "runner_type", "status"}, - ) - - // Histograms - jobDuration = prometheus.NewHistogramVec( - prometheus.HistogramOpts{ - Name: "github_runner_job_duration_seconds", - Help: "Job duration in seconds", - Buckets: prometheus.ExponentialBuckets(10, 2, 10), // 10s to ~2.8h - }, - []string{"runner_name", "runner_type", "status"}, - ) - - cacheHitRate = prometheus.NewGaugeVec( - prometheus.GaugeOpts{ - Name: "github_runner_cache_hit_rate", - Help: "Cache hit rate (0.0 to 1.0)", - }, - []string{"runner_name", "runner_type", "cache_type"}, - ) -) - -func init() { - // Register metrics - prometheus.MustRegister(runnerStatus) - prometheus.MustRegister(runnerUptime) - 
prometheus.MustRegister(runnerInfo) - prometheus.MustRegister(jobsTotal) - prometheus.MustRegister(jobDuration) - prometheus.MustRegister(cacheHitRate) -} - -func main() { - log.Printf("Starting metrics exporter for runner: %s (type: %s)", runnerName, runnerType) - - // Set initial status - runnerStatus.WithLabelValues(runnerName, runnerType).Set(1) - runnerInfo.WithLabelValues(runnerName, runnerType, runnerVersion).Set(1) - - // Start metrics updater - go updateMetrics() - - // Start HTTP server - http.Handle("/metrics", promhttp.Handler()) - http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) { - w.WriteHeader(http.StatusOK) - w.Write([]byte("OK")) - }) - - log.Printf("Metrics endpoint listening on :9091") - if err := http.ListenAndServe(":9091", nil); err != nil { - log.Fatalf("Failed to start metrics server: %v", err) - } -} - -func updateMetrics() { - startTime := time.Now() - ticker := time.NewTicker(5 * time.Second) - defer ticker.Stop() - - for range ticker.C { - // Update uptime - uptime := time.Since(startTime).Seconds() - runnerUptime.WithLabelValues(runnerName, runnerType).Set(uptime) - - // TODO: Add logic to read job logs and update job metrics - // This would integrate with the runner's job execution logs - } -} - -func getEnvOrDefault(key, defaultValue string) string { - if value := os.Getenv(key); value != "" { - return value - } - return defaultValue -} -``` - -**2. Create `go.mod`:** - -```go -module github.com/grammatonic/github-runner/metrics-exporter +**1. 
Create Bash Metrics Server (`docker/metrics-server.sh`):** -go 1.22 +```bash +#!/bin/bash +# Metrics HTTP server using netcat +# Serves /tmp/runner_metrics.prom on port 9091 -require ( - github.com/prometheus/client_golang v1.19.0 -) +METRICS_FILE="/tmp/runner_metrics.prom" +PORT="${METRICS_PORT:-9091}" -require ( - github.com/beorn7/perks v1.0.1 // indirect - github.com/cespare/xxhash/v2 v2.2.0 // indirect - github.com/prometheus/client_model v0.5.0 // indirect - github.com/prometheus/common v0.48.0 // indirect - github.com/prometheus/procfs v0.12.0 // indirect - golang.org/x/sys v0.16.0 // indirect - google.golang.org/protobuf v1.32.0 // indirect -) +while true; do + # Snapshot the metrics file, then block until a client connects + RESPONSE=$(echo -e "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\n\r\n$(cat "$METRICS_FILE" 2>/dev/null || echo '# No metrics available')" | nc -l -p "$PORT" -q 1) +done ``` -**3. Update Dockerfile (add multi-stage build for Go binary):** +**2. Create Bash Metrics Collector (`docker/metrics-collector.sh`):** -```dockerfile -# Stage 1: Build metrics exporter -FROM golang:1.22-alpine AS metrics-builder - -WORKDIR /build -COPY go.mod go.sum ./ -RUN go mod download +```bash +#!/bin/bash +# Metrics collector - generates Prometheus metrics every 30 seconds + +METRICS_FILE="/tmp/runner_metrics.prom" +JOBS_LOG="/tmp/jobs.log" +START_TIME=$(date +%s) +RUNNER_NAME="${RUNNER_NAME:-$(hostname)}" +RUNNER_TYPE="${RUNNER_TYPE:-standard}" +RUNNER_VERSION="2.330.0" + +while true; do + # Calculate uptime + CURRENT_TIME=$(date +%s) + UPTIME=$((CURRENT_TIME - START_TIME)) + + # Count jobs from log + # grep -c already prints 0 (and exits non-zero) when nothing matches, so "|| echo 0" would emit a second 0 + JOBS_TOTAL=$(wc -l < "$JOBS_LOG" 2>/dev/null || echo 0) + JOBS_SUCCESS=$(grep -c "success" "$JOBS_LOG" 2>/dev/null || true); JOBS_SUCCESS="${JOBS_SUCCESS:-0}" + JOBS_FAILED=$(grep -c "failed" "$JOBS_LOG" 2>/dev/null || true); JOBS_FAILED="${JOBS_FAILED:-0}" + + # Generate Prometheus metrics + cat > "$METRICS_FILE" </dev/null || true" EXIT +# Trap to cleanup on exit +trap "kill $COLLECTOR_PID $SERVER_PID 2>/dev/null || true" EXIT # Continue with
normal runner startup... ``` @@ -643,29 +542,29 @@ scrape_configs: ### Risk 1: Port 9091 Conflicts **Mitigation**: Document port requirements, make port configurable via environment variable -### Risk 2: Go Binary Size -**Mitigation**: Multi-stage Docker build, static compilation with CGO_ENABLED=0 +### Risk 2: Netcat Performance +**Mitigation**: Simple HTTP response, pre-generated metrics file, minimal overhead ### Risk 3: Metric Format Compatibility -**Mitigation**: Use official Prometheus client library (guaranteed compatibility) +**Mitigation**: Use standard Prometheus text format specification, test with actual Prometheus -### Risk 4: Dependency Management -**Mitigation**: Pin Go module versions, use go.sum for reproducible builds +### Risk 4: Bash Script Reliability +**Mitigation**: Error handling with set -euo pipefail, process supervision, container restart policies --- ## 📖 References -- [Prometheus Go Client Library](https://github.com/prometheus/client_golang) - [Prometheus Exposition Formats](https://prometheus.io/docs/instrumenting/exposition_formats/) - [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/) - [Grafana Dashboard JSON Model](https://grafana.com/docs/grafana/latest/dashboards/json-model/) - [DORA Metrics](https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance) - [OpenMetrics Specification](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md) +- [Netcat Usage Guide](https://www.computerhope.com/unix/nc.htm) --- -**Last Updated:** 2025-11-16 +**Last Updated:** 2025-12-18 **Scope:** Metrics Endpoint + Grafana Dashboard ONLY **Status:** 🚧 Phase 1 - In Progress **Completion:** 10% diff --git a/plan/feature-prometheus-monitoring-1.md b/plan/feature-prometheus-monitoring-1.md index bed6378..72fd925 100644 --- a/plan/feature-prometheus-monitoring-1.md +++ b/plan/feature-prometheus-monitoring-1.md @@ -51,7 +51,7 @@ This implementation plan 
provides a fully executable roadmap for adding Promethe ### Constraints -- **CON-001**: Must use bash scripting (no additional language runtimes like Python/Node.js) +- **CON-001**: Must use bash scripting (no additional language runtimes like Python/Node.js/Go) - **CON-002**: Must use netcat (nc) for HTTP server (lightweight, already available in base image) - **CON-003**: Cannot modify GitHub Actions runner binary or core functionality - **CON-004**: Must maintain compatibility with existing Docker Compose configurations
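
Constraints CON-001 and CON-002 mean the Prometheus text has to be produced by plain bash. The following is a minimal sketch of how a collector could render the `github_runner_jobs_total` counter from a job log. It assumes, hypothetically, that the log holds one job per line and that each line contains the word `success` or `failed`. The helper names `count_matches` and `render_metrics` are illustrative and do not come from the repository.

```shell
#!/usr/bin/env bash
# Sketch only: the function names and the job-log format are assumptions.
set -euo pipefail

# count_matches PATTERN FILE
# Prints the number of lines in FILE that match PATTERN (0 if none, or if FILE is missing).
count_matches() {
  local n
  # grep -c prints 0 but exits 1 when nothing matches, so swallow the exit status
  n=$(grep -c -- "$1" "$2" 2>/dev/null || true)
  # A missing file leaves n empty; fall back to 0 so the metric value stays numeric
  printf '%s\n' "${n:-0}"
}

# render_metrics JOBS_LOG RUNNER_NAME RUNNER_TYPE
# Writes the job counters in Prometheus text exposition format to stdout.
render_metrics() {
  local log=$1 name=$2 rtype=$3
  local ok failed
  ok=$(count_matches success "$log")
  failed=$(count_matches failed "$log")
  cat <<EOF
# HELP github_runner_jobs_total Total jobs executed by status
# TYPE github_runner_jobs_total counter
github_runner_jobs_total{runner_name="$name",runner_type="$rtype",status="success"} $ok
github_runner_jobs_total{runner_name="$name",runner_type="$rtype",status="failed"} $failed
EOF
}
```

One possible way to use it is to redirect `render_metrics` into a temporary file and then `mv` that file over `/tmp/runner_metrics.prom`, so the netcat server is unlikely to serve a half-written file.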