feat: Implement Node Agent for distributed monitoring #5

@fregataa

Summary

Implement a Node Agent that runs on monitored hosts to collect metrics, execute health checks, and report status to the central config-server. This enables distributed monitoring without requiring direct network access to all targets.

Background

Currently, the monitoring system relies on Prometheus directly scraping targets. A Node Agent provides:

  • Push-based metrics: Targets behind firewalls can push data out
  • Local script execution: Run health checks locally on the target
  • Reduced network complexity: Only agent→server communication needed
  • Edge computing: Process/aggregate data before sending

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Central Infrastructure                    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Config    │◄───│ Config-     │◄───│    PostgreSQL       │  │
│  │   Server    │    │ Server UI   │    │    (metadata)       │  │
│  └──────┬──────┘    └─────────────┘    └─────────────────────┘  │
│         │                                                        │
│         │ REST (gRPC extensible)                                │
└─────────┼───────────────────────────────────────────────────────┘
          │
          │ (Outbound from agents)
          │
┌─────────┼───────────────────────────────────────────────────────┐
│         ▼              Monitored Hosts                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│  │ Node Agent  │    │ Node Agent  │    │ Node Agent  │         │
│  │  (host-1)   │    │  (host-2)   │    │  (host-3)   │         │
│  └─────────────┘    └─────────────┘    └─────────────┘         │
│        │                  │                  │                  │
│        ▼                  ▼                  ▼                  │
│   Local checks       Local checks       Local checks            │
│   - Scripts          - Scripts          - Scripts               │
│   - Exporters        - Exporters        - Exporters             │
└─────────────────────────────────────────────────────────────────┘

Core Features

1. Registration & Discovery

  • Agent registers with config-server on startup
  • Agent-side UUID: Agent generates and persists its own ID at installation
  • Receives target configuration (which checks to run)
  • Periodic heartbeat so the server can track agent liveness
  • Auto-reconnect on connection loss
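
The heartbeat and auto-reconnect behavior above could be sketched roughly as follows. This is only an illustration: `runHeartbeat`, `nextBackoff`, and the backoff policy are hypothetical names and choices, not part of the existing codebase, and `send` stands in for whatever call ends up hitting the heartbeat endpoint.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nextBackoff doubles the current delay and caps it at max.
func nextBackoff(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next > max {
		return max
	}
	return next
}

// runHeartbeat calls send once per interval until ctx is cancelled.
// Each call gets its own timeout. After a failure the delay grows
// exponentially (assumed policy) and resets once send succeeds again.
func runHeartbeat(ctx context.Context, interval, timeout, maxBackoff time.Duration,
	send func(context.Context) error) {
	delay := interval
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
		callCtx, cancel := context.WithTimeout(ctx, timeout)
		err := send(callCtx)
		cancel()
		if err != nil {
			delay = nextBackoff(delay, maxBackoff)
			continue
		}
		delay = interval
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	calls := 0
	send := func(context.Context) error {
		calls++
		if calls == 1 {
			return errors.New("server unreachable") // simulate one failed heartbeat
		}
		return nil
	}
	runHeartbeat(ctx, 10*time.Millisecond, 5*time.Millisecond, 40*time.Millisecond, send)
	fmt.Println("heartbeats attempted:", calls)
}
```

In the real agent the interval and timeout would presumably come from the `heartbeat` section of the agent configuration.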

2. Configuration Sync

  • Pull assigned script policies from config-server
  • Watch for configuration changes
  • Hot-reload without restart
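
One possible way to hot-reload without a restart is to fingerprint each fetched policy payload and swap the active snapshot only when the content changes. The sketch below assumes Go 1.19+ (for `atomic.Pointer`); `policySet` and `applyIfChanged` are hypothetical names, and how the payload is fetched or watched is left open.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync/atomic"
)

// policySet is a hypothetical snapshot of the policies assigned to this agent.
type policySet struct {
	Fingerprint string
	Raw         []byte
}

// current holds the active snapshot; readers can load it without locking.
var current atomic.Pointer[policySet]

// fingerprint returns the hex-encoded SHA-256 digest of the payload.
func fingerprint(raw []byte) string {
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

// applyIfChanged swaps in the new payload only when its content differs
// from the active snapshot, and reports whether a reload happened.
func applyIfChanged(raw []byte) bool {
	fp := fingerprint(raw)
	if old := current.Load(); old != nil && old.Fingerprint == fp {
		return false
	}
	current.Store(&policySet{Fingerprint: fp, Raw: raw})
	return true
}

func main() {
	fmt.Println(applyIfChanged([]byte(`{"checks":["disk-space"]}`)))       // first load
	fmt.Println(applyIfChanged([]byte(`{"checks":["disk-space"]}`)))       // unchanged payload
	fmt.Println(applyIfChanged([]byte(`{"checks":["disk-space","ntp"]}`))) // changed payload
}
```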

3. Check Execution

  • Execute assigned health check scripts
  • Collect script output and exit codes
  • Respect check intervals and timeouts
  • Report results to config-server

4. Metrics Collection (Future)

  • Scrape local exporters
  • Aggregate and push to central location
  • Remote write to Prometheus/VictoriaMetrics
  • Note: Deferred to future phase

Technical Requirements

Communication Protocol

REST (Selected) - with abstraction layer for future gRPC support

POST   /api/v1/bootstrap-tokens/register   # Extend existing API
POST   /api/v1/agents/{id}/heartbeat
POST   /api/v1/agents/{id}/check-results
GET    /api/v1/checks/target/{id}          # Use existing API

The agent core should depend on an API handler abstraction rather than on REST directly, so that a gRPC implementation can be added later without changing the core logic.
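
A minimal sketch of what such an abstraction might look like, assuming Go 1.18+: the agent core talks to a `Transport` interface and the REST client is one implementation of it. `Transport`, `restTransport`, and the empty heartbeat body are assumptions for illustration; only the URL paths come from the endpoint list above.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// Transport is a hypothetical protocol-neutral interface for the agent core.
// A gRPC client could satisfy it later without touching the callers.
type Transport interface {
	Heartbeat(ctx context.Context, agentID string) error
	ReportResults(ctx context.Context, agentID string, results any) error
}

// restTransport implements Transport over the REST endpoints listed above.
type restTransport struct {
	baseURL string
	client  *http.Client
}

// url builds the per-agent endpoint URL for the given action.
func (rt *restTransport) url(agentID, action string) string {
	return fmt.Sprintf("%s/api/v1/agents/%s/%s", rt.baseURL, agentID, action)
}

// post sends body as JSON and treats any non-2xx response as an error.
func (rt *restTransport) post(ctx context.Context, url string, body any) error {
	buf, err := json.Marshal(body)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(buf))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := rt.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("POST %s: unexpected status %s", url, resp.Status)
	}
	return nil
}

func (rt *restTransport) Heartbeat(ctx context.Context, agentID string) error {
	// Assumption: the heartbeat body is an empty JSON object.
	return rt.post(ctx, rt.url(agentID, "heartbeat"), struct{}{})
}

func (rt *restTransport) ReportResults(ctx context.Context, agentID string, results any) error {
	return rt.post(ctx, rt.url(agentID, "check-results"), results)
}

// Compile-time check that the REST client satisfies the interface.
var _ Transport = (*restTransport)(nil)

func main() {
	rt := &restTransport{baseURL: "https://config-server.example.com:8443", client: http.DefaultClient}
	fmt.Println(rt.url("agent-123", "heartbeat"))
}
```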

Agent ID & Auto-matching Strategy

Agent-side UUID approach adopted

A UUID is generated when the agent is installed and stored in /etc/aami/agent-id. The agent submits this ID to the server during registration.

Installation Script

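#!/bin/bash
# install-agent.sh
set -euo pipefail

AGENT_ID_FILE="/etc/aami/agent-id"

# Generate the agent ID only once so it survives reinstalls.
# -s also covers an empty file left behind by a failed earlier run.
if [ ! -s "$AGENT_ID_FILE" ]; then
    mkdir -p "$(dirname "$AGENT_ID_FILE")"
    uuidgen > "$AGENT_ID_FILE"
fi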

Cloud-init (AWS/GCP/Azure)

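#cloud-config
# agent_uuid is injected from IaC. Keep comments out of the content block:
# everything under "content: |" is written to the file verbatim.
write_files:
  - path: /etc/aami/agent-id
    permissions: '0644'
    content: |
      ${agent_uuid}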

Terraform Example

resource "random_uuid" "agent_id" {}

resource "aws_instance" "monitored" {
  # ami, instance_type, and other required arguments omitted for brevity
  user_data = templatefile("cloud-init.yaml", {
    agent_uuid = random_uuid.agent_id.result
  })
}

# Pre-register in config-server (optional)
resource "aami_target" "this" {
  id       = random_uuid.agent_id.result
  hostname = "web-server"  # count.index is only valid when the resource sets count
  group_id = aami_group.web.id
}

Matching Behavior

| Scenario                       | AgentID  | Behavior                         |
|--------------------------------|----------|----------------------------------|
| New registration (ID provided) | Provided | Create Target with AgentID       |
| New registration (no ID)       | Empty    | Server generates UUID            |
| Re-registration/restart        | Provided | Connect to existing Target       |
| Pre-registered                 | Provided | Connect Agent to existing Target |
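
The matching rules above could be expressed on the server side along these lines. The `registry` type is a hypothetical in-memory stand-in for the config-server's target store, and `newID` is an injected placeholder for server-side UUID generation.

```go
package main

import "fmt"

// target is a simplified, hypothetical view of a monitored host record.
type target struct {
	ID        string
	Hostname  string
	Connected bool
}

// registry stands in for the config-server's target store.
type registry struct {
	targets map[string]*target
	newID   func() string // server-side ID generator, injected for testability
}

// register applies the matching table: generate an ID when none is given,
// connect to the existing target when the ID is known (restart or
// pre-registration), and create a new target otherwise.
func (r *registry) register(agentID, hostname string) (tgt *target, created bool) {
	if agentID == "" {
		agentID = r.newID()
	}
	if existing, ok := r.targets[agentID]; ok {
		existing.Connected = true
		return existing, false
	}
	tgt = &target{ID: agentID, Hostname: hostname, Connected: true}
	r.targets[agentID] = tgt
	return tgt, true
}

func main() {
	reg := &registry{
		targets: map[string]*target{},
		newID:   func() string { return "server-generated-id" },
	}
	const id = "1b4e28ba-2fa1-11d2-883f-0016d3cca427"

	_, created := reg.register(id, "host-1")
	fmt.Println("new registration creates target:", created)

	_, created = reg.register(id, "host-1")
	fmt.Println("restart creates target:", created)

	tgt, created := reg.register("", "host-2")
	fmt.Println("empty ID:", tgt.ID, created)
}
```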

Agent Configuration

# /etc/aami/agent.yaml
server:
  address: "config-server.example.com:8443"
  tls:
    enabled: true
    ca_cert: "/etc/aami/ca.crt"
    client_cert: "/etc/aami/agent.crt"
    client_key: "/etc/aami/agent.key"

agent:
  id_file: "/etc/aami/agent-id"  # Agent-side UUID file path
  hostname: ""  # Auto-detected if empty
  labels:
    environment: "production"
    datacenter: "dc1"

heartbeat:
  interval: 30s
  timeout: 10s

checks:
  default_timeout: 30s
  max_concurrent: 10
  result_buffer_size: 1000

logging:
  level: "info"
  format: "json"
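
A sketch of how the agent might hold and sanity-check a subset of these settings after loading. YAML decoding is left out because it needs a third-party package. Treating the sample values as defaults, and requiring the heartbeat timeout to be shorter than the interval, are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors a subset of /etc/aami/agent.yaml (hypothetical field names).
type Config struct {
	ServerAddress       string
	IDFile              string
	HeartbeatInterval   time.Duration
	HeartbeatTimeout    time.Duration
	CheckDefaultTimeout time.Duration
	MaxConcurrent       int
	ResultBufferSize    int
}

// applyDefaults fills unset fields with the sample values shown above.
func (c *Config) applyDefaults() {
	if c.IDFile == "" {
		c.IDFile = "/etc/aami/agent-id"
	}
	if c.HeartbeatInterval == 0 {
		c.HeartbeatInterval = 30 * time.Second
	}
	if c.HeartbeatTimeout == 0 {
		c.HeartbeatTimeout = 10 * time.Second
	}
	if c.CheckDefaultTimeout == 0 {
		c.CheckDefaultTimeout = 30 * time.Second
	}
	if c.MaxConcurrent == 0 {
		c.MaxConcurrent = 10
	}
	if c.ResultBufferSize == 0 {
		c.ResultBufferSize = 1000
	}
}

// validate rejects settings the agent could not run with.
func (c *Config) validate() error {
	if c.ServerAddress == "" {
		return errors.New("server.address is required")
	}
	if c.HeartbeatTimeout >= c.HeartbeatInterval {
		return errors.New("heartbeat.timeout must be shorter than heartbeat.interval")
	}
	if c.MaxConcurrent < 1 {
		return errors.New("checks.max_concurrent must be at least 1")
	}
	return nil
}

func main() {
	cfg := Config{ServerAddress: "config-server.example.com:8443"}
	cfg.applyDefaults()
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```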

Security

  • mTLS for agent-server communication
  • Agent authentication via client certificates or bootstrap tokens
  • Script execution sandboxing (resource limits, allowed paths)
  • Signed script policies to prevent tampering

API Changes Required

Bootstrap Register Extension

Current:

type BootstrapRegister struct {
    Token     string
    Hostname  string
    IPAddress string
    GroupID   string
    Labels    map[string]string
    Metadata  map[string]string
}

Updated:

type BootstrapRegister struct {
    Token     string
    AgentID   string            // New: Agent-submitted ID (optional)
    Hostname  string
    IPAddress string
    GroupID   string
    Labels    map[string]string
    Metadata  map[string]string
}

Check Results Reporting (New)

POST /api/v1/agents/{id}/check-results
{
  "results": [
    {
      "check_id": "disk-space",
      "status": "critical",
      "exit_code": 2,
      "output": "Disk usage 95%",
      "duration_ms": 150,
      "executed_at": "2024-01-15T10:30:00Z"
    }
  ]
}

Tasks

Phase 1: Core Agent

  • Create services/node-agent directory structure
  • Implement agent configuration loading
  • Implement Agent-side UUID generation and persistence
  • Implement server connection (REST with abstraction layer)
  • Implement registration flow (extend existing bootstrap API)
  • Implement heartbeat mechanism
  • Add graceful shutdown

Phase 2: Check Execution

  • Implement check scheduler
  • Implement script executor with timeout
  • Implement result buffering and batching
  • Add resource limits for script execution

Phase 3: Config-Server Integration

  • Extend bootstrap register to accept AgentID
  • Add check results ingestion endpoint
  • Implement agent reconnection logic
  • Add agent status to UI

Phase 4: Security & Production

  • Implement mTLS support
  • Add agent authentication
  • Add script signing/verification
  • Create systemd service file
  • Create installation script (with UUID generation)
  • Write documentation

Dependencies

Decisions Made

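| Item          | Decision        | Notes                                                              |
|---------------|-----------------|--------------------------------------------------------------------|
| Protocol      | REST            | Design with API handler abstraction for gRPC extensibility         |
| Metrics push  | Deferred        | Excluded from Phase 1; to be split out into a future issue         |
| Auto-matching | Agent-side UUID | UUID generated at installation; compatible with cloud environments |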

Related Work
