Summary
Implement a Node Agent that runs on monitored hosts to collect metrics, execute health checks, and report status to the central config-server. This enables distributed monitoring without requiring direct network access to all targets.
Background
Currently, the monitoring system relies on Prometheus directly scraping targets. A Node Agent provides:
- Push-based metrics: Targets behind firewalls can push data out
- Local script execution: Run health checks locally on the target
- Reduced network complexity: Only agent→server communication needed
- Edge computing: Process/aggregate data before sending
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                     Central Infrastructure                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Config    │◄───│   Config-   │◄───│     PostgreSQL      │  │
│  │   Server    │    │  Server UI  │    │     (metadata)      │  │
│  └──────┬──────┘    └─────────────┘    └─────────────────────┘  │
│         │                                                       │
│         │ REST (gRPC extensible)                                │
└─────────┼───────────────────────────────────────────────────────┘
          │
          │ (Outbound from agents)
          │
┌─────────┼───────────────────────────────────────────────────────┐
│         ▼              Monitored Hosts                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │ Node Agent  │    │ Node Agent  │    │ Node Agent  │          │
│  │  (host-1)   │    │  (host-2)   │    │  (host-3)   │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│         │                  │                  │                 │
│         ▼                  ▼                  ▼                 │
│   Local checks       Local checks       Local checks            │
│   - Scripts          - Scripts          - Scripts               │
│   - Exporters        - Exporters        - Exporters             │
└─────────────────────────────────────────────────────────────────┘
Core Features
1. Registration & Discovery
- Agent registers with config-server on startup
- Agent-side UUID: Agent generates and persists its own ID at installation
- Receives target configuration (which checks to run)
- Heartbeat to maintain connection
- Auto-reconnect on connection loss
2. Configuration Sync
- Pull assigned script policies from config-server
- Watch for configuration changes
- Hot-reload without restart
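One way to detect configuration changes without a restart is to compare a digest of each pulled policy payload with the digest of the last applied one. This is a sketch under that assumption; `policyWatcher` is a hypothetical helper and the issue does not prescribe a change-detection mechanism.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// policyWatcher remembers the digest of the last applied policy payload.
// Hypothetical type, used only to illustrate change detection.
type policyWatcher struct {
	lastDigest string
}

// changed reports whether payload differs from the last payload seen,
// and records the new digest when it does.
func (w *policyWatcher) changed(payload []byte) bool {
	sum := sha256.Sum256(payload)
	digest := hex.EncodeToString(sum[:])
	if digest == w.lastDigest {
		return false
	}
	w.lastDigest = digest
	return true
}

func main() {
	w := &policyWatcher{}
	for _, payload := range []string{`{"interval":"30s"}`, `{"interval":"30s"}`, `{"interval":"60s"}`} {
		if w.changed([]byte(payload)) {
			fmt.Println("reload:", payload)
		} else {
			fmt.Println("unchanged:", payload)
		}
	}
}
```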
3. Check Execution
- Execute assigned health check scripts
- Collect script output and exit codes
- Respect check intervals and timeouts
- Report results to config-server
4. Metrics Collection (Future)
- Scrape local exporters
- Aggregate and push to central location
- Remote write to Prometheus/VictoriaMetrics
- Note: Deferred to future phase
Technical Requirements
Communication Protocol
REST (Selected) - with abstraction layer for future gRPC support
POST /api/v1/bootstrap-tokens/register # Extend existing API
POST /api/v1/agents/{id}/heartbeat
POST /api/v1/agents/{id}/check-results
GET /api/v1/checks/target/{id} # Use existing API
Design with API Handler abstraction to allow easy addition of gRPC protocol in the future.
Agent ID & Auto-matching Strategy
Agent-side UUID approach adopted
Generate UUID at agent installation and store in /etc/aami/agent-id. Submit this ID to server during registration.
Installation Script
#!/bin/bash
# install-agent.sh
AGENT_ID_FILE="/etc/aami/agent-id"
if [ ! -f "$AGENT_ID_FILE" ]; then
  mkdir -p /etc/aami
  uuidgen > "$AGENT_ID_FILE"
fi
Cloud-init (AWS/GCP/Azure)
#cloud-config
# agent_uuid is injected from IaC
write_files:
  - path: /etc/aami/agent-id
    permissions: '0644'
    content: |
      ${agent_uuid}
Terraform Example
resource "random_uuid" "agent_id" {}

resource "aws_instance" "monitored" {
  user_data = templatefile("cloud-init.yaml", {
    agent_uuid = random_uuid.agent_id.result
  })
}

# Pre-register in config-server (optional)
resource "aami_target" "this" {
  id       = random_uuid.agent_id.result
  hostname = "web-server-${count.index}"
  group_id = aami_group.web.id
}
Matching Behavior
| Scenario | AgentID | Behavior |
|---|---|---|
| New registration (ID provided) | Provided | Create Target with AgentID |
| New registration (no ID) | Empty | Server generates UUID |
| Re-registration/restart | Provided | Connect to existing Target |
| Pre-registered | Provided | Connect Agent to existing Target |
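The table above can be read as a single lookup-or-create rule on the server side. The sketch below models it with an in-memory map; `registry` and `register` are hypothetical names, and real storage would be the config-server's database.

```go
package main

import "fmt"

// registry is a hypothetical in-memory stand-in for the Target table,
// mapping AgentID to hostname.
type registry struct {
	targets map[string]string
}

// register applies the matching rules: an empty AgentID gets a
// server-generated ID, a known ID connects to the existing Target,
// and an unknown ID creates a new Target.
func (r *registry) register(agentID, hostname string, newID func() string) (id string, created bool) {
	if agentID == "" {
		agentID = newID()
	}
	if _, ok := r.targets[agentID]; ok {
		return agentID, false // re-registration or pre-registered Target
	}
	r.targets[agentID] = hostname
	return agentID, true
}

func main() {
	r := &registry{targets: map[string]string{"pre-registered-id": "web-server-0"}}
	serverID := func() string { return "server-generated-id" } // placeholder generator

	fmt.Println(r.register("agent-a", "host-1", serverID))
	fmt.Println(r.register("", "host-2", serverID))
	fmt.Println(r.register("agent-a", "host-1", serverID))
	fmt.Println(r.register("pre-registered-id", "web-server-0", serverID))
}
```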
Agent Configuration
# /etc/aami/agent.yaml
server:
  address: "config-server.example.com:8443"
  tls:
    enabled: true
    ca_cert: "/etc/aami/ca.crt"
    client_cert: "/etc/aami/agent.crt"
    client_key: "/etc/aami/agent.key"
agent:
  id_file: "/etc/aami/agent-id"  # Agent-side UUID file path
  hostname: ""                   # Auto-detected if empty
  labels:
    environment: "production"
    datacenter: "dc1"
heartbeat:
  interval: 30s
  timeout: 10s
checks:
  default_timeout: 30s
  max_concurrent: 10
  result_buffer_size: 1000
logging:
  level: "info"
  format: "json"
Security
- mTLS for agent-server communication
- Agent authentication via client certificates or bootstrap tokens
- Script execution sandboxing (resource limits, allowed paths)
- Signed script policies to prevent tampering
API Changes Required
Bootstrap Register Extension
Current:
type BootstrapRegister struct {
	Token     string
	Hostname  string
	IPAddress string
	GroupID   string
	Labels    map[string]string
	Metadata  map[string]string
}
Updated:
type BootstrapRegister struct {
	Token     string
	AgentID   string // New: Agent-submitted ID (optional)
	Hostname  string
	IPAddress string
	GroupID   string
	Labels    map[string]string
	Metadata  map[string]string
}
Check Results Reporting (New)
POST /api/v1/agents/{id}/check-results
{
  "results": [
    {
      "check_id": "disk-space",
      "status": "critical",
      "exit_code": 2,
      "output": "Disk usage 95%",
      "duration_ms": 150,
      "executed_at": "2024-01-15T10:30:00Z"
    }
  ]
}
Tasks
Phase 1: Core Agent
- Create services/node-agent directory structure
- Implement agent configuration loading
- Implement Agent-side UUID generation and persistence
- Implement server connection (REST with abstraction layer)
- Implement registration flow (extend existing bootstrap API)
- Implement heartbeat mechanism
- Add graceful shutdown
Phase 2: Check Execution
- Implement check scheduler
- Implement script executor with timeout
- Implement result buffering and batching
- Add resource limits for script execution
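Result buffering and batching could be built around a bounded queue that is drained in chunks when the agent reports to the server. The sketch below is one possible shape: `resultBuffer` is a hypothetical type, results are plain strings for brevity, and dropping the oldest entry when the buffer is full is an assumed policy that the issue does not decide.

```go
package main

import (
	"fmt"
	"sync"
)

// resultBuffer is a bounded FIFO of pending results (hypothetical type).
type resultBuffer struct {
	mu       sync.Mutex
	items    []string
	capacity int
}

// newResultBuffer enforces a capacity of at least 1.
func newResultBuffer(capacity int) *resultBuffer {
	if capacity < 1 {
		capacity = 1
	}
	return &resultBuffer{capacity: capacity}
}

// add appends a result; when full, the oldest entry is dropped
// (assumed policy) and add returns true.
func (b *resultBuffer) add(item string) (dropped bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.items) >= b.capacity {
		b.items = b.items[1:]
		dropped = true
	}
	b.items = append(b.items, item)
	return dropped
}

// drain removes and returns up to max results, oldest first.
func (b *resultBuffer) drain(max int) []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := max
	if n > len(b.items) {
		n = len(b.items)
	}
	batch := make([]string, n)
	copy(batch, b.items[:n])
	b.items = b.items[n:]
	return batch
}

func main() {
	buf := newResultBuffer(3)
	for _, r := range []string{"a", "b", "c", "d"} {
		buf.add(r)
	}
	fmt.Println(buf.drain(2))
	fmt.Println(buf.drain(2))
}
```

In the agent, the capacity would presumably come from `result_buffer_size` in the configuration.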
Phase 3: Config-Server Integration
- Extend bootstrap register to accept AgentID
- Add check results ingestion endpoint
- Implement agent reconnection logic
- Add agent status to UI
Phase 4: Security & Production
- Implement mTLS support
- Add agent authentication
- Add script signing/verification
- Create systemd service file
- Create installation script (with UUID generation)
- Write documentation
Dependencies
- Job Manager for handling async operations - Completed (feat(config-server): Implement generic async Job Manager #3)
- Async API endpoints for long-running config operations - Completed
- Check results storage schema (new)
Decisions Made
| Item | Decision | Notes |
|---|---|---|
| Protocol | REST | Design with API handler abstraction for gRPC extensibility |
| Metrics push | Deferred | Excluded from Phase 1; to be tracked in a separate future issue |
| Auto-matching | Agent-side UUID | Generate UUID at installation, Cloud environment compatible |
Related Work
- feat(config-server): Implement generic async Job Manager #3 - Job Manager Implementation (Completed)
- feat(config-server): implement Redis-based JobStore with distributed job claiming #13 - Redis-based JobStore Extension (Planned)