feat: Implement Node Agent for distributed monitoring #5

## Summary

Implement a Node Agent that runs on monitored hosts to collect metrics, execute health checks, and report status to the central config-server. This enables distributed monitoring without requiring direct network access to all targets.

## Background

Currently, the monitoring system relies on Prometheus directly scraping targets. A Node Agent provides:
- **Push-based metrics**: Targets behind firewalls can push data out
- **Local script execution**: Run health checks locally on the target
- **Reduced network complexity**: Only agent→server communication needed
- **Edge computing**: Process/aggregate data before sending

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        Central Infrastructure                    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Config    │◄───│ Config-     │◄───│    PostgreSQL       │  │
│  │   Server    │    │ Server UI   │    │    (metadata)       │  │
│  └──────┬──────┘    └─────────────┘    └─────────────────────┘  │
│         │                                                        │
│         │ REST (gRPC extensible)                                │
└─────────┼───────────────────────────────────────────────────────┘
          │
          │ (Outbound from agents)
          │
┌─────────┼───────────────────────────────────────────────────────┐
│         ▼              Monitored Hosts                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│  │ Node Agent  │    │ Node Agent  │    │ Node Agent  │         │
│  │  (host-1)   │    │  (host-2)   │    │  (host-3)   │         │
│  └─────────────┘    └─────────────┘    └─────────────┘         │
│        │                  │                  │                  │
│        ▼                  ▼                  ▼                  │
│   Local checks       Local checks       Local checks            │
│   - Scripts          - Scripts          - Scripts               │
│   - Exporters        - Exporters        - Exporters             │
└─────────────────────────────────────────────────────────────────┘
```

## Core Features

### 1. Registration & Discovery
- Agent registers with config-server on startup
- **Agent-side UUID**: Agent generates and persists its own ID at installation
- Receives target configuration (which checks to run)
- Periodic heartbeat so the config-server can track agent liveness
- Auto-reconnect on connection loss

### 2. Configuration Sync
- Pull assigned script policies from config-server
- Watch for configuration changes
- Hot-reload without restart
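
One possible way to detect configuration changes is to poll the policy payload and compare digests, reloading only when the digest differs. The sketch below assumes polling (the issue does not say how changes are watched), and `policyDigest` and `syncOnce` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// policyDigest returns the hex-encoded SHA-256 digest of a raw policy payload.
func policyDigest(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// syncOnce applies payload only when its digest differs from lastDigest.
// It returns the digest to remember and whether apply was invoked successfully.
func syncOnce(lastDigest string, payload []byte, apply func([]byte) error) (string, bool, error) {
	digest := policyDigest(payload)
	if digest == lastDigest {
		return lastDigest, false, nil // unchanged: nothing to reload
	}
	if err := apply(payload); err != nil {
		// Keep the old digest so the next poll retries the same payload.
		return lastDigest, false, err
	}
	return digest, true, nil
}
```

Here `apply` stands in for whatever hot-reload step the agent performs, such as swapping the scheduler's check list in place.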

### 3. Check Execution
- Execute assigned health check scripts
- Collect script output and exit codes
- Respect check intervals and timeouts
- Report results to config-server

### 4. Metrics Collection (Future)
- Scrape local exporters
- Aggregate and push to central location
- Remote write to Prometheus/VictoriaMetrics
- **Note**: Deferred to future phase

## Technical Requirements

### Communication Protocol

**REST (Selected)** - with abstraction layer for future gRPC support

```
POST   /api/v1/bootstrap-tokens/register   # Extend existing API
POST   /api/v1/agents/{id}/heartbeat
POST   /api/v1/agents/{id}/check-results
GET    /api/v1/checks/target/{id}          # Use existing API
```

Keep the agent's server communication behind an API handler abstraction so that gRPC can be added later without changing the code that calls it.

### Agent ID & Auto-matching Strategy

**Agent-side UUID approach adopted**

The agent generates a UUID at installation, stores it in `/etc/aami/agent-id`, and submits it to the server during registration.

#### Installation Script
```bash
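#!/bin/bash
# install-agent.sh
set -euo pipefail

AGENT_ID_FILE="/etc/aami/agent-id"
# Generate the ID only once so that it survives reinstalls and upgrades.
if [ ! -s "$AGENT_ID_FILE" ]; then
    mkdir -p "$(dirname "$AGENT_ID_FILE")"
    uuidgen > "$AGENT_ID_FILE"
fi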
```

#### Cloud-init (AWS/GCP/Azure)
```yaml
#cloud-config
write_files:
  - path: /etc/aami/agent-id
    permissions: '0644'
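    # The UUID is injected from IaC. Keep comments outside the content block,
    # because everything inside it is written to the file verbatim.
    content: |
      ${agent_uuid}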
```

#### Terraform Example
```hcl
resource "random_uuid" "agent_id" {}

resource "aws_instance" "monitored" {
  user_data = templatefile("cloud-init.yaml", {
    agent_uuid = random_uuid.agent_id.result
  })
}

# Pre-register in config-server (optional)
resource "aami_target" "this" {
  id       = random_uuid.agent_id.result
  hostname = "web-server-1" # literal name; count.index is only valid when count is set
  group_id = aami_group.web.id
}
```

#### Matching Behavior

| Scenario | AgentID | Behavior |
|----------|---------|----------|
| New registration (ID provided) | Provided | Create Target with AgentID |
| New registration (no ID) | Empty | Server generates UUID |
| Re-registration/restart | Provided | Connect to existing Target |
| Pre-registered | Provided | Connect Agent to existing Target |
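
The table above could translate into server-side logic along these lines. This is a sketch only: `resolveRegistration`, the action constants, and the `exists` and `newID` callbacks are hypothetical and stand in for the real Target lookup and UUID generator.

```go
package main

// registrationAction describes what the server did with a register request.
type registrationAction string

const (
	actionCreateWithAgentID  registrationAction = "create-with-agent-id"
	actionCreateWithServerID registrationAction = "create-with-server-id"
	actionConnectExisting    registrationAction = "connect-existing"
)

// resolveRegistration maps a submitted AgentID to a Target ID and an action.
// exists reports whether a Target with the given ID is already stored, and
// newID generates a server-side UUID.
func resolveRegistration(agentID string, exists func(string) bool, newID func() string) (string, registrationAction) {
	if agentID == "" {
		// No ID submitted: the server generates one.
		return newID(), actionCreateWithServerID
	}
	if exists(agentID) {
		// Covers both re-registration/restart and pre-registered Targets.
		return agentID, actionConnectExisting
	}
	// First time this ID is seen: create a Target keyed by the AgentID.
	return agentID, actionCreateWithAgentID
}
```

Re-registration and pre-registration share a branch here because, from the server's point of view, both find an existing Target for the submitted ID.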

### Agent Configuration

```yaml
# /etc/aami/agent.yaml
server:
  address: "config-server.example.com:8443"
  tls:
    enabled: true
    ca_cert: "/etc/aami/ca.crt"
    client_cert: "/etc/aami/agent.crt"
    client_key: "/etc/aami/agent.key"

agent:
  id_file: "/etc/aami/agent-id"  # Agent-side UUID file path
  hostname: ""  # Auto-detected if empty
  labels:
    environment: "production"
    datacenter: "dc1"

heartbeat:
  interval: 30s
  timeout: 10s

checks:
  default_timeout: 30s
  max_concurrent: 10
  result_buffer_size: 1000

logging:
  level: "info"
  format: "json"
```
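
To illustrate what `checks.result_buffer_size` might control, here is a sketch of a bounded buffer that holds results while the server is unreachable and hands them out in batches. The drop-oldest overflow policy and the `resultBuffer` type are assumptions; the issue does not specify overflow behaviour.

```go
package main

import "sync"

// resultBuffer holds pending results up to a fixed capacity.
// When full, the oldest entry is discarded to make room (assumed policy).
type resultBuffer struct {
	mu       sync.Mutex
	items    []string // serialized results; a real agent would store structs
	capacity int
	dropped  int
}

func newResultBuffer(capacity int) *resultBuffer {
	if capacity < 1 {
		capacity = 1
	}
	return &resultBuffer{capacity: capacity}
}

// Push appends item, evicting the oldest entry if the buffer is full.
func (b *resultBuffer) Push(item string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.items) == b.capacity {
		copy(b.items, b.items[1:])
		b.items = b.items[:len(b.items)-1]
		b.dropped++
	}
	b.items = append(b.items, item)
}

// Drain removes and returns up to max of the oldest entries as one batch.
func (b *resultBuffer) Drain(max int) []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := len(b.items)
	if max < n {
		n = max
	}
	if n <= 0 {
		return nil
	}
	batch := make([]string, n)
	copy(batch, b.items[:n])
	b.items = append(b.items[:0], b.items[n:]...)
	return batch
}
```

A reporting loop would call `Drain` with its batch size and push the batch back (or retry) if the upload fails.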

### Security

- mTLS for agent-server communication
- Agent authentication via client certificates or bootstrap tokens
- Script execution sandboxing (resource limits, allowed paths)
- Signed script policies to prevent tampering

## API Changes Required

### Bootstrap Register Extension

Current:
```go
type BootstrapRegister struct {
    Token     string
    Hostname  string
    IPAddress string
    GroupID   string
    Labels    map[string]string
    Metadata  map[string]string
}
```

Updated:
```go
type BootstrapRegister struct {
    Token     string
    AgentID   string            // New: Agent-submitted ID (optional)
    Hostname  string
    IPAddress string
    GroupID   string
    Labels    map[string]string
    Metadata  map[string]string
}
```

### Check Results Reporting (New)
```
POST /api/v1/agents/{id}/check-results
{
  "results": [
    {
      "check_id": "disk-space",
      "status": "critical",
      "exit_code": 2,
      "output": "Disk usage 95%",
      "duration_ms": 150,
      "executed_at": "2024-01-15T10:30:00Z"
    }
  ]
}
```

## Tasks

### Phase 1: Core Agent
- [ ] Create `services/node-agent` directory structure
- [ ] Implement agent configuration loading
- [ ] Implement Agent-side UUID generation and persistence
- [ ] Implement server connection (REST with abstraction layer)
- [ ] Implement registration flow (extend existing bootstrap API)
- [ ] Implement heartbeat mechanism
- [ ] Add graceful shutdown

### Phase 2: Check Execution
- [ ] Implement check scheduler
- [ ] Implement script executor with timeout
- [ ] Implement result buffering and batching
- [ ] Add resource limits for script execution

### Phase 3: Config-Server Integration
- [ ] Extend bootstrap register to accept AgentID
- [ ] Add check results ingestion endpoint
- [ ] Implement agent reconnection logic
- [ ] Add agent status to UI

### Phase 4: Security & Production
- [ ] Implement mTLS support
- [ ] Add agent authentication
- [ ] Add script signing/verification
- [ ] Create systemd service file
- [ ] Create installation script (with UUID generation)
- [ ] Write documentation

## Dependencies

- [x] Job Manager for handling async operations - **Completed** (#3)
- [x] Async API endpoints for long-running config operations - **Completed**
- [ ] Check results storage schema (new)

## Decisions Made

| Item | Decision | Notes |
|------|----------|-------|
| Protocol | REST | Design with API handler abstraction for gRPC extensibility |
| Metrics push | Deferred | Excluded from Phase 1; to be tracked in a separate future issue |
| Auto-matching | Agent-side UUID | UUID generated at installation; compatible with cloud environments |

## Related Work

- #3 - Job Manager Implementation (Completed)
- #13 - Redis-based JobStore Extension (Planned)
