A fault-tolerant distributed system with a control server and agent that executes commands reliably, with crash recovery and idempotent execution.
```
┌─────────────────────┐          ┌──────────────────────┐
│   Control Server    │◄─────────┤        Agent         │
│                     │          │                      │
│  - REST API         │          │  - Command Poller    │
│  - SQLite DB        │          │  - Executors         │
│  - Recovery Logic   │          │  - Idempotency DB    │
└─────────────────────┘          └──────────────────────┘
          │                                 │
          ├── POST /commands                ├── Polls GET /commands/next
          ├── GET /commands/:id             ├── Submits PUT /commands/:id/result
          ├── GET /commands (all)           └── Crash simulation flags
          └── SQLite persistence
```
Stack: Node.js + TypeScript + Fastify + SQLite
Persistence: better-sqlite3 with WAL mode
Responsibilities:
- Accept command submissions via REST API
- Persist command state to SQLite database
- Assign commands to agents with transaction-based locking
- Track command lifecycle: PENDING → RUNNING → COMPLETED/FAILED
- Recover from crashes by marking orphaned RUNNING commands as FAILED
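For reference, the lifecycle and the stored fields map onto a small set of TypeScript types. This is an illustrative sketch; the actual definitions live in `control-server/src/types/index.ts` and may differ in detail:

```typescript
// Illustrative sketch of the command model implied by the lifecycle above.
// The real types live in control-server/src/types/index.ts.
type CommandType = 'DELAY' | 'HTTP_GET_JSON';
type CommandStatus = 'PENDING' | 'RUNNING' | 'COMPLETED' | 'FAILED';

interface Command {
  id: string;
  type: CommandType;
  payload: string;        // JSON-encoded payload, as stored in SQLite
  status: CommandStatus;
  result: string | null;  // JSON-encoded result, set when the agent reports back
  agentId: string | null;
  createdAt: number;
  updatedAt: number;
  assignedAt: number | null;
}
```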
Endpoints:
- `GET /health` - Health check
- `POST /commands` - Submit new command
- `GET /commands` - List all commands
- `GET /commands/:id` - Get command status
- `GET /commands/next?agentId=X` - Agent polling (with locking)
- `PUT /commands/:id/result` - Agent submits result
Stack: Node.js + TypeScript + SQLite (for idempotency)
Responsibilities:
- Poll control server for pending commands
- Execute DELAY and HTTP_GET_JSON commands
- Submit results back to server
- Track executed commands in local SQLite DB (idempotency)
- Simulate crashes for testing (`--kill-after`, `--random-failures`)
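To make the flow concrete, here is a rough sketch of the polling cycle these responsibilities imply. It is illustrative only: the endpoint paths match the server API, but the helper names (`alreadyExecuted`, `markExecuted`, `execute`) and the 204 no-content convention are assumptions, not the actual code in `agent/src/agent.ts`.

```typescript
// Simplified sketch of one poll/execute/report cycle (illustrative, not the real agent.ts).
// Hypothetical helpers; see the idempotency and executor sketches elsewhere in this README.
declare function alreadyExecuted(commandId: string): boolean;
declare function markExecuted(commandId: string): void;
declare function execute(command: { id: string; type: string; payload: string }): Promise<unknown>;

async function pollOnce(serverUrl: string, agentId: string): Promise<void> {
  const res = await fetch(`${serverUrl}/commands/next?agentId=${agentId}`);
  if (!res.ok || res.status === 204) return;   // nothing assigned this cycle (assumed convention)
  const command = await res.json();

  if (alreadyExecuted(command.id)) {
    // Idempotency: this command already ran before a crash; skip re-execution.
    // The real agent still reports a result so the server can close it out.
    return;
  }

  const result = await execute(command);       // DELAY or HTTP_GET_JSON executor
  markExecuted(command.id);                    // record locally before reporting
  await fetch(`${serverUrl}/commands/${command.id}/result`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  });
}
```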
Executors:
- `DELAY`: Sleeps for the specified number of milliseconds, returns the actual duration
- `HTTP_GET_JSON`: Fetches a URL, returns status + body (truncated to 100KB)
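For illustration, the DELAY executor is essentially a timed sleep; a minimal sketch consistent with the result shape shown in the examples section (`ok`, `tookMs`), though the real code is in `agent/src/executors/delay.ts`:

```typescript
// Minimal sketch of a DELAY executor: sleep for payload.ms, report the actual elapsed time.
interface DelayPayload { ms: number }
interface DelayResult { ok: boolean; tookMs: number }

async function executeDelay(payload: DelayPayload): Promise<DelayResult> {
  const start = Date.now();
  await new Promise<void>((resolve) => setTimeout(resolve, payload.ms));
  return { ok: true, tookMs: Date.now() - start };
}
```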
Location: control-server/data/commands.db
Schema: control-server/src/db/schema.sql
```sql
CREATE TABLE commands (
  id TEXT PRIMARY KEY,
  type TEXT NOT NULL CHECK(type IN ('DELAY', 'HTTP_GET_JSON')),
  payload TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'PENDING',
  result TEXT,
  agentId TEXT,
  createdAt INTEGER NOT NULL,
  updatedAt INTEGER NOT NULL,
  assignedAt INTEGER
);
```

Indexes:

- `idx_commands_status` - Fast PENDING command lookups
- `idx_commands_agentId` - Agent-specific queries
Location: agent/data/idempotency.db
```sql
CREATE TABLE executed_commands (
  commandId TEXT PRIMARY KEY,
  executedAt INTEGER NOT NULL
);
```

Purpose: Prevents re-execution if the agent crashes after completing a command but before reporting to the server.
On Startup (`control-server/src/server.ts:18`):

```typescript
runRecovery(() => commandsService.recoverRunningCommands());
```

Strategy: Mark all RUNNING commands as FAILED, but the agent automatically retries them.
Rationale:
- FAILED status provides visibility that server crashed during execution
- Agent polling includes both PENDING and FAILED commands for automatic retry
- Agent idempotency prevents actual re-execution if already completed
- Best of both worlds: clear failure tracking + automatic recovery
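A minimal sketch of what that recovery step can look like with better-sqlite3 (the real implementation lives in `commands.service.ts`; this is illustrative only):

```typescript
import Database from 'better-sqlite3';

// Illustrative startup recovery: flip any command left in RUNNING to FAILED
// so the agent's next poll picks it up again. Paths and names are for illustration.
const db = new Database('control-server/data/commands.db');

function recoverRunningCommands(): number {
  const result = db
    .prepare("UPDATE commands SET status = 'FAILED', updatedAt = ? WHERE status = 'RUNNING'")
    .run(Date.now());
  return result.changes; // number of orphaned commands recovered
}
```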
Dual Protection:

1. Server-side locking (`control-server/src/services/commands.service.ts:77-110`):
   - Transaction ensures an atomic PENDING → RUNNING transition
   - No command is assigned to multiple agents
2. Agent-side idempotency (`agent/src/services/idempotency.ts`):
   - Checks the local DB before executing
   - Marks the command as executed after completion
   - Prevents re-execution even if server state is inconsistent
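A minimal sketch of those agent-side checks (the real code is `agent/src/services/idempotency.ts`; function names here are illustrative):

```typescript
import Database from 'better-sqlite3';

// Illustrative idempotency helpers backed by the agent's local SQLite database.
const idempotencyDb = new Database('agent/data/idempotency.db');

function alreadyExecuted(commandId: string): boolean {
  return idempotencyDb
    .prepare('SELECT 1 FROM executed_commands WHERE commandId = ?')
    .get(commandId) !== undefined;
}

function markExecuted(commandId: string): void {
  idempotencyDb
    .prepare('INSERT OR IGNORE INTO executed_commands (commandId, executedAt) VALUES (?, ?)')
    .run(commandId, Date.now());
}
```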
Edge Cases Handled:
- Crash mid-DELAY: Server marks FAILED, agent retries but won't re-execute if already started (idempotency)
- Crash after execution, before reporting: Server marks FAILED, agent picks it up, skips execution (idempotency), reports result
- Server restart during agent execution: Command marked FAILED on restart, agent may complete and submit successfully or retry
The system is designed for a single agent but includes foundational support for multiple agents:
1. Transaction-Based Command Assignment
The getNextPendingCommand() method uses SQLite transactions to ensure atomic command assignment:
```typescript
const transaction = db.transaction(() => {
  // SELECT + UPDATE in a single transaction
  // Only one agent gets each command, even with concurrent requests
});
```

This prevents race conditions when multiple agents poll simultaneously.
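A slightly fuller sketch of that assignment transaction, including the retry of FAILED commands described above (illustrative; the actual code is at `commands.service.ts:77-110` and may differ):

```typescript
import Database from 'better-sqlite3';

const db = new Database('control-server/data/commands.db');

// Illustrative atomic assignment: pick the oldest PENDING/FAILED command and
// claim it for this agent inside a single transaction.
const claimNextCommand = db.transaction((agentId: string) => {
  const command = db
    .prepare(
      "SELECT * FROM commands WHERE status IN ('PENDING', 'FAILED') ORDER BY createdAt LIMIT 1"
    )
    .get() as { id: string } | undefined;
  if (!command) return undefined;

  const now = Date.now();
  db.prepare(
    "UPDATE commands SET status = 'RUNNING', agentId = ?, assignedAt = ?, updatedAt = ? WHERE id = ?"
  ).run(agentId, now, now, command.id);

  return { ...command, status: 'RUNNING', agentId };
});

// Usage: const next = claimNextCommand('agent-1');
```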
2. Agent Identification & Tracking
Each command tracks:
- `agentId`: Which agent is executing it
- `assignedAt`: When it was assigned
- Status transitions prevent reassignment while RUNNING
3. Handling the "Additional Questions"
Q: What happens when multiple agents exist?
✅ Already Handled: Transaction-based locking ensures only one agent receives each command. The agentId field tracks ownership.
Q: Agent restarts quickly?
✅ Already Handled:
- Server-side status prevents reassigning RUNNING commands
- Agent-side idempotency DB prevents re-execution
- If server crashes, commands marked FAILED are automatically retried
Q: Agent requests next command while one is running?
✅ Already Handled:
- Agent polls continuously regardless of current work
- Server only returns PENDING or FAILED commands
- RUNNING commands stay locked until completed or server restart
1. Command Timeout Detection
Problem: Agent dies silently → command stuck in RUNNING forever
Solution Needed:
```typescript
// Sketch of the missing piece: detect commands stuck in RUNNING for > 5 minutes
recoverStalledCommands(timeoutMs: number): number {
  const staleTimestamp = Date.now() - timeoutMs;
  // Mark as FAILED any RUNNING command with assignedAt < staleTimestamp
  const result = db
    .prepare("UPDATE commands SET status = 'FAILED', updatedAt = ? WHERE status = 'RUNNING' AND assignedAt < ?")
    .run(Date.now(), staleTimestamp);
  return result.changes;
}
```

2. Agent Heartbeat System
Problem: No way to detect dead agents
Solution Needed:
- Agent registration table tracking last heartbeat
- Periodic check to mark agents DEAD
- Reassign their commands to healthy agents
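None of this exists today; a possible shape for it, purely as an illustration:

```typescript
import Database from 'better-sqlite3';

// Hypothetical heartbeat support (NOT implemented): agents check in periodically,
// and a sweep returns commands held by stale agents to the retry pool.
const db = new Database('control-server/data/commands.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS agents (
    agentId TEXT PRIMARY KEY,
    lastHeartbeatAt INTEGER NOT NULL
  )
`);

function recordHeartbeat(agentId: string): void {
  db.prepare(
    `INSERT INTO agents (agentId, lastHeartbeatAt) VALUES (?, ?)
     ON CONFLICT(agentId) DO UPDATE SET lastHeartbeatAt = excluded.lastHeartbeatAt`
  ).run(agentId, Date.now());
}

function reapDeadAgents(timeoutMs: number): void {
  const cutoff = Date.now() - timeoutMs;
  const dead = db
    .prepare('SELECT agentId FROM agents WHERE lastHeartbeatAt < ?')
    .all(cutoff) as { agentId: string }[];
  for (const { agentId } of dead) {
    // Free the dead agent's in-flight commands so a healthy agent can retry them.
    db.prepare(
      "UPDATE commands SET status = 'FAILED', updatedAt = ? WHERE agentId = ? AND status = 'RUNNING'"
    ).run(Date.now(), agentId);
  }
}
```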
3. Load Balancing
Current: Any agent can take any command (FIFO)
Better: Prefer assigning FAILED commands to same agent for context
For a single-agent system, these features add complexity without value:
- No timeout needed if agent crashes → server restart recovers
- No heartbeats needed for one agent
- Load balancing irrelevant with one worker
The current architecture supports easy extension to multiple agents by adding:
- Periodic stalled command recovery (10 lines)
- Agent heartbeat endpoint + table (30 lines)
- Smart assignment logic (5 lines)
Decision: On server restart, mark all RUNNING commands as FAILED, but agent automatically retries them by polling for both PENDING and FAILED commands
Alternatives Considered:
- Mark as PENDING: Loses visibility that a crash occurred
- Mark as FAILED only: Commands stay stuck, requiring manual intervention
- Smart detection: Too complex, not worth the risk
Trade-off: Provides visibility (FAILED status shows what crashed) while maintaining automatic recovery. Agent idempotency prevents double execution, making retry safe.
Implementation: SQLite transaction wraps SELECT + UPDATE
Benefits:
- Prevents race conditions if multiple agents added later
- Atomic state transition
- No external locking mechanism needed
Server: Status-based (PENDING → RUNNING prevents reassignment) Agent: Local DB tracking (prevents actual re-execution)
Why Both?:
- Defense in depth
- Handles network failures, partial completions
- Different failure modes covered
Chosen: SQLite with WAL mode
Reasons:
- ACID guarantees
- Efficient indexing for status queries
- Mature, battle-tested
- Transaction support for locking
- Easy to inspect with the `sqlite3` CLI
Trade-off: Single writer (not a problem for single-agent requirement)
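For reference, enabling WAL with better-sqlite3 is a single pragma at connection time (a sketch; the actual setup is in `control-server/src/db/index.ts`):

```typescript
import Database from 'better-sqlite3';

// Sketch of the connection setup with WAL enabled.
const db = new Database('control-server/data/commands.db');
db.pragma('journal_mode = WAL'); // readers no longer block the single writer
```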
```
node >= 18.x
npm or yarn
```

```bash
cd control-server
npm install
npm run start:dev
```

Server runs on `http://localhost:3000`.
```bash
cd agent
npm install
npm run start:dev
```

Environment Variables:

- `SERVER_URL` - Control server URL (default: `http://localhost:3000`)
- `POLL_INTERVAL` - Polling interval in ms (default: `2000`)
- `AGENT_ID` - Custom agent ID (default: auto-generated UUID)
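A sketch of how these variables are typically read at startup (defaults match the list above; the exact code in the agent may differ):

```typescript
import { randomUUID } from 'node:crypto';

// Illustrative agent configuration derived from environment variables.
const config = {
  serverUrl: process.env.SERVER_URL ?? 'http://localhost:3000',
  pollIntervalMs: Number(process.env.POLL_INTERVAL ?? 2000),
  agentId: process.env.AGENT_ID ?? `agent-${randomUUID()}`,
};
```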
Crash Simulation:
```bash
npm run start:dev -- --kill-after=10       # Crash after 10 seconds
npm run start:dev -- --random-failures     # Random crashes (10% chance per cycle)
```

```
docker >= 20.x
docker compose >= 2.x
```

Start both services:

```bash
docker compose up --build
```

This will:
- Build Docker images for both control-server and agent
- Start control-server on port 3000
- Wait for health check before starting agent
- Create persistent volumes for databases and logs
Run in detached mode:
```bash
docker compose up -d --build
```

View logs:
```bash
# All services
docker compose logs -f

# Specific service
docker compose logs -f control-server
docker compose logs -f agent
```

Stop services:

```bash
docker compose down
```

Stop and remove volumes (clears databases):

```bash
docker compose down -v
```

Submit commands (use the host port, default 3000 or your configured `CONTROL_SERVER_PORT`):
```bash
# Health check
curl http://localhost:3000/health

# DELAY command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"DELAY","payload":{"ms":3000}}'

# HTTP_GET_JSON command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"HTTP_GET_JSON","payload":{"url":"https://jsonplaceholder.typicode.com/posts/1"}}'

# Check status
curl http://localhost:3000/commands/<commandId>

# List all commands
curl http://localhost:3000/commands
```

Note: If you set `CONTROL_SERVER_PORT=7001` in your `.env`, use `http://localhost:7001` instead.
Access service logs:
```bash
# Check agent logs inside container
docker exec -it agent cat /app/logs/agent.log

# Check server logs inside container
docker exec -it control-server cat /app/logs/server.log
```

Access databases:

```bash
# Control server database
docker exec -it control-server sqlite3 /app/data/commands.db "SELECT id, status FROM commands;"

# Agent idempotency database
docker exec -it agent sqlite3 /app/data/idempotency.db "SELECT * FROM executed_commands;"
```

Persistent data is stored in named volumes:

- `control-server-data`: Server SQLite database
- `control-server-logs`: Server log files
- `agent-data`: Agent idempotency database
- `agent-logs`: Agent log files
The docker compose configuration supports environment variables for flexible deployment:
1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` with your values:

   ```bash
   # Port mapping: Host port -> Container port
   # Control server listens on port 3000 inside the container
   # This maps it to a different port on the host machine
   CONTROL_SERVER_PORT=7001

   # IMPORTANT: SERVER_URL must use the INTERNAL container port (3000)
   # NOT the external host port (7001)
   # Containers communicate directly inside the Docker network
   SERVER_URL=http://control-server:3000

   POLL_INTERVAL=2000
   NODE_ENV=production
   AGENT_ID=agent-1
   ```

3. Run with custom configuration:

   ```bash
   docker compose up -d
   ```
Understanding Port Configuration:
- Container Port: The port the service listens on INSIDE the container (always 3000 for control-server)
- Host Port: The port exposed on the HOST machine (configurable via `CONTROL_SERVER_PORT`, e.g., 7001)
- Docker Network: Containers communicate using container ports, not host ports
- External Access: Use the host port (e.g., `http://localhost:7001/health`)
- Internal Access: Agent uses the container port via `SERVER_URL=http://control-server:3000`
To expose the control-server through nginx on a custom path:
1. Create nginx configuration (`/etc/nginx/sites-available/agent-executor`):

   ```nginx
   server {
       listen 80;
       server_name your-domain.com;  # or use _ for IP-based access

       location /agent-executor {
           rewrite ^/agent-executor(/.*)$ $1 break;
           rewrite ^/agent-executor$ / break;

           proxy_pass http://127.0.0.1:7001;  # Use the host port
           proxy_http_version 1.1;
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection 'upgrade';
           proxy_set_header Host $host;
           proxy_set_header X-Real-IP $remote_addr;
           proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
           proxy_cache_bypass $http_upgrade;
       }
   }
   ```

2. Enable and reload:

   ```bash
   sudo ln -s /etc/nginx/sites-available/agent-executor /etc/nginx/sites-enabled/
   sudo nginx -t
   sudo systemctl reload nginx
   ```

3. Test:

   ```bash
   curl http://your-domain.com/agent-executor/health
   ```
Important: Make sure to use the host port (e.g., 7001) in the nginx proxy_pass directive, not the container port (3000).
Problem: Container builds but server can't be accessed from host
Symptom: curl: (56) Recv failure: Connection reset by peer or nginx 502 errors
Solution: The server must bind to 0.0.0.0 instead of localhost to accept connections through Docker port mapping:
```typescript
// control-server/src/server.ts
await fastify.listen({ port: 3000, host: '0.0.0.0' });
```

Without `host: '0.0.0.0'`, the server only accepts connections from inside the container, preventing Docker port mapping from working.
Problem: Agent logs show "fetch failed" errors
Symptom: Agent continuously fails to connect to control-server
Solution: Check that SERVER_URL uses the internal container port (3000), not the external host port:
```bash
# Wrong - uses host port
SERVER_URL=http://control-server:7001

# Correct - uses container port
SERVER_URL=http://control-server:3000
```

Problem: Docker build fails with "EBADENGINE" error
Symptom: npm warn EBADENGINE Unsupported engine { package: 'better-sqlite3@12.6.0' }
Solution: Update Dockerfiles to use Node 20+:
```dockerfile
FROM node:20-alpine
```

Problem: TypeScript compilation fails in Docker
Symptom: tsc: The TypeScript Compiler - Version 5.9.3 (help output instead of compilation)
Solution: Ensure tsconfig.json is NOT in .dockerignore. Remove it if present:
```
# .dockerignore should NOT contain:
# tsconfig.json   <- Remove this line
```

Problem: `npm ci` fails with "package-lock.json not found"
Solution: Ensure package-lock.json is committed to git and not in .gitignore.
This project includes a GitHub Actions CI/CD pipeline that automatically builds, tests, and pushes Docker images to GitHub Container Registry.
The workflow (.github/workflows/ci-cd.yml) runs on:
- Push to main: Builds, tests, and pushes Docker images to GitHub Container Registry
- Pull requests to main: Runs build and test only (no image push)
- Build and Test: Compiles TypeScript and runs tests for both services
- Build and Push: Creates Docker images and pushes to `ghcr.io/YOUR_USERNAME/agent-executor/control-server:latest` and `ghcr.io/YOUR_USERNAME/agent-executor/agent:latest`
No manual configuration needed - GitHub automatically provides GITHUB_TOKEN for pushing to the registry.
After the CI/CD pipeline pushes images to GitHub Container Registry, you can pull and run them on any server:
1. On your server, copy the production compose file:

   ```bash
   # Create a directory
   mkdir -p ~/agent-executor
   cd ~/agent-executor

   # Copy files (or git clone the repo)
   # You need: docker-compose.prod.yml and .env.example
   ```

2. Create a `.env` file with your GitHub repository:

   ```bash
   cp .env.example .env
   nano .env
   ```

   Set this variable:

   ```bash
   GITHUB_REPOSITORY=your-username/agent-executor
   ```

3. Pull and start services:

   ```bash
   docker compose -f docker-compose.prod.yml pull
   docker compose -f docker-compose.prod.yml up -d
   ```

4. Check it's running:

   ```bash
   curl http://localhost:3000/health
   ```
```bash
# Login to GitHub Container Registry (if private repo)
echo $GITHUB_TOKEN | docker login ghcr.io -u YOUR_USERNAME --password-stdin

# Pull images
docker pull ghcr.io/Dagmawi-22/agent-executor/control-server:latest
docker pull ghcr.io/Dagmawi-22/agent-executor/agent:latest

# Run with docker-compose.prod.yml
docker compose -f docker-compose.prod.yml up -d
```

Note: By default, GitHub Container Registry packages are private. Make them public in: GitHub → Packages → Your package → Package settings → Change visibility
Terminal 1 (Server):
```bash
cd control-server && npm run start:dev
```

Terminal 2 (Agent):

```bash
cd agent && npm run start:dev
```

Terminal 3 (Create commands):

```bash
# DELAY command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"DELAY","payload":{"ms":3000}}'
# Expected: {"commandId":"<uuid>"}

# HTTP_GET_JSON command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"HTTP_GET_JSON","payload":{"url":"https://jsonplaceholder.typicode.com/posts/1"}}'

# Check status
curl http://localhost:3000/commands/<commandId>
# Expected: {"status":"COMPLETED","result":{...},"agentId":"agent-..."}
```

Steps:
- Start server + agent
- Submit DELAY command with 10s
- Kill server (Ctrl+C) while agent is executing
- Restart server
- Check database: command should be marked FAILED
Verify:
```bash
sqlite3 control-server/data/commands.db "SELECT id, status FROM commands;"
```

Steps:
- Start server
- Start agent with `--kill-after=5`
- Submit DELAY command with 10s
- Agent crashes after 5s
- Restart agent
- Agent should NOT re-execute the command (idempotency check)
Expected Log:
Command <id> already executed (idempotency check)
Steps:
- Start server + agent
- Submit command
- Kill agent AFTER execution but BEFORE result submission
- Restart agent
- Agent fetches same command (marked RUNNING)
- Agent skips execution (already in idempotency DB)
Verify Agent DB:
```bash
sqlite3 agent/data/idempotency.db "SELECT * FROM executed_commands;"
```

```
├── control-server/
│   ├── src/
│   │   ├── db/
│   │   │   ├── index.ts              # Database connection
│   │   │   └── schema.sql            # SQLite schema
│   │   ├── services/
│   │   │   └── commands.service.ts   # CRUD + recovery logic
│   │   ├── routes/
│   │   │   └── index.ts              # API endpoints
│   │   ├── types/
│   │   │   └── index.ts              # TypeScript types
│   │   └── server.ts                 # Fastify server + recovery
│   ├── data/                         # SQLite database (gitignored)
│   ├── package.json
│   └── tsconfig.json
│
└── agent/
    ├── src/
    │   ├── executors/
    │   │   ├── index.ts              # Command dispatcher
    │   │   ├── delay.ts              # DELAY executor
    │   │   └── http-get-json.ts      # HTTP_GET_JSON executor
    │   ├── services/
    │   │   ├── api.ts                # HTTP client for server
    │   │   └── idempotency.ts        # Local DB tracking
    │   ├── types/
    │   │   └── index.ts              # TypeScript types
    │   └── agent.ts                  # Main polling loop
    ├── data/                         # Idempotency DB (gitignored)
    ├── package.json
    └── tsconfig.json
```
A complete Postman collection is available at postman_collection.json with all API endpoints and examples.
- Open Postman
- Click Import button
- Select `postman_collection.json` from this repository
- The collection will be imported with all endpoints
The collection uses these variables:
- `baseUrl`: API base URL (default: `http://localhost:3000`)
- `commandId`: Auto-populated when you create a command
- `agentId`: Agent identifier for agent endpoints (default: `agent-test-001`)
- Update baseUrl if your server is on a different port or host
- Run Health Check to verify the server is running
- Create a command using one of the POST endpoints (commandId will be auto-saved)
- Get Command by ID to check the status (uses the saved commandId)
- Get All Commands to see all commands
- ✅ Health Check
- ✅ Create DELAY Command
- ✅ Create HTTP_GET_JSON Command (JSONPlaceholder example)
- ✅ Create HTTP_GET_JSON Command (GitHub API example)
- ✅ Get All Commands
- ✅ Get Command by ID
- ✅ Agent - Poll for Next Command
- ✅ Agent - Submit Command Result (DELAY)
- ✅ Agent - Submit Command Result (HTTP)
```bash
# Create DELAY command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"DELAY","payload":{"ms":5000}}'

# Create HTTP_GET_JSON command
curl -X POST http://localhost:3000/commands \
  -H "Content-Type: application/json" \
  -d '{"type":"HTTP_GET_JSON","payload":{"url":"https://api.github.com/users/octocat"}}'

# Get command by ID
curl http://localhost:3000/commands/<commandId>

# Get all commands
curl http://localhost:3000/commands
```

Input:
```json
{
  "type": "DELAY",
  "payload": { "ms": 3000 }
}
```

Output:
```json
{
  "ok": true,
  "tookMs": 3001
}
```

Input:
```json
{
  "type": "HTTP_GET_JSON",
  "payload": { "url": "https://jsonplaceholder.typicode.com/posts/1" }
}
```

Output (success):
```json
{
  "status": 200,
  "body": { "userId": 1, "id": 1, "title": "...", "body": "..." },
  "truncated": false,
  "bytesReturned": 292,
  "error": null
}
```

Output (error):
```json
{
  "status": 0,
  "body": null,
  "truncated": false,
  "bytesReturned": 0,
  "error": "fetch failed"
}
```

- Single agent assumption: Designed for one agent. Multi-agent support exists (transaction-based locking) but lacks timeout detection and heartbeats. See Multi-Agent Support for details.
- No command timeout: Commands can run forever if agent dies silently. Requires timeout detection for production multi-agent deployments.
- No exponential backoff: Retries happen immediately on agent poll, no delay strategy
- No result validation: Server accepts any result format from agent
- No authentication: Open API, no security
- No metrics/monitoring: No Prometheus, logging only
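For instance, the missing retry-delay strategy could be as small as the following sketch (not implemented anywhere in the codebase):

```typescript
// Hypothetical exponential backoff with jitter for retry/poll delays (not implemented).
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(capped / 2 + Math.random() * (capped / 2)); // equal jitter: half fixed, half random
}
```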
- Command timeout detection and stalled command recovery
- Agent heartbeat system with health monitoring
- Smart load balancing (prefer same agent for retries)
- Agent registration and discovery
- Exponential backoff for retry delays
- Max retry count to prevent infinite retry loops
- Graceful shutdown handling
- Circuit breaker for failing agents
- Command priority queue
- Result schema validation
- Command cancellation support
- Batch command submission
- Authentication & authorization
- Metrics (Prometheus) + monitoring
- Rate limiting
- Audit logging
This project was built collaboratively between a human developer and AI (Claude Code). This section honestly documents the contributions and limitations of both.
The human developer was responsible for:
- Architecture decisions: Chose the overall approach and requirements
- Problem-solving: Identified critical missing features (like automatic retry mechanism)
- Code review: Caught all AI mistakes and required fixes
- Testing: Performed all manual testing and validation
- Git workflow: Created all commits with meaningful messages
- Quality standards: Enforced TypeScript best practices (no `any` types, proper logging)
- Critical thinking: Questioned AI suggestions and pushed for better solutions
AI assisted with:
- Boilerplate generation: TypeScript types, Fastify routes, database schema
- Code scaffolding: Initial structure for services, executors, and utilities
- Documentation: README structure and API examples
- Docker setup: Dockerfiles and docker compose configuration
AI made several significant mistakes that required human intervention:
1. Missing retry logic: Initially implemented recovery that marked commands as FAILED but failed to implement any retry mechanism. The developer had to point out this critical flaw and suggest the solution of polling for both PENDING and FAILED commands.
2. Import issues: Used wrong syntax for `better-sqlite3` imports, causing runtime errors
3. TypeScript configuration:
   - Used deprecated `moduleResolution` settings
   - Had to fix multiple times (node → bundler → node10 with ignoreDeprecations)
   - Agent's tsconfig.json became empty at one point
4. Missing files: The `agent.ts` file was completely missing initially, causing the agent to fail
5. Type safety: Initially used `any` types everywhere until the developer requested proper typing
6. Directory creation: Forgot to create data directories, causing database initialization failures
7. Logging approach: Initially suggested console.log everywhere; the developer requested file-based logging with timestamps
Key improvements driven by the developer:
- Automatic retry mechanism (AI completely missed this)
- File-based logging with timestamps instead of console.log
- Removing all `any` types for full type safety
- Proper error handling and idempotency checks
- Docker setup optimization
While AI accelerated initial development, the human developer was essential for:
- Catching critical design flaws
- Ensuring code quality
- Implementing missing features
- Fixing numerous bugs
- Making architectural decisions
The final working system is a result of active human supervision and correction of AI-generated code.
Built with Node.js, TypeScript, Fastify, and SQLite