A socket-based distributed task execution system showcasing core distributed systems concepts.
- Workers automatically find and bind to available ports
- Retry mechanism if ports are occupied
- Eliminates port conflicts in multi-instance scenarios
- Master discovers workers by scanning port ranges
- Workers register with master on first contact
- Dynamic worker pool management
- Workers monitor master health via heartbeat
- Workers self-terminate if master is unreachable (30s timeout)
- Master detects and removes unresponsive workers
- Automatic task retry on worker failure
- TCP sockets for reliable message delivery
- JSON protocol for task serialization
- Async I/O for non-blocking communication
- Connection pooling and management
- Configurable retry logic per task
- Timeout detection for hung workers
- Graceful degradation when workers fail
- Task reallocation on failure
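
The retry and reallocation bullets above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_with_retries`, `WorkerFailure`, and the `send` callback are hypothetical names, and the real system may pick the next worker differently.

```python
from typing import Callable, List, Tuple


class WorkerFailure(Exception):
    """Hypothetical error raised when a worker fails or times out."""


def run_with_retries(
    task: dict,
    workers: List[str],
    send: Callable[[str, dict], object],
    max_retries: int = 3,
) -> Tuple[str, object]:
    """Run a task, moving it to another worker each time one fails.

    Returns (worker_id, result) from the first worker that succeeds.
    """
    failed = set()
    for _ in range(max_retries + 1):
        # Only consider workers that have not already failed this task.
        candidates = [w for w in workers if w not in failed]
        if not candidates:
            break
        worker = candidates[0]
        try:
            return worker, send(worker, task)
        except WorkerFailure:
            # Mark the worker as failed and reallocate on the next pass.
            failed.add(worker)
    raise RuntimeError(f"task failed on workers: {sorted(failed)}")
```

Passing `max_retries` per call is one way to get the configurable, per-task retry behaviour; the system keeps running as long as at least one worker can still accept the task.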
Master Node (Port 4999)
↓ discovers
Worker-1 (Port 5001) ←→ Socket Communication
Worker-2 (Port 5003) ←→ Task Distribution
Worker-3 (Port 5017) ←→ Health Monitoring
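
The worker ports in the diagram are examples of what each worker might end up with after scanning for a free port. A minimal sketch of that find-and-bind step, assuming an inclusive port range and a local-only bind address (the function name `bind_to_free_port` is hypothetical):

```python
import socket


def bind_to_free_port(port_range=(5000, 5100), host="127.0.0.1"):
    """Return (listening_socket, port) for the first bindable port in the range."""
    start, end = port_range
    for port in range(start, end + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            # Port is occupied (or otherwise unavailable): try the next one.
            sock.close()
            continue
        sock.listen()
        return sock, port
    raise RuntimeError(f"no free port in {start}-{end}")
```

Because each worker keeps the socket it managed to bind, two workers started at the same time cannot end up on the same port.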

    python3 src/main.py

Expected Flow:
- Master starts on first available port (4999+)
- Workers start and self-assign ports (5000-5100 range)
- Master discovers workers via port scanning
- Tasks distributed with priority-based scheduling
- Workers execute and report results
- Master shutdown triggers worker self-termination
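
The priority-based scheduling step in the flow above might look something like this sketch. It assumes that a lower number means a higher priority and that tasks are handed to workers in rotation; `PriorityTaskQueue` and `assign_round_robin` are hypothetical names, not taken from the project.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Priority queue where lower numbers are served first (an assumption)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal-priority tasks keep their insertion order.
        self._counter = itertools.count()

    def put(self, task, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def get(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


def assign_round_robin(queue, workers):
    """Drain the queue in priority order, rotating through the workers."""
    assignments = []
    for worker in itertools.cycle(workers):
        if len(queue) == 0:
            break
        assignments.append((worker, queue.get()))
    return assignments


queue = PriorityTaskQueue()
queue.put("low", 5)
queue.put("urgent", 0)
queue.put("normal", 2)
print(assign_round_robin(queue, ["worker-1", "worker-2"]))
# [('worker-1', 'urgent'), ('worker-2', 'normal'), ('worker-1', 'low')]
```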

Log files:
- master.log - Task distribution, worker discovery, health checks
- worker-1.log - Task execution, master monitoring
- worker-2.log - Worker activity and failures
- worker-3.log - Port assignment, task processing

Adjust worker count:

    workers = await start_workers(num_workers=5)

Change port ranges:

    WorkerNode(worker_id="worker-1", port_range=(6000, 6100))

Modify master timeout for worker self-termination:

    worker.master_timeout = 60  # seconds

- Network Partitioning: Workers handle master disconnection
- Leader Election: Single master coordinates workers
- Load Balancing: Tasks distributed across available workers
- Health Monitoring: Continuous heartbeat mechanism
- Graceful Degradation: System continues with available workers
- Retry Logic: Automatic recovery from transient failures
- Port Management: Dynamic resource allocation
- Service Discovery: Automatic worker registration
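
The health-monitoring and network-partitioning behaviour listed above comes down to tracking when the master was last heard from. A minimal sketch of that check on the worker side, assuming the 30-second default mentioned earlier (the `MasterWatchdog` class and its injectable `clock` parameter are hypothetical, added here to make the logic easy to test):

```python
import time


class MasterWatchdog:
    """Tracks the last heartbeat from the master and reports liveness."""

    def __init__(self, master_timeout=30.0, clock=time.monotonic):
        self.master_timeout = master_timeout
        self._clock = clock
        self._last_seen = clock()

    def record_heartbeat(self):
        # Call this whenever a heartbeat or any message from the master arrives.
        self._last_seen = self._clock()

    def master_is_alive(self):
        # The master counts as unreachable once the timeout has elapsed.
        return (self._clock() - self._last_seen) <= self.master_timeout
```

A worker loop could call `master_is_alive()` periodically and shut itself down when it returns `False`. Using a monotonic clock avoids false timeouts if the system time is adjusted.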