# Distributed ML Training Platform

This project is a distributed neural network training platform that splits training across multiple peers using FastAPI. The system uses peer-to-peer communication to distribute training tasks efficiently and tracks performance metrics such as accuracy and loss.
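
As a rough illustration of the peer side, the sketch below shows what a FastAPI service that accepts training tasks and reports metrics could look like. The routes, payload fields, and metric names are hypothetical, not the project's actual API:

```python
# Hypothetical sketch of a peer-side FastAPI service; routes and
# field names are illustrative, not the project's actual API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TrainTask(BaseModel):
    shard_url: str       # S3 URL of this peer's data shard
    epochs: int
    learning_rate: float

class Metrics(BaseModel):
    accuracy: float
    loss: float

latest_metrics = Metrics(accuracy=0.0, loss=0.0)

@app.post("/train")
def start_training(task: TrainTask) -> dict:
    # A real peer would download the shard, run forward/backward
    # passes, and publish its results over RabbitMQ.
    return {"status": "accepted", "shard": task.shard_url}

@app.get("/metrics")
def get_metrics() -> Metrics:
    # Report this peer's most recent training metrics.
    return latest_metrics
```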

## Infrastructure Services

A Docker Compose stack with MinIO (S3-compatible object storage for dataset shards), RabbitMQ (per-peer task queues), and MongoDB (user storage for JWT authentication).

### Quick Start

```sh
# Start all services in the background
docker-compose up -d

# Stop everything and delete the data volumes
docker-compose down -v
```

### Services

| Service    | Ports       | Username   | Password   | URL                    |
|------------|-------------|------------|------------|------------------------|
| MinIO (S3) | 9000, 9001  | minioadmin | minioadmin | http://localhost:9001  |
| RabbitMQ   | 5672, 15672 | admin      | admin      | http://localhost:15672 |
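
For reference, a compose file matching the table above might look like the sketch below. The repository ships its own docker-compose.yml, so treat the image tags and wiring here as assumptions based on the official images:

```yaml
# Illustrative sketch only; the project's actual docker-compose.yml
# may differ in image tags, volumes, and networking.
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin

  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "5672:5672"     # AMQP
      - "15672:15672"   # management UI
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: admin

  mongodb:
    image: mongo
    ports:
      - "27017:27017"   # default MongoDB port; no credentials listed above
```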

### Commands

```sh
# Start
docker-compose up -d

# Stop
docker-compose down

# Stop + delete data
docker-compose down -v

# Logs
docker-compose logs -f
```

## High-Level Architecture

Below is the high-level architecture diagram of the distributed neural network training system:


```mermaid
graph TB
    subgraph Step1["Step 1: Authentication"]
        Client[User/Client<br/>Desktop FastAPI App]
        Auth[JWT Auth System<br/>+ MongoDB Users]
    end
    
    subgraph Step2["Step 2: Peer Selection"]
        Registry[Active Peer Registry<br/>Returns queue ID & device specs]
    end
    
    subgraph Step3["Step 3: Dataset Upload + Sharding"]
        Upload[Upload dataset<br/>.zip/.csv]
        Sharder[Data Sharder<br/>Splits into N shards]
        S3[Upload to S3 Blob]
    end
    
    subgraph Step4["Step 4: Prepare Training Job"]
        Config[config.json<br/>- RabbitMQ queues<br/>- shard file URLs<br/>- training params + config]
    end
    
    subgraph Step5["Step 5: Send Tasks to Peers"]
        RabbitMQ[RabbitMQ - 1 queue per peer]
        Peer1[Peer 1<br/>Local machine]
        Peer2[Peer 2<br/>Local machine]
        PeerN[Peer N<br/>Local machine]
        
        Listener1[Core 1<br/>RabbitMQ<br/>Listener]
        Listener2[Core 1<br/>RabbitMQ<br/>Listener]
        ListenerN[Core 1<br/>RabbitMQ<br/>Listener]
        
        Train1[Core 2<br/>Forward +<br/>Backward<br/>Training]
        Train2[Core 2<br/>Forward +<br/>Backward<br/>Training]
        TrainN[Core 2<br/>Forward +<br/>Backward<br/>Training]
    end
    
    subgraph Step6["Step 6: Client Aggregates Results"]
        Dispatcher[Core 1<br/>Event/Queue<br/>Dispatcher]
        Aggregator[Core 2<br/>Aggregator Thread<br/>Merge Models]
    end
    
    subgraph Step7["Step 7: Final Model"]
        Final[Client saves model<br/>.pt/.h5]
    end
    
    Client --> Auth
    Auth --> Registry
    Registry --> Upload
    Upload --> Sharder
    Sharder --> S3
    S3 --> Config
    Config --> RabbitMQ
    
    RabbitMQ --> Peer1 & Peer2 & PeerN
    Peer1 --> Listener1 --> Train1
    Peer2 --> Listener2 --> Train2
    PeerN --> ListenerN --> TrainN
    
    Train1 & Train2 & TrainN --> Dispatcher
    Dispatcher --> Aggregator
    Aggregator --> Final
    
    style Client fill:#64748b,stroke:#333,stroke-width:2px,color:#fff
    style Auth fill:#0c4b33,stroke:#333,stroke-width:2px,color:#fff
    style Registry fill:#2563eb,stroke:#333,stroke-width:2px,color:#fff
    style Upload fill:#8b5cf6,stroke:#333,stroke-width:2px,color:#fff
    style Sharder fill:#8b5cf6,stroke:#333,stroke-width:2px,color:#fff
    style S3 fill:#f59e0b,stroke:#333,stroke-width:2px,color:#fff
    style Config fill:#10b981,stroke:#333,stroke-width:2px,color:#fff
    style RabbitMQ fill:#ff6600,stroke:#333,stroke-width:2px,color:#fff
    style Peer1 fill:#3b82f6,stroke:#333,stroke-width:2px,color:#fff
    style Peer2 fill:#3b82f6,stroke:#333,stroke-width:2px,color:#fff
    style PeerN fill:#3b82f6,stroke:#333,stroke-width:2px,color:#fff
    style Listener1 fill:#06b6d4,stroke:#333,stroke-width:2px,color:#fff
    style Listener2 fill:#06b6d4,stroke:#333,stroke-width:2px,color:#fff
    style ListenerN fill:#06b6d4,stroke:#333,stroke-width:2px,color:#fff
    style Train1 fill:#ec4899,stroke:#333,stroke-width:2px,color:#fff
    style Train2 fill:#ec4899,stroke:#333,stroke-width:2px,color:#fff
    style TrainN fill:#ec4899,stroke:#333,stroke-width:2px,color:#fff
    style Dispatcher fill:#0c4b33,stroke:#333,stroke-width:2px,color:#fff
    style Aggregator fill:#dc2626,stroke:#333,stroke-width:2px,color:#fff
    style Final fill:#64748b,stroke:#333,stroke-width:2px,color:#fff
```
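
Step 4 bundles everything a peer needs into config.json. The real schema lives in the repository; purely as an illustration of the contents named above (RabbitMQ queues, shard file URLs, training params), where all keys and values are assumptions, it might look like:

```json
{
  "queues": ["peer_1", "peer_2", "peer_3"],
  "shard_urls": [
    "s3://shards/job-42/shard_0.csv",
    "s3://shards/job-42/shard_1.csv",
    "s3://shards/job-42/shard_2.csv"
  ],
  "training": {
    "epochs": 10,
    "batch_size": 32,
    "learning_rate": 0.001
  }
}
```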
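
Step 5 gives each peer its own RabbitMQ queue and publishes one task per queue. Below is a minimal sketch of that dispatch pattern using pika; the queue names and message fields are assumptions, not the project's actual protocol:

```python
# Sketch: publish one training task to each peer's dedicated queue.
# Queue names and message fields are illustrative assumptions.
import json

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(
        host="localhost",
        credentials=pika.PlainCredentials("admin", "admin"),
    )
)
channel = connection.channel()

shard_urls = [
    "s3://shards/job-42/shard_0.csv",
    "s3://shards/job-42/shard_1.csv",
]

for i, shard_url in enumerate(shard_urls, start=1):
    queue = f"peer_{i}"  # one queue per peer
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps({"shard_url": shard_url, "epochs": 10}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

connection.close()
```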

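Step 6 merges the peers' trained models back on the client. A common way to merge data-parallel results is to average the parameter tensors (FedAvg-style); the sketch below assumes PyTorch state dicts, consistent with the .pt output in Step 7, but it is an illustration rather than the project's exact aggregation code:

```python
# Sketch: FedAvg-style merge that averages parameters across the
# state dicts returned by the peers. Assumes every peer trained the
# same architecture; illustrative, not the project's exact code.
import torch

def merge_state_dicts(state_dicts: list[dict]) -> dict:
    merged = {}
    for key in state_dicts[0]:
        # Stack each parameter across peers and take the element-wise mean.
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

# Usage: load each peer's result, merge, and save the final model.
# peer_results = [torch.load(path) for path in ["peer_1.pt", "peer_2.pt"]]
# torch.save(merge_state_dicts(peer_results), "model.pt")
```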