A learning project by Madhav Bhayani — exploring distributed systems, Go concurrency patterns, and real-world ETL pipeline design.
DataForge is a full-stack ETL (Extract, Transform, Load) pipeline engine built entirely from scratch. It lets you upload CSV files, run multi-stage data transformations — cleaning, normalization, and deduplication — and export clean datasets, all powered by a concurrent worker pool and a typed REST API.
Apart from the chi router, every backend component is hand-built rather than pulled from an external Go framework, as a deliberate exercise in systems programming.
┌────────────────────────────────────────────────────────┐
│           React 19 + Tailwind v4 (frontend)            │
│ Upload → Analyze → Clean → Normalize → Dedup → Export  │
└─────────────────────────┬──────────────────────────────┘
                          │ REST API
┌─────────────────────────┴──────────────────────────────┐
│              Go 1.25 Backend (chi router)              │
│    ┌────────────┐  ┌──────────┐  ┌───────────────┐     │
│    │ Dispatcher │→ │ Worker   │→ │ ETL Executors │     │
│    │            │  │ Pool (N) │  │ (clean/norm/  │     │
│    │            │  │          │  │  dedup)       │     │
│    └────────────┘  └──────────┘  └───────────────┘     │
│     ┌──────────────────┐    ┌───────────────────┐      │
│     │ Priority Queue   │    │ In-Memory Stores  │      │
│     │ (high/med/low)   │    │ (jobs, datasets)  │      │
│     └──────────────────┘    └───────────────────┘      │
└────────────────────────────────────────────────────────┘
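
The dispatcher, priority queue, and worker pool in the diagram can be pictured with a small sketch. This is not DataForge's actual code: the `Job`, `Queue`, and `RunPool` names, the one-channel-per-priority layout, and the non-blocking `Pop` are all assumptions, chosen only to illustrate priority dispatching with the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a hypothetical unit of work; the project's real job model may differ.
type Job struct {
	ID       int
	Priority string // "high", "med", or "low"
}

// Queue holds one buffered channel per priority level (an assumed layout).
type Queue struct {
	high, med, low chan Job
}

func NewQueue(size int) *Queue {
	return &Queue{
		high: make(chan Job, size),
		med:  make(chan Job, size),
		low:  make(chan Job, size),
	}
}

// Push enqueues a job on the channel matching its priority.
// It blocks if that channel's buffer is full.
func (q *Queue) Push(j Job) {
	switch j.Priority {
	case "high":
		q.high <- j
	case "med":
		q.med <- j
	default:
		q.low <- j
	}
}

// Pop returns the highest-priority job currently buffered, without blocking.
func (q *Queue) Pop() (Job, bool) {
	for _, ch := range []chan Job{q.high, q.med, q.low} {
		select {
		case j := <-ch:
			return j, true
		default: // nothing buffered at this level, try the next one
		}
	}
	return Job{}, false
}

// RunPool drains the jobs buffered in q using n workers and returns once
// every dispatched job has been handled.
func RunPool(q *Queue, n int, handle func(Job)) {
	jobs := make(chan Job)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				handle(j)
			}
		}()
	}
	// Dispatcher loop: hand jobs to workers in priority order.
	for {
		j, ok := q.Pop()
		if !ok {
			break
		}
		jobs <- j
	}
	close(jobs)
	wg.Wait()
}

func main() {
	q := NewQueue(8)
	q.Push(Job{ID: 1, Priority: "low"})
	q.Push(Job{ID: 2, Priority: "high"})
	q.Push(Job{ID: 3, Priority: "med"})

	var order []int
	// A single worker keeps the handling order observable.
	RunPool(q, 1, func(j Job) { order = append(order, j.ID) })
	fmt.Println(order) // [2 3 1]
}
```

With more than one worker, jobs are still dispatched in priority order, but they may finish in any order, so the handler would need its own synchronization around shared state.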
| Layer | Technology |
|---|---|
| Backend | Go 1.25, chi router, in-memory stores |
| Frontend | React 19, Vite 7, Tailwind CSS v4 |
| Analytics | Firebase Analytics + Firestore |
| License | MIT |
- Concurrent Worker Pool — configurable goroutine pool with priority dispatching
- Intelligent CSV Analyzer — automatic column type detection with 85% majority-vote threshold
- Multi-Stage ETL Pipeline — clean → normalize → deduplicate with detailed per-step reports
- Smart Cleaning — null filling, whitespace trimming, type coercion with per-cell change tracking
- Exact & Fuzzy Dedup — configurable match columns, keep strategies, and detailed group reports with Load More pagination
- Dry Run Mode — preview duplicates without modifying data
- Typed REST API — structured JSON responses with health checks
- React Dashboard — real-time pipeline stepper with quality delta tracking
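
To make the 85% majority-vote threshold from the feature list concrete, here is a minimal sketch of how a column's type could be inferred. The `classify` and `DetectType` names, the set of type labels, and the choice to ignore empty cells when voting are assumptions for illustration, not the analyzer's confirmed behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify returns a hypothetical type label for a single cell.
func classify(cell string) string {
	s := strings.TrimSpace(cell)
	if s == "" {
		return "empty"
	}
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "integer"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	if _, err := strconv.ParseBool(s); err == nil {
		return "boolean"
	}
	return "string"
}

// DetectType labels a column with the type shared by at least `threshold`
// of its non-empty cells, falling back to "string" when no type reaches it.
// The threshold is assumed to be above 0.5, so at most one type can win.
func DetectType(cells []string, threshold float64) string {
	counts := map[string]int{}
	total := 0
	for _, c := range cells {
		t := classify(c)
		if t == "empty" {
			continue // assumption: empty cells do not vote
		}
		counts[t]++
		total++
	}
	if total == 0 {
		return "string"
	}
	for t, n := range counts {
		if float64(n)/float64(total) >= threshold {
			return t
		}
	}
	return "string"
}

func main() {
	col := []string{"1", "2", "3", "4", "5", "6", "7", "8", "9", "n/a"}
	fmt.Println(DetectType(col, 0.85)) // integer (9 of 10 cells parse as integers)

	mixed := []string{"1", "2", "x", "y"}
	fmt.Println(DetectType(mixed, 0.85)) // string (no type reaches 85%)
}
```

A threshold below 100% lets a column keep its numeric type despite a few stray values such as `n/a`, which a later cleaning step can then coerce or fill.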
- Go 1.25+
- Node.js 20+
cd "Go Distributed Job Processing Unit Project"
go run cmd/server/main.go
The API server starts on http://localhost:8080.
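
For a feel of what a typed health-check response might look like once the server is up, the sketch below builds one with the standard library alone. The `/health` path, the `HealthResponse` fields, and the `healthHandler` and `probe` helpers are hypothetical; the real routes are registered through chi and may look different. The handler is exercised in-process with `httptest`, so nothing needs to be listening on port 8080.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HealthResponse is a hypothetical typed payload; the field names are assumptions.
type HealthResponse struct {
	Status  string `json:"status"`
	Workers int    `json:"workers"`
}

// healthHandler returns a handler that writes a fixed JSON health payload.
func healthHandler(workers int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		// The encode error is ignored here for brevity.
		_ = json.NewEncoder(w).Encode(HealthResponse{Status: "ok", Workers: workers})
	}
}

// probe calls the handler in-process and decodes the typed response.
func probe(h http.Handler) (int, HealthResponse, error) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil) // path is assumed
	h.ServeHTTP(rec, req)

	var out HealthResponse
	err := json.NewDecoder(rec.Body).Decode(&out)
	return rec.Code, out, err
}

func main() {
	code, body, err := probe(healthHandler(4))
	fmt.Println(code, body.Status, body.Workers, err) // 200 ok 4 <nil>
}
```

Decoding into a struct rather than a generic map is what makes the response "typed" on the client side: a missing or renamed field shows up as a zero value that tests can catch.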
cd frontend/go-distributed-ui
npm install
npm run dev
Opens at http://localhost:5173.
├── cmd/server/ # Entry point
├── internal/
│ ├── analyzer/ # CSV column type detection
│ ├── api/ # REST handlers & router
│ ├── config/ # Server configuration
│ ├── dataset/ # Dataset types & storage
│ ├── dispatcher/ # Job dispatcher
│ ├── executor/ # ETL executors (clean, normalize, dedup)
│ ├── models/ # Shared data models
│ ├── monitor/ # Health & metrics
│ ├── queue/ # Priority job queue
│ ├── store/ # In-memory job store
│ ├── validator/ # Input validation
│ └── worker/ # Concurrent worker pool
├── frontend/
│ └── go-distributed-ui/ # React + Vite app
└── README.md
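
As an illustration of what the dedup executor under `internal/executor/` has to do, here is a rough sketch of exact-match deduplication with a keep strategy and a dry-run switch. The `Row` and `DedupReport` types, the `Dedup` signature, and the `"first"`/`"last"` strategy names are assumptions; fuzzy matching and group pagination are out of scope here.

```go
package main

import (
	"fmt"
	"strings"
)

// Row is one CSV record keyed by column name (an assumed representation).
type Row map[string]string

// DedupReport summarizes what a dedup pass found.
type DedupReport struct {
	Groups  int // number of keys that occurred more than once
	Removed int // rows that were, or would be, dropped
}

// Dedup drops rows whose values in matchCols are identical.
// keep is "first" or "last" (hypothetical strategy names). With dryRun set,
// the input is returned unchanged and only the report is computed.
func Dedup(rows []Row, matchCols []string, keep string, dryRun bool) ([]Row, DedupReport) {
	index := map[string]int{} // key -> position of the kept row in out
	seen := map[string]int{}  // key -> number of occurrences
	out := make([]Row, 0, len(rows))

	for _, r := range rows {
		parts := make([]string, len(matchCols))
		for i, c := range matchCols {
			parts[i] = r[c]
		}
		// Join with the ASCII unit separator, which is unlikely to occur in CSV data.
		key := strings.Join(parts, "\x1f")
		seen[key]++

		if pos, dup := index[key]; dup {
			if keep == "last" {
				out[pos] = r // replace the earlier row in place
			}
			continue
		}
		index[key] = len(out)
		out = append(out, r)
	}

	rep := DedupReport{Removed: len(rows) - len(out)}
	for _, n := range seen {
		if n > 1 {
			rep.Groups++
		}
	}
	if dryRun {
		return rows, rep
	}
	return out, rep
}

func main() {
	rows := []Row{
		{"email": "a@example.com", "name": "Ann"},
		{"email": "b@example.com", "name": "Bob"},
		{"email": "a@example.com", "name": "Ann B."},
	}

	preview, rep := Dedup(rows, []string{"email"}, "first", true)
	fmt.Println(len(preview), rep.Groups, rep.Removed) // 3 1 1

	clean, _ := Dedup(rows, []string{"email"}, "last", false)
	fmt.Println(len(clean), clean[0]["name"]) // 2 Ann B.
}
```

Computing the report on every call and only skipping the mutation in dry-run mode keeps the preview and the real run on the same code path, so the two cannot drift apart.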
MIT — see LICENSE for details.
Built as a learning exercise in distributed systems and Go concurrency.
Star the repo if you find it interesting!