MiniSpark is a minimal, educational re-implementation of Apache Spark in pure Python. It uses the multiprocessing module to simulate a driver and worker cluster, supports lazy transformations and actions, stage/DAG planning, and simple DAG visualization.
- Core RDD API
map,filter,flatMap,samplereduceByKey,join,count,collect,take, etc.
- Lazy Evaluation & Stage Planning
- Builds a lineage DAG of transformations
- Splits at shuffle boundaries into narrow vs. wide stages
- Multiprocessing Scheduler
- Driver dispatches tasks to worker processes
- Timeout-based retry on worker failure with respawn (fault-tolerance)
- Graceful shutdown
- Persistence & Caching
rdd.cache()to memoize results and skip re-computation on repeated actions
- DAG Visualization (optional)
- Uses
networkx+ Graphviz (or spring-layout fallback) - Annotates narrow vs. wide transformations
- Uses
- Examples
- Word count, sampling, letter count
- DAG visualization demo
-
Clone the repo
git clone https://github.com/MadhurDixit13/MiniSpark.git cd mini_pyspark -
Install in editable mode
pip install -e . -
(Optional) Install extras for visualization
pip install networkx matplotlib pygraphviz
- Run the word-count example
python -m examples.word_count.py
minispark/
├── mini_pyspark/
│ ├── context.py # SparkContext, visualize(), plan()
│ ├── rdd.py # RDD class with lazy transforms + cache
│ ├── scheduler.py # TaskScheduler with stage execution and with fault-tolerance
│ ├── worker.py # Worker loop for map/filter tasks with failure simulation
│ ├── planner.py # Lineage walker & stage splitter
│ └── viz.py # NetworkX/Graphviz DAG drawing
├── examples/
│ ├── word_count.py # Core word-count demo
├── sample.txt # Sample data for examples
└── README.md
- Driver builds an
RDDlineage with lazy transforms. - Scheduler requests stages from
planner.py, dispatches narrow transforms to workers, then performs shuffle & reduce centrally. - Workers apply
map/filter/flatMap/samplefunctions in parallel. - Collect gathers results, groups by key, and applies
reduceByKey. - Visualization can render the DAG before computation.
- Fork the repo
- Create a feature branch
- Implement & test your feature
- Submit a Pull Request
Please adhere to the existing code style and add examples/tests for new functionality.
This project is released under the GNU General Purpose License. See LICENSE for details.