MiniSpark

MiniSpark is a minimal, educational re-implementation of Apache Spark in pure Python. It uses the multiprocessing module to simulate a driver and worker cluster, supports lazy transformations and actions, stage/DAG planning, and simple DAG visualization.

🚀 Features

Core RDD API
- map, filter, flatMap, sample
- reduceByKey, join, count, collect, take, etc.
Lazy Evaluation & Stage Planning
- Builds a lineage DAG of transformations
- Splits at shuffle boundaries into narrow vs. wide stages
Multiprocessing Scheduler
- Driver dispatches tasks to worker processes
- Timeout-based retry on worker failure with respawn (fault-tolerance)
- Graceful shutdown
Persistence & Caching
- rdd.cache() to memoize results and skip re-computation on repeated actions
DAG Visualization (optional)
- Uses networkx + Graphviz (or spring-layout fallback)
- Annotates narrow vs. wide transformations
Examples
- Word count, sampling, letter count
- DAG visualization demo

📦 Installation

Clone the repo

git clone https://github.com/MadhurDixit13/MiniSpark.git
cd mini_pyspark

Install in editable mode
```
pip install -e .
```

(Optional) Install extras for visualization

pip install networkx matplotlib pygraphviz

🎓 Quick Start

Run the word-count example
```
python -m examples.word_count.py
```

📂 Project Structure

minispark/
├── mini_pyspark/
│   ├── context.py      # SparkContext, visualize(), plan()
│   ├── rdd.py          # RDD class with lazy transforms + cache
│   ├── scheduler.py    # TaskScheduler with stage execution and with fault-tolerance
│   ├── worker.py       # Worker loop for map/filter tasks with failure simulation
│   ├── planner.py      # Lineage walker & stage splitter
│   └── viz.py          # NetworkX/Graphviz DAG drawing
├── examples/
│   ├── word_count.py        # Core word-count demo
├── sample.txt         # Sample data for examples
└── README.md

📖 How It Works

Driver builds an RDD lineage with lazy transforms.
Scheduler requests stages from planner.py, dispatches narrow transforms to workers, then performs shuffle & reduce centrally.
Workers apply map/filter/flatMap/sample functions in parallel.
Collect gathers results, groups by key, and applies reduceByKey.
Visualization can render the DAG before computation.

🤝 Contributing

Fork the repo
Create a feature branch
Implement & test your feature
Submit a Pull Request

Please adhere to the existing code style and add examples/tests for new functionality.

📄 License

This project is released under the GNU General Purpose License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
examples		examples
mini_pyspark		mini_pyspark
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
sample.txt		sample.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MiniSpark

🚀 Features

📦 Installation

🎓 Quick Start

📂 Project Structure

📖 How It Works

🤝 Contributing

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

MadhurDixit13/MiniSpark

Folders and files

Latest commit

History

Repository files navigation

MiniSpark

🚀 Features

📦 Installation

🎓 Quick Start

📂 Project Structure

📖 How It Works

🤝 Contributing

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages