This project is a simplified implementation of the Hadoop Distributed File System (HDFS) using Python. It is designed as a mini project to demonstrate core distributed file system concepts such as metadata management, block replication, fault tolerance, and client-server communication.
The objective of this project is to simulate the basic architecture and workflow of HDFS by implementing:
- A NameNode for metadata management
- Multiple DataNodes for storing file blocks
- Block replication across DataNodes
- Fault tolerance through replica-based recovery
- A Secondary NameNode for checkpointing
- A Client to interact with the system
This project is intended strictly for academic and learning purposes.
```
Src/
├── client/
│   └── client.py
├── datanode_0.py
├── datanode_1.py
├── datanode_2.py
├── namenode.py
├── secondary_namenode.py
├── config.py
└── README.md
```
The NameNode acts as the master of the system. It maintains all filesystem metadata including file names, directory structure, block-to-DataNode mappings, and replication information. It monitors DataNode availability and manages block placement and recovery.
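To make the metadata role concrete, here is a minimal sketch of the kind of in-memory table the NameNode might keep. The names (`BlockInfo`, `NameNodeMetadata`, the method names) are illustrative, not taken from this repository's code:

```python
from dataclasses import dataclass, field

@dataclass
class BlockInfo:
    # Illustrative record for one block: its id and the DataNodes holding replicas.
    block_id: str
    replicas: list = field(default_factory=list)

class NameNodeMetadata:
    """Minimal metadata table: file name -> ordered list of blocks."""

    def __init__(self):
        self.files = {}

    def add_file(self, name, block_ids):
        # Register a new file and the ids of its blocks.
        self.files[name] = [BlockInfo(b) for b in block_ids]

    def assign_replica(self, name, block_id, datanode):
        # Record that `datanode` now holds a replica of `block_id`.
        for blk in self.files[name]:
            if blk.block_id == block_id:
                blk.replicas.append(datanode)

    def locations(self, name):
        # Block-to-DataNode mapping used to answer client read requests.
        return {blk.block_id: list(blk.replicas) for blk in self.files[name]}
```

A real NameNode would also persist this state (fsimage plus edit log) and track DataNode liveness; this sketch only shows the block-to-DataNode mapping.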
DataNodes store the actual data blocks of files. Each block is replicated across multiple DataNodes as specified by the replication factor. DataNodes respond to read/write requests from clients and periodically report their status to the NameNode.
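A DataNode's two jobs (serving block reads/writes and reporting status) can be sketched as follows. This is a simplified in-memory model: the real DataNodes in this project would persist blocks and communicate over sockets, and the method names here are assumptions:

```python
import time

class DataNode:
    """Simplified DataNode: stores blocks in memory and reports its status."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> raw bytes

    def write_block(self, block_id, data):
        # Store one replica of a block.
        self.blocks[block_id] = data

    def read_block(self, block_id):
        # Serve a client read; None if this node has no such block.
        return self.blocks.get(block_id)

    def heartbeat(self):
        # Periodic status report; the NameNode uses missing heartbeats
        # to detect DataNode failure.
        return {"node": self.node_id,
                "blocks": sorted(self.blocks),
                "timestamp": time.time()}
```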
The Secondary NameNode periodically performs checkpointing by merging filesystem metadata and edit logs. This helps reduce NameNode recovery time in case of failure. It does not replace the NameNode.
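The checkpoint operation is essentially a merge: apply the edit log to the last saved filesystem image to produce a new image. A minimal sketch, assuming a dict-based image and a simple add/delete operation log (both representations are assumptions, not the repository's actual format):

```python
def checkpoint(fsimage, edit_log):
    """Merge an edit log into a filesystem image, as a Secondary NameNode would.

    fsimage:  dict mapping file name -> list of block ids (last checkpoint)
    edit_log: list of (op, file_name, block_ids) entries recorded since then
    Returns the new image; the NameNode can then truncate its edit log,
    which keeps recovery time short after a crash.
    """
    image = dict(fsimage)  # do not mutate the old checkpoint
    for op, name, blocks in edit_log:
        if op == "add":
            image[name] = blocks
        elif op == "delete":
            image.pop(name, None)
    return image
```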
The client provides an interface for users to interact with the mini HDFS system. It supports operations such as uploading files, reading files, and listing stored files.
- Each file is divided into fixed-size blocks.
- Every block is replicated across multiple DataNodes.
- If a DataNode fails, the NameNode detects the failure and redirects read requests to available replicas.
- Replication ensures data availability and reliability in case of node failures.
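The two mechanisms above, fixed-size splitting and replica placement, can be sketched in a few lines. The block size, replication factor, and round-robin placement policy below are illustrative choices, not necessarily what this project's `config.py` uses:

```python
BLOCK_SIZE = 8    # bytes per block; tiny on purpose, for illustration
REPLICATION = 2   # assumed replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Round-robin placement: each block gets `replication` distinct DataNodes."""
    placement = []
    for b in range(num_blocks):
        placement.append([datanodes[(b + r) % len(datanodes)]
                          for r in range(replication)])
    return placement
```

With three DataNodes and replication factor 2, each block survives any single node failure, because its second replica always lives on a different node.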
- Python 3.7 or higher
- Linux or Unix-based operating system recommended
- No external libraries required (Python standard library only)
```
git clone https://github.com/kaushal1014/18_Project1_BD.git
cd 18_Project1_BD
```
Edit the config.py file to configure:
- Number of DataNodes
- Replication factor
- Hostnames and ports
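As an illustration only, a `config.py` covering those settings might look like the following; the actual variable names and values in this repository may differ:

```python
# config.py -- illustrative values; adjust hosts/ports to your setup
NUM_DATANODES = 3
REPLICATION_FACTOR = 2
BLOCK_SIZE = 1024 * 1024  # 1 MiB per block

NAMENODE_HOST = "localhost"
NAMENODE_PORT = 9000

# One (host, port) pair per DataNode
DATANODES = [
    ("localhost", 9001),
    ("localhost", 9002),
    ("localhost", 9003),
]
```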
```
python3 namenode.py
```
Run each DataNode in a separate terminal:
```
python3 datanode_0.py
python3 datanode_1.py
python3 datanode_2.py
```
```
python3 client/client.py
```
(The exact commands depend on the client implementation.)
- The client sends a request to store a file.
- The NameNode splits the file into blocks and decides replica placement.
- DataNodes store block replicas.
- Metadata is updated in the NameNode.
- During read operations, the client accesses the nearest available replica.
- On DataNode failure, the NameNode ensures continued access via replicas.
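The failure-handling step in this flow is a simple failover loop over a block's replica list. A minimal sketch, where node availability is modeled with an in-memory mapping (a failed node maps to `None`):

```python
def read_block(block_id, replicas, nodes):
    """Failover read: try each replica location until a live node has the block.

    replicas: ordered list of DataNode names returned by the NameNode
    nodes:    mapping of node name -> dict of stored blocks (None if down)
    """
    for node_name in replicas:
        store = nodes.get(node_name)
        if store is not None and block_id in store:
            # First live replica wins; remaining replicas are fallbacks.
            return store[block_id]
    raise IOError(f"no live replica for block {block_id}")
```

If the first replica's node is down, the read silently falls through to the next replica, which is exactly how replication keeps the file readable after a single-node failure.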
- Understanding HDFS architecture
- Demonstrating replication and fault tolerance
- Academic mini project for Big Data and Distributed Systems courses
This mini HDFS project demonstrates key features of HDFS including block replication and fault tolerance. It provides practical insight into how distributed storage systems maintain data reliability and availability.