Efficient Data Processing Using Parallel and Distributed Processing (GitHub: sha10-bit/Efficient-Data-Processing-Using-Parallel-and-Distributed-Processing-)

Overview

This project explores how Parallel and Distributed Processing (PDP) techniques improve the efficiency of data analysis on large datasets. Using the NYC Taxi Trip Dataset (yellow_tripdata_2015-01.csv), we aim to demonstrate a significant reduction in processing time with PDP methods compared to traditional sequential processing.

Problem Statement

Processing large datasets with traditional sequential methods is time-consuming and inefficient. PDP offers a scalable alternative: it runs work concurrently to reduce execution time, which makes it well suited to big data applications.

Objectives

• Compare sequential and parallel processing performance on NYC taxi data.
• Implement an OOP-based, modular, and reusable code structure.
• Achieve a 60% reduction in processing time using PDP techniques.

Methodology

Data Preparation: Clean and load the dataset.
Sequential Processing: Analyze data using pandas for baseline performance.
Parallel Processing: Process data using Dask for speed improvements.
Performance Comparison: Measure and evaluate the time saved by PDP.

Tools & Technologies

Programming Language:
• Python

Libraries:
pandas: For data analysis and manipulation in sequential processing.
Dask: For parallel and distributed data processing.
time: To measure execution time for performance evaluation.

Dataset:
• NYC Taxi Trip Data: yellow_tripdata_2015-01.csv, containing detailed information about taxi trips in New York City.

Evaluation Criteria

Efficiency: PDP must reduce processing time by at least 60% relative to the sequential method, that is, finish in no more than 40% of the sequential time.
Code Quality: Modular design, proper documentation, and OOP principles.
Execution Time: Sequential processing > 5 minutes, PDP < 2 minutes.
Object-Oriented Design: Code is structured using OOP principles to ensure modularity, reusability, and scalability. Separate classes handle distinct functionalities (e.g., data loading, processing).
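One possible shape for the class separation described above is sketched here. The class names (CsvLoader, Processor, MeanDistanceByPassengerCount, Benchmark) are hypothetical and do not come from the repository, and the processor is a pure-Python stand-in for a pandas- or Dask-backed implementation so that the example runs with the standard library alone. The column names are likewise assumed.

```python
import csv
import time
from abc import ABC, abstractmethod
from collections import defaultdict


class CsvLoader:
    """Data loading only: reads a CSV file into a list of row dictionaries."""

    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path, newline="") as handle:
            return list(csv.DictReader(handle))


class Processor(ABC):
    """Common interface so sequential and parallel strategies are interchangeable."""

    @abstractmethod
    def process(self, rows):
        """Return the analysis result for the given rows."""


class MeanDistanceByPassengerCount(Processor):
    """Stand-in processor: mean trip distance per passenger count (assumed columns)."""

    def process(self, rows):
        totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
        for row in rows:
            bucket = totals[row["passenger_count"]]
            bucket[0] += float(row["trip_distance"])
            bucket[1] += 1
        return {key: total / count for key, (total, count) in totals.items()}


class Benchmark:
    """Times any Processor without knowing how it works internally."""

    def __init__(self, processor):
        self.processor = processor

    def run(self, rows):
        start = time.perf_counter()
        result = self.processor.process(rows)
        elapsed = time.perf_counter() - start
        return result, elapsed
```

Because Benchmark depends only on the Processor interface, a Dask-backed subclass could be dropped in and timed with the same harness, which is the kind of reusability the criterion asks for.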

Deliverables

• Modular Python code implementing sequential and parallel methods.
• Documentation with instructions and method explanations.
• A performance comparison report showcasing time savings.
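The time savings in such a report come down to simple arithmetic on two measurements. The helper functions below are a hypothetical sketch (their names and the report layout are not taken from the repository); the 60% default target mirrors the stated objective, and the sample figures of 300 and 120 seconds correspond to the 5-minute and 2-minute thresholds in the evaluation criteria.

```python
def time_reduction_percent(sequential_s, parallel_s):
    """Percentage of the sequential run time saved by the parallel run."""
    if sequential_s <= 0:
        raise ValueError("sequential time must be positive")
    return 100.0 * (sequential_s - parallel_s) / sequential_s


def speedup(sequential_s, parallel_s):
    """How many times faster the parallel run is than the sequential run."""
    if parallel_s <= 0:
        raise ValueError("parallel time must be positive")
    return sequential_s / parallel_s


def format_report(sequential_s, parallel_s, target_percent=60.0):
    """Build a short plain-text comparison report."""
    reduction = time_reduction_percent(sequential_s, parallel_s)
    status = "met" if reduction >= target_percent else "not met"
    lines = [
        f"Sequential time: {sequential_s:.1f} s",
        f"Parallel time:   {parallel_s:.1f} s",
        f"Speedup:         {speedup(sequential_s, parallel_s):.2f}x",
        f"Time reduction:  {reduction:.1f}% (target {target_percent:.0f}%: {status})",
    ]
    return "\n".join(lines)


# Example: a 300 s sequential run against a 120 s parallel run is a 60% reduction.
# print(format_report(300, 120))
```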

Conclusion

This project aims to provide a practical demonstration of how PDP techniques can significantly improve data processing efficiency for large datasets. By comparing sequential and parallel processing approaches on NYC taxi data, the project will illustrate the scalability and speed benefits of PDP, making it a valuable resource for tackling big data challenges in various industries. The emphasis on OOP design ensures that the solution is not only effective but also maintainable and extensible.
