Skip to content

The project explores the advantages of Parallel and Distributed Processing (PDP) techniques to improve efficiency in data analysis for large datasets. By utilizing the NYC Taxi Trip Dataset (yellow_tripdata_2015-01.csv), we aim to demonstrate a significant reduction in processing time using PDP methods compared to traditional sequential processing.

Notifications You must be signed in to change notification settings

sha10-bit/Efficient-Data-Processing-Using-Parallel-and-Distributed-Processing-

Repository files navigation

Overview The project explores the advantages of Parallel and Distributed Processing (PDP) techniques to improve efficiency in data analysis for large datasets. By utilizing the NYC Taxi Trip Dataset (yellow_tripdata_2015-01.csv), we aim to demonstrate a significant reduction in processing time using PDP methods compared to traditional sequential processing. Problem Statement Processing large datasets with traditional sequential methods is time-consuming and inefficient. PDP provides a scalable solution by leveraging concurrent processing to reduce execution time, making it ideal for big data applications.

Objectives • Compare sequential and parallel processing performance on NYC taxi data. • Implement an OOP-based, modular, and reusable code structure. • Achieve a 60%-time reduction using PDP techniques. Methodology

  1. Data Preparation: Clean and load the dataset.
  2. Sequential Processing: Analyze data using pandas for baseline performance.
  3. Parallel Processing: Process data using Dask for speed improvements.
  4. Performance Comparison: Measure and evaluate the time saved by PDP. Tools & Technologies Programming Language: • Python Libraries:
  5. pandas: For data analysis and manipulation in sequential processing.
  6. Dask: For parallel and distributed data processing.
  7. time: To measure execution time for performance evaluation. Dataset: • NYC Taxi Trip Data: yellow_tripdata_2015-01.csv, containing detailed information about taxi trips in New York City.

Evaluation Criteria

  1. Efficiency: PDP must process the data in less than 60% of the time taken by the sequential method.
  2. Code Quality: Modular design, proper documentation, and OOP principles.
  3. Execution Time: Sequential processing > 5 minutes, PDP < 2 minutes.
  4. Object-Oriented Design: Code is structured using OOP principles to ensure modularity, reusability, and scalability. Separate classes handle distinct functionalities (e.g., data loading, processing).

Deliverables • Modular Python code implementing sequential and parallel methods. • Documentation with instructions and method explanations. • A performance comparison report showcasing time savings.

Conclusion This project aims to provide a practical demonstration of how PDP techniques can significantly improve data processing efficiency for large datasets. By comparing sequential and parallel processing approaches on NYC taxi data, the project will illustrate the scalability and speed benefits of PDP, making it a valuable resource for tackling big data challenges in various industries. The emphasis on OOP design ensures that the solution is not only effective but also maintainable and extensible.

About

The project explores the advantages of Parallel and Distributed Processing (PDP) techniques to improve efficiency in data analysis for large datasets. By utilizing the NYC Taxi Trip Dataset (yellow_tripdata_2015-01.csv), we aim to demonstrate a significant reduction in processing time using PDP methods compared to traditional sequential processing.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages