This project is a comprehensive End-to-End Data Engineering Pipeline that transforms raw Uber trip data into actionable insights. Built as part of my journey as a Computer Science student at Alamein International University (AIU), this project covers the entire data lifecycle: from cloud storage and orchestration to data modeling and visualization.
The pipeline follows a modern data stack approach:
- Data Source: Raw Uber CSV data stored in Google Cloud Storage (GCS).
- Orchestration: Mage AI running on a GCP Compute Engine (VM) instance.
- Processing: Data cleaning and transformation using Python (Pandas).
- Data Modeling: Designing a Star Schema with Fact and Dimension tables.
- Data Warehouse: Google BigQuery for high-performance analytics.
- Final Layer: Custom SQL joins for the analytics table.
- Visualization: Interactive Dashboard built with Google Looker Studio.
| Tool | Purpose |
|---|---|
| Python | Data Transformation & ETL Logic |
| Mage AI | Modern Data Pipeline Orchestration |
| GCP (Compute Engine) | Virtual Machine Hosting |
| GCP (GCS) | Raw Data Lake Storage |
| BigQuery | Cloud Data Warehousing |
| SQL | Analytics Table Construction |
| Looker Studio | BI & Dashboarding |
To optimize query performance and maintain data integrity, the data was modeled into a Star Schema:
- Fact Table:
fact_table(Measures and FKs). - Dimension Tables: *
datetime_dimpassenger_count_dimtrip_distance_dimrate_code_dimpickup_location_dimdropoff_location_dimpayment_type_dim
Building this pipeline wasn't without its hurdles. Here’s how I tackled the technical challenges:
- Environment Isolation (PEP 668):
- Issue: Encountered the
externally-managed-environmenterror on Python 3.11 when installing GCP libraries. - Fix: Managed the installation using the
--break-system-packagesflag to ensure the VM environment had the necessary BigQuery SDKs.
- Issue: Encountered the
- Mage Exporter Logic:
- Issue: The initial exporter block attempted to iterate through columns as tables, causing
NameErrorandTable not found. - Fix: Refactored the Python logic to correctly handle a dictionary of DataFrames, ensuring each dimension table was exported individually to BigQuery.
- Issue: The initial exporter block attempted to iterate through columns as tables, causing
- Geospatial Data Visualization:
- Issue: Looker Studio initially failed to recognize Latitude and Longitude fields.
- Fix: Reconfigured the Data Source field types to Geo coordinates to enable the Map visualizations.
The final analytics layer provides a deep dive into Uber's operations:
- Revenue Analysis: ~$1.6M total revenue processed.
- Geospatial Mapping: Identification of high-density pickup zones in New York.
- Operational Efficiency: Average trip distances and payment method preferences.
I'm Mohamed Amer, a 1st-year CS student at AIU, deeply interested in Cloud Infrastructure, Data Engineering.
- GitHub: [https://github.com/mohamedamerdev-coder]
- LinkedIn: [www.linkedin.com/in/mohamed-amer-46733833b]
