
A modular data-driven framework leveraging Spark, Delta Lake, MLflow, Airflow, and Streamlit to build and orchestrate a full-stack data lake solution. Designed for scalable ETL, automated ML pipelines, and interactive dashboarding.


🚀 BDM Lab 3: Data Lake Architecture with Spark, MLflow & Airflow

L3-T01: Julian Romero & Moritz Peist

Big Data Management for Data Science (23D020)


📋 Assignment Overview

This project implements a complete data lake architecture using PySpark, Delta Lake, MLflow, and Airflow to analyze Barcelona housing market data through three-zone processing (Landing → Formatted → Exploitation). Interactive dashboards and ML model management are bundled in a comprehensive output layer built with Streamlit. Final submission:

✅ Data Management Backbone (Tasks A.1 - A.4)

✅ Data Analysis Backbone (Tasks B.1 & B.2 - Both Implemented!)

✅ Bonus Point

🧠 Project Architecture

📁 Data Lake Zones (Local File System)

```
data_zones/
├── 01_landing/      # Raw data ingestion (JSON/CSV)
├── 02_formatted/    # Standardized Delta tables
└── 03_exploitation/ # Analytics-ready datasets
```

🔄 Pipeline Flow

```mermaid
graph TD
    A[Landing Zone] -->|A.2: Spark ETL| B[Formatted Zone]
    B -->|A.3: Analytics Transform| C[Exploitation Zone]
    C -->|A.4: Data Validation| D[Quality Reports]
    C -->|B.1: Descriptive| E[Streamlit Dashboards]
    C -->|B.2: ML Pipeline| F[MLflow Models]
    G[Airflow] -->|Orchestrates| A
    G -->|Orchestrates| B
    G -->|Orchestrates| C
```

📊 Selected Datasets & KPIs

Based on Assignment Task A.1, we selected these datasets from Open Data BCN:

🏠 Core Datasets

  1. Idealista (JSON, 21,389 records) - Real estate listings with prices, locations, characteristics
  2. Income (CSV, 811 records) - Socioeconomic data by district/neighborhood (2007-2017)
  3. Cultural Sites (CSV, 871 records) - Distribution of cultural amenities across Barcelona

📈 Key Performance Indicators (KPIs)

Our analysis focuses on housing affordability, socioeconomic equity, and urban quality of life:

Housing Market KPIs

  • Average Price per mΒ² by District
  • Property Type Distribution & Pricing
  • Market Supply by Neighborhood

Socioeconomic KPIs

  • Income Inequality Index (Coefficient of Variation)
  • Housing Affordability Ratio (Price vs Income)
  • Economic Accessibility Correlation

Cultural Accessibility KPIs

  • Cultural Density (sites per 1000 residents)
  • Cultural-Economic Correlation Analysis
  • Amenity Distribution Equity

Composite KPIs

  • Neighborhood Attractiveness Score
  • Spatial Equity Index across Income Quintiles

📄 Detailed KPI documentation: Dataset Selection and KPI Definition notebook
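
To make the KPI definitions concrete, here is a minimal PySpark sketch of how three of them could be computed from exploitation-zone tables. The table paths and column names (`price`, `size`, `income_per_capita`) are illustrative assumptions, not the project's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kpi-sketch").getOrCreate()  # assumes a Delta-enabled session

# Load exploitation-zone tables (paths and column names are assumptions)
listings = spark.read.format("delta").load("data_zones/03_exploitation/listings")
income = spark.read.format("delta").load("data_zones/03_exploitation/income")

# Average Price per m² by District
price_per_m2 = (
    listings
    .withColumn("price_per_m2", F.col("price") / F.col("size"))
    .groupBy("district")
    .agg(F.avg("price_per_m2").alias("avg_price_per_m2"))
)

# Income Inequality Index: coefficient of variation = stddev / mean
inequality = income.groupBy("district").agg(
    (F.stddev("income_per_capita") / F.avg("income_per_capita")).alias("income_cv")
)

# Housing Affordability Ratio: average price per m² relative to average income
avg_income = income.groupBy("district").agg(
    F.avg("income_per_capita").alias("avg_income")
)
affordability = price_per_m2.join(avg_income, "district").withColumn(
    "affordability_ratio", F.col("avg_price_per_m2") / F.col("avg_income")
)
```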

🛠️ Technology Stack

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Data Processing | Apache Spark (PySpark) | 4.0 | Distributed ETL and analytics |
| Data Storage | Delta Lake | 4.0 | ACID transactions, schema evolution |
| Orchestration | Apache Airflow | 3.0.2 | Workflow scheduling and monitoring |
| ML Tracking | MLflow | 3.0.0 | Experiment tracking and model registry |
| Visualization | Streamlit | Latest | Interactive dashboards and data exploration |
| Containerization | Docker & Docker Compose | Latest | Reproducible deployment |

🚀 Quick Start

1. 📦 Prerequisites

```bash
# Required
docker --version
docker-compose --version

# Optional (for notebooks)
pip install uv
```

2. 🐳 Start All Services

```bash
# Clone and navigate
git clone <repository-url>
cd bdm3

# Start the complete stack
docker-compose up --build
```

3. 🌐 Access Applications

4. 📊 Upload & Process Data

  1. Upload datasets via Streamlit Landing Zone interface
  2. Trigger pipeline in Airflow: bcn_data_pipeline_with_validation
  3. Monitor progress through Airflow UI
  4. Explore results in Streamlit dashboards and MLflow experiments

📋 Assignment Task Implementation

A.1: Data Exploration & KPI Selection ✅

  • Location: notebooks/a1.ipynb + KPI Documentation
  • Deliverable: Selected 3 datasets (1 JSON + 2 CSV) and defined 10 comprehensive KPIs
  • Validation: Interactive EDA dashboards in Streamlit

A.2: Data Formatting Pipeline ✅

  • Location: src/airflow/dags/pipelines/a2.py
  • Process: Raw data β†’ Standardized Delta tables with partitioning
  • Output: data_zones/02_formatted/ with cleaned, typed data
  • Validation: Schema enforcement and data quality checks

A.3: Exploitation Zone Pipeline ✅

  • Location: src/airflow/dags/pipelines/a3.py
  • Process: Formatted data β†’ Analytics-ready datasets with aggregations
  • Output: data_zones/03_exploitation/ with 9 analytical tables
  • Features: KPI calculations, cross-dataset joins, feature engineering

A.4: Data Validation Pipeline ✅

  • Location: src/airflow/dags/pipelines/a4.py
  • Process: Exploitation data → automated data quality reports (see the sketch below)
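
An illustrative quality check in the style of this step (not the actual a4.py implementation): compute per-column null rates and a row count for an exploitation table and print a small report. The table path is an assumption:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("a4-validation").getOrCreate()  # Delta-enabled session assumed

df = spark.read.format("delta").load("data_zones/03_exploitation/district_housing_income")

# Null rate per column: count of nulls divided by total rows
report = df.select([
    (F.count(F.when(F.col(c).isNull(), c)) / F.count("*")).alias(f"{c}_null_rate")
    for c in df.columns
])

print(f"rows={df.count()}")
report.show(truncate=False)
```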

B.1: Descriptive Analysis & Dashboarding ✅

  • Location: src/streamlit_ui/
  • Features:
    • Interactive data exploration with quality metrics
    • Real-time pipeline monitoring and validation reports
    • Multi-zone data browsing (Landing/Formatted/Exploitation)
    • KPI dashboards with Barcelona housing market insights

B.2: Predictive Analysis & Model Management ✅

  • Location: src/ml_experiments/house_price_prediction.py
  • Models: Linear Regression + Random Forest for house price prediction
  • Features: MLflow experiment tracking, model registry, automatic deployment
  • Integration: Airflow DAG for model training and serving pipeline

Bonus: Orchestration Framework ✅

  • Location: src/airflow/dags/airflow_orchestration.py
  • Features: Complete pipeline automation with dependencies, error handling, notifications
  • Compatibility: Airflow 3.0+ with modern TaskFlow API and DAG versioning

πŸ“ Project Structure

```
bdm3/
├── 📊 data_zones/             # Data Lake Implementation
│   ├── 01_landing/            # Raw data ingestion
│   ├── 02_formatted/          # A.2: Standardized Delta tables
│   └── 03_exploitation/       # A.3: Analytics-ready datasets
│
├── 📓 latex/                  # LaTeX sources for the report
│   └── ...
│
├── 📓 notebooks/              # A.1: Exploratory Analysis
│   ├── a1.ipynb               # Data exploration & KPI selection
│   ├── a2.ipynb               # Formatting pipeline development
│   └── a3.ipynb               # Exploitation pipeline development
│
├── 🔧 src/                    # Application Implementation
│   ├── airflow/               # Bonus: Pipeline Orchestration
│   │   ├── dags/
│   │   │   ├── pipelines/
│   │   │   │   ├── a2.py      # A.2: Data Formatting
│   │   │   │   ├── a3.py      # A.3: Data Exploitation
│   │   │   │   └── a4.py      # A.4: Data Validation
│   │   │   └── airflow_orchestration.py
│   │   └── Dockerfile
│   │
│   ├── ml_experiments/        # B.2: Predictive Analysis
│   │   └── house_price_prediction.py
│   │
│   ├── streamlit_ui/          # B.1: Dashboarding
│   │   ├── app.py             # Main Streamlit application
│   │   ├── sections/          # Multi-page interface
│   │   └── Dockerfile
│   │
│   └── utils/                 # Shared utilities
│       └── eda_dashboard.py   # Interactive data exploration
│
├── 📁 outputs/                # Pipeline Outputs
│   └── mlruns/                # MLflow tracking data
│
├── 🐳 docker-compose.yml      # Service orchestration
├── 📄 L3-T01_submission.pdf   # Final submission document
├── 📄 main.tex                # LaTeX source for the final document
├── ⌨️ pyproject.toml          # Python dependencies
├── 📖 README.md               # This file
└── ⌨️ uv.lock                 # Locked dependency versions
```

🔧 Development Setup

Local Development (Optional)

```bash
# Install uv for Python environment management
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and sync environment
uv sync

# Run notebooks locally
uv run jupyter lab notebooks/
```

Manual Pipeline Execution

```bash
# Run individual pipelines (requires local Spark setup)
python src/airflow/dags/pipelines/a2.py  # Data Formatting
python src/airflow/dags/pipelines/a3.py  # Data Exploitation
python src/airflow/dags/pipelines/a4.py  # Data Validation
```
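
Running the scripts outside Docker requires a local Spark session with Delta Lake configured. A minimal sketch, assuming the delta-spark pip package is installed; the memory setting is illustrative:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("bdm3-local")
    .master("local[*]")
    .config("spark.driver.memory", "4g")  # adjust to your machine
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# Pulls in the matching Delta Lake JARs for the installed PySpark version
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```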

📈 Key Features & Innovations

🏗️ Modern Data Architecture

  • Delta Lake 4.0 for ACID transactions and schema evolution
  • Spark 4.0 for distributed data processing
  • Airflow 3.0 with modern TaskFlow API and DAG versioning

📊 Comprehensive Analytics

  • 10 carefully selected KPIs addressing housing affordability and urban equity
  • Cross-dataset integration for neighborhood attractiveness scoring
  • Real-time data quality monitoring with automated validation reports

🤖 End-to-End ML Pipeline

  • Automated model training with hyperparameter tracking
  • Model registry with automatic deployment of best-performing models
  • Experiment reproducibility through MLflow integration

πŸŽ›οΈ Production-Ready Orchestration

  • Airflow 3.0+ compatibility with asset-based scheduling
  • Comprehensive error handling and notification system
  • Pipeline dependency management with automatic validation

🖥️ User-Friendly Interface

  • Multi-page Streamlit application for data exploration and pipeline management
  • Interactive data quality dashboards with real-time metrics
  • Integrated tool access (Airflow/MLflow) through embedded interfaces

🎯 Assignment Deliverables

✅ Three Python Scripts/Notebooks:

  • src/airflow/dags/pipelines/a2.py (Data Formatting)
  • src/airflow/dags/pipelines/a3.py (Data Exploitation)
  • src/airflow/dags/pipelines/a4.py (Data Validation)

✅ PDF Documentation: Dataset Selection and KPI Definition.md

✅ Additional Implementations:

  • Streamlit dashboarding application (A+B)
  • MLflow model management pipeline (B.2)
  • Complete Airflow orchestration (Bonus)

App Overview

Zones

*(Screenshots: Landing Zone, Formatted Zone, Exploitation Zone)*

Airflow Job Scheduler

*(Screenshots: Airflow DAGs, Airflow pipeline)*

MLflow Experiments

*(Screenshots: MLflow experiments, MLflow models)*

Data Sanity Dashboard

*(Screenshot: Data sanity dashboard)*

🚨 Important Notes

💡 Data Requirements: The repository already contains the initial data in data_zones/01_landing/ to facilitate running the pipelines.

⚠️ Service Dependencies: MLflow and Airflow services must be running for full functionality.

⚠️ Resource Usage: Spark jobs are configured for local execution; adjust memory settings in Docker if needed.
