# L3-T01: Julian Romero & Moritz Peist
Big Data Management for Data Science (23D020)
## Table of Contents
- Assignment Completion Overview
- Project Architecture
- Selected Datasets & KPIs
- Technology Stack
- Quick Start
- Assignment Task Implementation
- Project Structure
- Development Setup
- Key Features & Innovations
- Assignment Deliverables
- App Overview
- Important Notes
This project implements a complete data lake architecture using PySpark, Delta Lake, MLflow, and Airflow to analyze Barcelona housing market data through three-zone processing (Landing → Formatted → Exploitation), with interactive dashboards and ML model management bundled into a comprehensive output layer built on Streamlit.

## Assignment Completion Overview
Final submission:
- A.1: Data exploration and KPI selection → Completed in notebook with KPI documentation
- A.2: Data formatting pipeline → Spark job: `src/airflow/dags/pipelines/a2.py`
- A.3: Exploitation zone pipeline → Spark job: `src/airflow/dags/pipelines/a3.py`
- A.4: Data validation pipeline → Spark job: `src/airflow/dags/pipelines/a4.py` and validation notebooks per task (A2, A3)
- B.1: Descriptive Analysis & Dashboarding → Streamlit UI with interactive dashboards
- B.2: Predictive Analysis & Model Management → MLflow tracking + house price prediction
- Orchestration Framework → Apache Airflow DAGs for complete pipeline automation
## Project Architecture
```text
data_zones/
├── 01_landing/       # Raw data ingestion (JSON/CSV)
├── 02_formatted/     # Standardized Delta tables
└── 03_exploitation/  # Analytics-ready datasets
```
```mermaid
graph TD
    A[Landing Zone] -->|A.2: Spark ETL| B[Formatted Zone]
    B -->|A.3: Analytics Transform| C[Exploitation Zone]
    C -->|A.4: Data Validation| D[Quality Reports]
    C -->|B.1: Descriptive| E[Streamlit Dashboards]
    C -->|B.2: ML Pipeline| F[MLflow Models]
    G[Airflow] -->|Orchestrates| A
    G -->|Orchestrates| B
    G -->|Orchestrates| C
```
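To make the flow concrete, here is a hedged sketch of how a Delta-enabled Spark session could promote data from one zone to the next. Paths mirror `data_zones/` above; the `promote` helper is illustrative, not the project's actual code.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Delta-enabled local session (configuration follows the Delta Lake quickstart)
builder = (
    SparkSession.builder.appName("bdm3-zones")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

def promote(src: str, dst: str, src_format: str = "delta") -> None:
    """Read a dataset from one zone and persist it as a Delta table in the next."""
    spark.read.format(src_format).load(src).write.format("delta").mode(
        "overwrite"
    ).save(dst)

# Example: push a formatted table into the exploitation zone
promote("data_zones/02_formatted/income", "data_zones/03_exploitation/income")
```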
## Selected Datasets & KPIs
Based on Assignment Task A.1, we selected these datasets from Open Data BCN:
- Idealista (JSON, 21,389 records) - Real estate listings with prices, locations, characteristics
- Income (CSV, 811 records) - Socioeconomic data by district/neighborhood (2007-2017)
- Cultural Sites (CSV, 871 records) - Distribution of cultural amenities across Barcelona
Our analysis focuses on housing affordability, socioeconomic equity, and urban quality of life:
- Average Price per m² by District
- Property Type Distribution & Pricing
- Market Supply by Neighborhood
- Income Inequality Index (Coefficient of Variation)
- Housing Affordability Ratio (Price vs Income)
- Economic Accessibility Correlation
- Cultural Density (sites per 1000 residents)
- Cultural-Economic Correlation Analysis
- Amenity Distribution Equity
- Neighborhood Attractiveness Score
- Spatial Equity Index across Income Quintiles
📖 Detailed KPI documentation: Dataset Selection and KPI Definition notebook
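As an illustration of how such a KPI can be derived, the following sketch computes the Housing Affordability Ratio. Column names (`price`, `size`, `district`, `avg_income`) are assumptions, not the project's actual schema.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

# Delta-enabled session, as in the architecture sketch above
spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

listings = spark.read.format("delta").load("data_zones/02_formatted/idealista")
income = spark.read.format("delta").load("data_zones/02_formatted/income")

# Average price per m² per district, then ratio against district income
price_per_m2 = listings.groupBy("district").agg(
    F.avg(F.col("price") / F.col("size")).alias("avg_price_per_m2")
)
affordability = price_per_m2.join(income, "district").withColumn(
    "affordability_ratio", F.col("avg_price_per_m2") / F.col("avg_income")
)
```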
## Technology Stack
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Data Processing | Apache Spark (PySpark) | 4.0 | Distributed ETL and analytics |
| Data Storage | Delta Lake | 4.0 | ACID transactions, schema evolution |
| Orchestration | Apache Airflow | 3.0.2 | Workflow scheduling and monitoring |
| ML Tracking | MLflow | 3.0.0 | Experiment tracking and model registry |
| Visualization | Streamlit | Latest | Interactive dashboards and data exploration |
| Containerization | Docker & Docker Compose | Latest | Reproducible deployment |
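Two Delta Lake capabilities from the table, ACID versioning and schema evolution, shown in a short illustrative snippet; the table path is an example from this project's zones.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

path = "data_zones/02_formatted/income"
income = spark.read.format("delta").load(path)

# Schema evolution: append a derived column without hand-editing the table schema
income.withColumn("ingested_at", F.current_timestamp()).write.format("delta").mode(
    "append"
).option("mergeSchema", "true").save(path)

# Time travel (ACID versioning): read the table as it was before the append
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
```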
## Quick Start
### Prerequisites
```bash
# Required
docker --version
docker-compose --version

# Optional (for notebooks)
pip install uv
```

### Launch
```bash
# Clone and navigate
git clone <repository-url>
cd bdm3

# Start the complete stack
docker-compose up --build
```

### Access Points
- Streamlit App: http://localhost:8501 - Main interface for data exploration and pipeline management
- Airflow UI: http://localhost:8080 - Pipeline orchestration and monitoring
- MLflow UI: http://localhost:5001 - ML experiment tracking and model registry
### Usage
1. Upload datasets via the Streamlit Landing Zone interface
2. Trigger the `bcn_data_pipeline_with_validation` pipeline in Airflow
3. Monitor progress through the Airflow UI
4. Explore results in Streamlit dashboards and MLflow experiments
## Assignment Task Implementation
### A.1: Data Exploration & KPI Selection
- Location: `notebooks/a1.ipynb` + KPI documentation
- Deliverable: Selected 3 datasets (1 JSON + 2 CSV) and defined 10 comprehensive KPIs
- Validation: Interactive EDA dashboards in Streamlit
### A.2: Data Formatting Pipeline
- Location: `src/airflow/dags/pipelines/a2.py`
- Process: Raw data → Standardized Delta tables with partitioning (sketched below)
- Output: `data_zones/02_formatted/` with cleaned, typed data
- Validation: Schema enforcement and data quality checks
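A minimal sketch of an A.2-style formatting step, assuming Idealista-like column names (`propertyCode`, `price`, `size`, `district`); see `src/airflow/dags/pipelines/a2.py` for the actual job.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

# Raw landing JSON is typed, deduplicated, and written as a partitioned Delta table
raw = spark.read.option("multiLine", True).json("data_zones/01_landing/idealista")

formatted = (
    raw.withColumn("price", F.col("price").cast("double"))
    .withColumn("size", F.col("size").cast("double"))
    .dropDuplicates(["propertyCode"])  # assumed listing identifier
)

(
    formatted.write.format("delta")
    .mode("overwrite")
    .partitionBy("district")  # assumed partition column
    .save("data_zones/02_formatted/idealista")
)
```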
### A.3: Exploitation Zone Pipeline
- Location: `src/airflow/dags/pipelines/a3.py`
- Process: Formatted data → Analytics-ready datasets with aggregations
- Output: `data_zones/03_exploitation/` with 9 analytical tables
- Features: KPI calculations, cross-dataset joins, feature engineering (sketched below)
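An illustrative A.3-style transformation, joining formatted tables into an analytics-ready KPI table; the table and column names are assumptions, see `src/airflow/dags/pipelines/a3.py` for the real logic.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

listings = spark.read.format("delta").load("data_zones/02_formatted/idealista")
culture = spark.read.format("delta").load("data_zones/02_formatted/cultural_sites")

# Market supply per neighborhood, enriched with cultural-site counts
supply = listings.groupBy("neighborhood").agg(
    F.count("*").alias("n_listings"),
    F.avg("price").alias("avg_price"),
)
sites = culture.groupBy("neighborhood").agg(F.count("*").alias("n_cultural_sites"))

supply.join(sites, "neighborhood", "left").write.format("delta").mode(
    "overwrite"
).save("data_zones/03_exploitation/neighborhood_kpis")
```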
### A.4: Data Validation Pipeline
- Location: `src/airflow/dags/pipelines/a4.py`, as well as the A2 Validation Notebook and A3 Validation Notebook
- Process: Comprehensive data quality assessment and KPI validation
- Output: JSON reports for Streamlit consumption + quality metrics
- Features: Data integrity checks, performance metrics, recommendations (sketched below)
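A hedged sketch of an A.4-style check that derives simple integrity metrics and writes a JSON report for the Streamlit UI; the checked table, the thresholds, and the output path are assumptions.

```python
import json

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession, functions as F

spark = configure_spark_with_delta_pip(
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

df = spark.read.format("delta").load("data_zones/03_exploitation/neighborhood_kpis")

# Simple integrity metrics: volume, completeness, and value-range checks
report = {
    "table": "neighborhood_kpis",
    "row_count": df.count(),
    "null_avg_price": df.filter(F.col("avg_price").isNull()).count(),
    "negative_avg_price": df.filter(F.col("avg_price") < 0).count(),
}
report["passed"] = report["null_avg_price"] == 0 and report["negative_avg_price"] == 0

# Assumed output location consumed by the Streamlit validation pages
with open("outputs/validation_report.json", "w") as f:
    json.dump(report, f, indent=2)
```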
### B.1: Descriptive Analysis & Dashboarding
- Location: `src/streamlit_ui/` (see the sketch below)
- Features:
  - Interactive data exploration with quality metrics
  - Real-time pipeline monitoring and validation reports
  - Multi-zone data browsing (Landing/Formatted/Exploitation)
  - KPI dashboards with Barcelona housing market insights
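A minimal Streamlit sketch in the spirit of `src/streamlit_ui/app.py`: it loads an exploitation-zone table and renders one KPI view. The table name, its columns, and the naive parquet read (which ignores the Delta log) are assumptions.

```python
import pandas as pd
import streamlit as st

st.title("Barcelona Housing KPIs")

# Naive read of the Delta table's parquet files; fine for a sketch
df = pd.read_parquet("data_zones/03_exploitation/neighborhood_kpis")

hood = st.selectbox("Neighborhood", sorted(df["neighborhood"].unique()))
row = df[df["neighborhood"] == hood].iloc[0]

st.metric("Average listing price", f"{row['avg_price']:,.0f} EUR")
st.bar_chart(df.set_index("neighborhood")["avg_price"])
```

Run it with `streamlit run app.py` once the exploitation zone is populated.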
### B.2: Predictive Analysis & Model Management
- Location: `src/ml_experiments/house_price_prediction.py`
- Models: Linear Regression + Random Forest for house price prediction
- Features: MLflow experiment tracking, model registry, automatic deployment
- Integration: Airflow DAG for model training and serving pipeline (sketched below)
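A hedged sketch of the B.2 experiment loop: it trains the two model families named above and logs runs to MLflow. The tracking URI matches the Quick Start port; the feature table and its columns are assumptions.

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("house_price_prediction")

# Assumed feature table in the exploitation zone with a `price` target column
df = pd.read_parquet("data_zones/03_exploitation/listings_features")
X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for run_name, model in {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}.items():
    with mlflow.start_run(run_name=run_name):
        model.fit(X_train, y_train)
        mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, name="model")  # MLflow 3 `name` argument
```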
### Bonus: Orchestration Framework
- Location: `src/airflow/dags/airflow_orchestration.py`
- Features: Complete pipeline automation with dependencies, error handling, notifications (sketched below)
- Compatibility: Airflow 3.0+ with modern TaskFlow API and DAG versioning
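A hedged sketch of the orchestration pattern, using the Airflow 3 TaskFlow API. The DAG id matches the one in Quick Start, while the `run()` entry points on the pipeline modules are hypothetical.

```python
import pendulum
from airflow.decorators import dag, task

@dag(
    dag_id="bcn_data_pipeline_with_validation",
    schedule=None,  # triggered manually from the Airflow UI
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    catchup=False,
)
def bcn_pipeline():
    @task
    def format_data():
        from pipelines import a2  # hypothetical import of the A.2 job
        a2.run()

    @task
    def build_exploitation():
        from pipelines import a3  # hypothetical import of the A.3 job
        a3.run()

    @task
    def validate():
        from pipelines import a4  # hypothetical import of the A.4 job
        a4.run()

    # Linear dependency chain: format -> exploit -> validate
    format_data() >> build_exploitation() >> validate()

bcn_pipeline()
```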
## Project Structure
```text
bdm3/
├── 📁 data_zones/                    # Data Lake Implementation
│   ├── 01_landing/                   # A.2: Raw data ingestion
│   ├── 02_formatted/                 # A.3: Standardized Delta tables
│   └── 03_exploitation/              # A.4: Analytics-ready datasets
│
├── 📁 latex/                         # LaTeX sources for documentation
│   └── ...
│
├── 📁 notebooks/                     # A.1: Exploratory Analysis
│   ├── a1.ipynb                      # Data exploration & KPI selection
│   ├── a2.ipynb                      # Formatting pipeline development
│   └── a3.ipynb                      # Exploitation pipeline development
│
├── 🔧 src/                           # Application Implementation
│   ├── airflow/                      # Bonus: Pipeline Orchestration
│   │   ├── dags/
│   │   │   ├── pipelines/
│   │   │   │   ├── a2.py             # A.2: Data Formatting
│   │   │   │   ├── a3.py             # A.3: Data Exploitation
│   │   │   │   └── a4.py             # A.4: Data Validation
│   │   │   └── airflow_orchestration.py
│   │   └── Dockerfile
│   │
│   ├── ml_experiments/               # B.2: Predictive Analysis
│   │   └── house_price_prediction.py
│   │
│   ├── streamlit_ui/                 # B.1: Dashboarding
│   │   ├── app.py                    # Main Streamlit application
│   │   ├── sections/                 # Multi-page interface
│   │   └── Dockerfile
│   │
│   └── utils/                        # Shared utilities
│       └── eda_dashboard.py          # Interactive data exploration
│
├── 📁 outputs/                       # Pipeline Outputs
│   └── mlruns/                       # MLflow tracking data
│
├── 🐳 docker-compose.yml             # Service orchestration
├── 📄 L3-T01_submission.pdf          # Final submission document
├── 📄 main.tex                       # LaTeX code for final document
├── ⌨️ pyproject.toml                 # Python dependencies
├── 📖 README.md                      # This file
└── ⌨️ uv.lock                        # Dependency lockfile
```

## Development Setup
```bash
# Install uv for Python environment management
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and sync environment
uv sync
# Run notebooks locally
uv run jupyter lab notebooks/
```

```bash
# Run individual pipelines (requires local Spark setup)
python src/airflow/dags/pipelines/a2.py # Data Formatting
python src/airflow/dags/pipelines/a3.py # Data Exploitation
python src/airflow/dags/pipelines/a4.py  # Data Validation
```

## Key Features & Innovations
- Delta Lake 4.0 for ACID transactions and schema evolution
- Spark 4.0 for distributed data processing
- Airflow 3.0 with modern TaskFlow API and DAG versioning
- 10 carefully selected KPIs addressing housing affordability and urban equity
- Cross-dataset integration for neighborhood attractiveness scoring
- Real-time data quality monitoring with automated validation reports
- Automated model training with hyperparameter tracking
- Model registry with automatic deployment of best-performing models
- Experiment reproducibility through MLflow integration
- Airflow 3.0+ compatibility with asset-based scheduling
- Comprehensive error handling and notification system
- Pipeline dependency management with automatic validation
- Multi-page Streamlit application for data exploration and pipeline management
- Interactive data quality dashboards with real-time metrics
- Integrated tool access (Airflow/MLflow) through embedded interfaces
## Assignment Deliverables
✅ Three Python Scripts/Notebooks:
- `src/airflow/dags/pipelines/a2.py` (Data Formatting)
- `src/airflow/dags/pipelines/a3.py` (Data Exploitation)
- `src/airflow/dags/pipelines/a4.py` (Data Validation)

✅ PDF Documentation: Dataset Selection and KPI Definition.md
✅ Additional Implementations:
- Streamlit dashboarding application (A+B)
- MLflow model management pipeline (B.2)
- Complete Airflow orchestration (Bonus)
## Important Notes
💡 Data Requirements: The repository already contains the initial data in `data_zones/01_landing/` to facilitate running the pipelines.

⚠️ Service Dependencies: MLflow and Airflow services must be running for full functionality.

⚠️ Resource Usage: Spark jobs are configured for local execution; adjust memory settings in Docker if needed.







