End-to-end Data Analytics pipeline for analyzing, clustering, and forecasting residential construction costs in Europe (Eurostat data). Includes ETL, Feature Engineering, and Machine Learning models (Random Forest, K-Means).

saccolucax2/DA_Project

🏗️ European Residential Construction Cost Analysis

Python Scikit-Learn Eurostat Status

📋 Abstract

This project implements an end-to-end Data Analytics pipeline to analyze, cluster, and forecast the production costs of residential buildings in Europe. Using official Eurostat data (2000-2024), the study identifies economic trends, groups countries by similarity, and predicts future cost variations using Machine Learning techniques.

The workflow covers the entire data lifecycle: from automated ingestion and cleaning to feature engineering, supervised/unsupervised modeling, and bootstrap validation.

👥 Authors

  • Carmela Mandato
  • Giulia Di Biase
  • Luca Sacco
  • Simone Di Mario

📊 Dataset

The analysis is based on the Eurostat dataset:

  • Dataset Code: STS-COPI-A (Production in construction - annual data).
  • Target Variable: OBS_VALUE (Construction cost index, percentage change).
  • Scope: 27 EU countries + major aggregates (EA19, EA20, EU27).
  • Timeframe: 2000 - 2024.

⚙️ Project Pipeline

The project is structured into four main methodological blocks:

1. Data Ingestion & Preprocessing

  • ETL: Automated extraction from raw CSVs and mapping of geographical aggregates.
  • Cleaning: Filtering for COST indicators and PCH_SM units.
  • Outlier Detection: Interquartile Range (IQR) method, flagging values below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$ as anomalies (e.g., post-pandemic inflation shocks in Bulgaria/Estonia).
  • Imputation: Handling missing values via row-wise means for minor gaps.
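
The IQR rule above can be sketched in a few lines. This is a minimal illustration using only the standard library, not the project's actual implementation; the function names `iqr_bounds` and `flag_outliers`, the quartile interpolation method, and the sample values are assumptions made for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" treats the data as the full population of observations;
    # the quartile method used in the project is not stated, so this is assumed.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return a list of booleans, True where a value lies outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]

# Hypothetical yearly percentage changes for one country (not Eurostat data).
sample = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2, 14.5, 2.0]
flags = flag_outliers(sample)
outliers = [v for v, is_out in zip(sample, flags) if is_out]
print(outliers)
```

Flagging rather than dropping keeps the decision about what to do with an anomaly (keep, cap, or remove) separate from its detection.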

2. Feature Engineering

Transformed raw time-series data into predictive features to capture volatility and trends:

  • Rolling Statistics: Mean and Standard Deviation (3-year and 5-year windows).
  • Trends: pct_change_3 (3-year percentage change) and slope_3 (local regression slope).
  • Categorical Labels: Generated label_variation (Increase, Stable, Decrease) based on a $\pm 15\%$ threshold.
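
As a rough sketch of how these features could be derived for a single country's series, the snippet below computes 3-year rolling statistics, a 3-year percentage change, a local least-squares slope, and the three-way label. The helper names, the sample series, and the choice to apply the threshold to `pct_change_3` are assumptions for illustration; the README does not specify which quantity the ±15% threshold is applied to.

```python
import statistics

def rolling(values, window, func):
    """Apply func to each trailing window; None until the window is full."""
    return [
        func(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]

def slope(window_values):
    """Least-squares slope of the values against their positions 0..n-1."""
    n = len(window_values)
    x_mean = (n - 1) / 2
    y_mean = sum(window_values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(window_values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def pct_change(values, periods):
    """Percentage change relative to the value `periods` steps earlier."""
    out = []
    for i, current in enumerate(values):
        previous = values[i - periods] if i >= periods else None
        if previous is None or previous == 0:
            out.append(None)
        else:
            out.append((current - previous) / previous * 100)
    return out

def label_variation(change, threshold=15.0):
    """Map a percentage change to Increase / Stable / Decrease."""
    if change is None:
        return None
    if change > threshold:
        return "Increase"
    if change < -threshold:
        return "Decrease"
    return "Stable"

# Hypothetical series for one country (not Eurostat data).
series = [100.0, 102.0, 105.0, 110.0, 130.0, 128.0]
features = {
    "rolling_mean_3": rolling(series, 3, statistics.mean),
    "rolling_std_3": rolling(series, 3, statistics.stdev),
    "slope_3": rolling(series, 3, slope),
    "pct_change_3": pct_change(series, 3),
}
features["label_variation"] = [label_variation(c) for c in features["pct_change_3"]]
print(features["label_variation"])
```

The leading `None` values mark years where the window is not yet full; such rows would typically be dropped before modeling.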

3. Unsupervised Learning (Clustering)

Goal: Identify groups of countries with similar economic trajectories.

  • Dimensionality Reduction: PCA (Principal Component Analysis).
  • Algorithm: K-Means ($k=2$ selected via Silhouette Score of 0.53).
  • Outcome: Successfully separated stable economies (e.g., Germany, France) from volatile/emerging markets (e.g., Baltic states).
  • Note: DBSCAN was tested but discarded due to sensitivity to density variations in normalized data.
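
The selection of $k$ by silhouette score can be sketched as follows, assuming scikit-learn and NumPy are installed. The data here is synthetic (two well-separated groups standing in for a countries-by-features matrix), and the candidate range of $k$ and the number of PCA components are assumptions; the snippet shows the mechanics only and does not reproduce the reported score of 0.53.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a (countries x features) matrix; NOT project data.
stable = rng.normal(loc=0.0, scale=0.5, size=(15, 6))
volatile = rng.normal(loc=4.0, scale=0.5, size=(12, 6))
X = np.vstack([stable, volatile])

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Fit K-Means for several k and keep the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```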

4. Supervised Learning (Prediction)

  • Regression (Forecasting):
    • Model: Random Forest Regressor (Tuned).
    • Performance: RMSE = 2.5, MAE = 1.5.
    • Explained Variance: $R^2 \approx 0.65$ (approx. 65% of variance), outperforming the Decision Tree baseline.
  • Classification (Trend Direction):
    • Model: Random Forest Classifier.
    • Imbalance Handling: Applied SMOTE (Synthetic Minority Oversampling Technique) to the training set.
    • Performance: F1-weighted Score: 0.74.
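
For reference, the regression metrics quoted above (RMSE, MAE, and the share of variance explained, i.e. $R^2$) can be computed as in this small standard-library sketch. The function name `regression_metrics` and the toy values are illustrative assumptions, not project code or project results.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Toy values chosen so the arithmetic is easy to follow (not project data).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported together.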

📉 Key Results & Visualizations

Model Performance (Bootstrap Analysis)

To ensure robustness, models were validated using 200 bootstrap iterations:

  • Regression Stability: 95% CI for RMSE = [2.44, 2.73].
  • Classification Stability: 95% CI for F1-Score = [0.672, 0.765].
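
A percentile bootstrap of this kind can be sketched as below. The sketch assumes that held-out predictions are resampled with replacement (the README does not say whether models were also refit on each iteration), and the names `bootstrap_ci` and `rmse` as well as the synthetic data are assumptions made for illustration.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def bootstrap_ci(y_true, y_pred, metric, n_iter=200, alpha=0.05, seed=0):
    """Percentile confidence interval for a metric via resampling pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        # Draw n indices with replacement and score the resampled pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo_idx = round((alpha / 2) * (n_iter - 1))
    hi_idx = round((1 - alpha / 2) * (n_iter - 1))
    return stats[lo_idx], stats[hi_idx]

# Synthetic targets and noisy predictions (not project data).
data_rng = random.Random(42)
y_true = [data_rng.uniform(-5, 15) for _ in range(100)]
y_pred = [t + data_rng.gauss(0, 2.5) for t in y_true]

point = rmse(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, rmse, n_iter=200)
print(f"RMSE={point:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

The same function works for classification by passing an F1 function as `metric` in place of `rmse`.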

Feature Importance

The most influential predictors for construction costs were:

  1. pct_change_3: Medium-term momentum.
  2. rolling_std_3: Economic volatility/instability.
  3. OBS_VALUE: Current cost index.
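
A ranking like this is typically read from a fitted Random Forest's impurity-based importances. The sketch below assumes scikit-learn and NumPy are installed and uses a synthetic target deliberately constructed so that the first feature dominates; it demonstrates the mechanism only and says nothing about the project's actual importance values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["pct_change_3", "rolling_std_3", "OBS_VALUE"]

# Synthetic inputs and target; the weights 3.0 / 1.0 / 0.2 are arbitrary
# and chosen only so that the resulting ranking is unambiguous.
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 and can favor high-cardinality features, so permutation importance is a common cross-check.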

🚀 Usage

Prerequisites

  • Python 3.x
  • See requirements.txt for dependencies.

Installation

git clone https://github.com/saccolucax2/DA_Project.git
cd DA_Project
pip install -r requirements.txt

Running the Pipeline

To execute the full analysis pipeline (from raw data to model output):

python main.py
# Or launch the Jupyter Notebook for interactive analysis
jupyter notebook notebooks/analysis.ipynb

🔮 Future Developments

  • Integration of exogenous variables (energy prices, raw material indices).
  • Implementation of advanced time-series models (LSTM, Prophet).
  • Development of an interactive Streamlit dashboard.
