🏗️ European Residential Construction Cost Analysis

📋 Abstract

This project implements an end-to-end Data Analytics pipeline to analyze, cluster, and forecast the production costs of residential buildings in Europe. Using official Eurostat data (2000-2024), the study identifies economic trends, groups countries by similarity, and predicts future cost variations using Machine Learning techniques.

The workflow covers the entire data lifecycle: from automated ingestion and cleaning to feature engineering, supervised/unsupervised modeling, and bootstrap validation.

👥 Authors

Carmela Mandato
Giulia Di Biase
Luca Sacco
Simone Di Mario

📊 Dataset

The analysis is based on the Eurostat dataset:

Dataset Code: STS-COPI-A (Production in construction - annual data).
Target Variable: OBS_VALUE (Construction cost index, percentage change).
Scope: 27 EU countries + major aggregates (EA19, EA20, EU27).
Timeframe: 2000 - 2024.

⚙️ Project Pipeline

The project is structured into four main methodological blocks:

1. Data Ingestion & Preprocessing

ETL: Automated extraction from raw CSVs and mapping of geographical aggregates.
Cleaning: Filtering for COST indicators and PCH_SM units.
Outlier Detection: Applied the Interquartile Range (IQR) method, flagging values outside $[Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR]$, to identify anomalies (e.g., post-pandemic inflation shocks in Bulgaria/Estonia).
Imputation: Filled minor gaps in the data with row-wise means.
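
A minimal sketch of the outlier and imputation steps above, using only the Python standard library. The function names, the sample values, and the "inclusive" quartile method are assumptions made for illustration; the repository's own implementation may differ (for example, it may rely on pandas, whose quantile interpolation can give slightly different fences).

```python
from statistics import mean, quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]

def impute_row_mean(row):
    """Replace None entries with the mean of the observed values in the row."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing to average, leave the row untouched
    fill = mean(observed)
    return [fill if v is None else v for v in row]

# Hypothetical yearly percentage changes for one country (not real Eurostat data)
changes = [1.2, 0.8, 1.5, 2.1, 1.9, 1.1, 14.0, 1.7]
flags = flag_outliers(changes)          # only the 14.0 spike is flagged
filled = impute_row_mean([1.0, None, 3.0])
```

Here "row-wise" is read as "one row per country, one column per year", which is an assumption about the table layout.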

2. Feature Engineering

Transformed raw time-series data into predictive features to capture volatility and trends:

Rolling Statistics: Mean and Standard Deviation (3-year and 5-year windows).
Trends: pct_change_3 (3-year percentage change) and slope_3 (local regression slope).
Categorical Labels: Generated label_variation (Increase, Stable, Decrease) based on a $\pm 15\%$ threshold.
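
The features listed above could be computed roughly as follows. This is a standard-library sketch, not the project's code: the helper names, the `rolling_mean_3` column name, the sample series, and the choice to apply the $\pm 15\%$ threshold to `pct_change_3` are all assumptions.

```python
from statistics import mean, stdev

def rolling(values, window, fn):
    """Apply fn to each trailing window; None until the window is full."""
    return [
        fn(values[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(values))
    ]

def pct_change(values, periods):
    """Percentage change versus the value `periods` steps earlier."""
    out = []
    for i, current in enumerate(values):
        previous = values[i - periods] if i >= periods else None
        if previous is None or previous == 0:
            out.append(None)  # undefined without a non-zero reference value
        else:
            out.append((current - previous) / previous * 100)
    return out

def slope(window):
    """Least-squares slope of the window against time steps 0, 1, 2, ..."""
    x_bar = (len(window) - 1) / 2
    y_bar = mean(window)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(window))
    den = sum((x - x_bar) ** 2 for x in range(len(window)))
    return num / den

def label_variation(pct, threshold=15.0):
    """Map a percentage change to Increase / Stable / Decrease."""
    if pct is None:
        return None
    if pct > threshold:
        return "Increase"
    if pct < -threshold:
        return "Decrease"
    return "Stable"

# Hypothetical yearly values for a single country
series = [2.0, 4.0, 6.0, 8.0, 4.0]
features = {
    "rolling_mean_3": rolling(series, 3, mean),
    "rolling_std_3": rolling(series, 3, stdev),
    "slope_3": rolling(series, 3, slope),
    "pct_change_3": pct_change(series, 3),
}
features["label_variation"] = [label_variation(p) for p in features["pct_change_3"]]
```

The first entries of each feature are `None` because a 3-year window needs three observations; how the project treats these warm-up rows is not stated.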

3. Unsupervised Learning (Clustering)

Goal: Identify groups of countries with similar economic trajectories.

Dimensionality Reduction: PCA (Principal Component Analysis).
Algorithm: K-Means ($k=2$ selected via Silhouette Score of 0.53).
Outcome: Successfully separated stable economies (e.g., Germany, France) from volatile/emerging markets (e.g., Baltic states).
Note: DBSCAN was tested but discarded due to sensitivity to density variations in normalized data.
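
To illustrate how a silhouette score can drive the choice of $k$, here is a small pure-Python sketch of K-Means (Lloyd's algorithm) plus a silhouette computation. It is a stand-in for illustration only: the points are hypothetical 2-D coordinates (imagine the first two principal components), and the project itself presumably used a library implementation such as scikit-learn's `KMeans` and `silhouette_score`, which is an assumption.

```python
import math
import random
from statistics import mean

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            mean(math.dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

def best_k(points, candidates=(2, 3, 4)):
    """Return the candidate k with the highest silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical, well-separated groups (not the project's PCA output)
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
    (5.0, 5.0), (5.2, 5.1), (5.1, 4.8), (4.9, 5.2),
]
k, scores = best_k(points)
```

On real data the clusters overlap far more than in this toy example, which is consistent with the moderate score of 0.53 reported above.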

4. Supervised Learning (Prediction)

Regression (Forecasting):
- Model: Random Forest Regressor (Tuned).
- Performance: RMSE: 2.5, MAE: 1.5.
- Explained Variance: The model explains approx. 65% of the variance, outperforming the Decision Tree baseline.
Classification (Trend Direction):
- Model: Random Forest Classifier.
- Imbalance Handling: Applied SMOTE (Synthetic Minority Oversampling Technique) to the training set.
- Performance: F1-weighted Score: 0.74.
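
For reference, the metrics quoted above are assumed to follow their usual definitions, with $y_i$ the observed value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observed values, and $n$ the number of test observations. Reading "explains approx. 65% of variance" as $R^2 \approx 0.65$ is an interpretation, not something the source states explicitly.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

$$
F1_{\text{weighted}} = \sum_{c} \frac{n_c}{n}\cdot\frac{2\,P_c\,R_c}{P_c + R_c}
$$

Here $P_c$ and $R_c$ are the precision and recall for class $c$ (Increase, Stable, Decrease) and $n_c$ is the number of true instances of that class, so frequent classes weigh more in the average.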

📉 Key Results & Visualizations

Model Performance (Bootstrap Analysis)

To ensure robustness, models were validated using 200 bootstrap iterations:

Regression Stability: 95% CI for RMSE = [2.44, 2.73].
Classification Stability: 95% CI for F1-Score = [0.672, 0.765].
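
A percentile bootstrap interval of this kind can be sketched as below. The sketch assumes the bootstrap resamples held-out (true, predicted) pairs with replacement and recomputes the metric each time; the source does not say whether the models were also refit per iteration. Function names and sample values are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def bootstrap_ci(y_true, y_pred, metric, n_iter=200, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample (true, predicted) pairs with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical percentiles of the resampled metric
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi

# Hypothetical test-set values and predictions (not the project's results)
y_true = [1.0, 2.5, 3.0, 4.2, 5.1, 6.3, 2.2, 3.8]
y_pred = [1.4, 2.0, 3.3, 3.9, 5.8, 5.9, 2.9, 3.1]
ci_low, ci_high = bootstrap_ci(y_true, y_pred, rmse)
```

The same helper works for a classification metric: pass a function that computes the weighted F1 from true and predicted labels in place of `rmse`.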

Feature Importance

The most influential predictors for construction costs were:

pct_change_3: Medium-term momentum.
rolling_std_3: Economic volatility/instability.
OBS_VALUE: Current cost index.

🚀 Usage

Prerequisites

Python 3.x
See requirements.txt for dependencies.

Installation

git clone https://github.com/saccolucax2/DA_Project.git
cd DA_Project
pip install -r requirements.txt

Running the Pipeline

To execute the full analysis pipeline (from raw data to model output):

python main.py
# Or launch the Jupyter Notebook for interactive analysis
jupyter notebook notebooks/analysis.ipynb

🔮 Future Developments

Integration of exogenous variables (energy prices, raw material indices).
Implementation of advanced time-series models (LSTM, Prophet).
Development of an interactive Streamlit dashboard.
