This project implements an end-to-end Data Analytics pipeline to analyze, cluster, and forecast the production costs of residential buildings in Europe. Using official Eurostat data (2000-2024), the study identifies economic trends, groups countries by similarity, and predicts future cost variations using Machine Learning techniques.
The workflow covers the entire data lifecycle: from automated ingestion and cleaning to feature engineering, supervised/unsupervised modeling, and bootstrap validation.
- Carmela Mandato
- Giulia Di Biase
- Luca Sacco
- Simone Di Mario
The analysis is based on the Eurostat dataset:
- Source Code: `STS-COPI-A` (Production in construction - annual data).
- Target Variable: `OBS_VALUE` (Construction cost index, percentage change).
- Scope: 27 EU countries + major aggregates (EA19, EA20, EU27).
- Timeframe: 2000 - 2024.
The project is structured into four main methodological blocks:
- ETL: Automated extraction from raw CSVs and mapping of geographical aggregates.
- Cleaning: Filtering for `COST` indicators and `PCH_SM` units.
- Outlier Detection: Implemented Interquartile Range (IQR) method ($Q1 - 1.5 \times IQR$ / $Q3 + 1.5 \times IQR$) to identify anomalies (e.g., post-pandemic inflation shocks in Bulgaria/Estonia).
- Imputation: Handling missing values via row-wise means for minor gaps.
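
The cleaning, outlier-detection, and imputation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `indic_bt`, `unit`, `geo`, and `TIME_PERIOD` are assumed from the usual Eurostat CSV layout, the `OTHER` codes are placeholders, and every value in the sample frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format extract (all names except OBS_VALUE, COST and
# PCH_SM are assumptions; all values are invented for illustration).
raw = pd.DataFrame(
    {
        "indic_bt": ["COST"] * 9 + ["OTHER"],
        "unit": ["PCH_SM"] * 9 + ["OTHER"],
        "geo": ["DE"] * 3 + ["FR"] * 3 + ["BG"] * 3 + ["DE"],
        "TIME_PERIOD": [2019, 2020, 2021] * 3 + [2021],
        "OBS_VALUE": [2.0, 1.0, 3.0, 3.0, np.nan, 5.0, 1.5, 2.5, 20.0, 110.0],
    }
)

# Cleaning: keep only COST indicators measured in PCH_SM units.
costs = raw[(raw["indic_bt"] == "COST") & (raw["unit"] == "PCH_SM")]

# Reshape to one row per country and one column per year.
wide = costs.pivot(index="geo", columns="TIME_PERIOD", values="OBS_VALUE")

# Outlier detection: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# with quartiles computed over all non-missing observations.
q1, q3 = np.nanpercentile(wide.to_numpy(), [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (wide < lower) | (wide > upper)  # boolean frame, same shape as wide

# Imputation: fill each gap with the mean of the other values in its row.
filled = wide.apply(lambda row: row.fillna(row.mean()), axis=1)
```

In this toy frame only the 2021 value for `BG` is flagged, and the missing 2020 value for `FR` is filled with the mean of its two neighbours. Whether the real pipeline computes the quartiles globally or per country is not stated here, so the global version is just one plausible reading.
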
Transformed raw time-series data into predictive features to capture volatility and trends:
- Rolling Statistics: Mean and Standard Deviation (3-year and 5-year windows).
- Trends: `pct_change_3` (3-year percentage change) and `slope_3` (local regression slope).
- Categorical Labels: Generated `label_variation` (Increase, Stable, Decrease) based on a $\pm 15\%$ threshold.
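
A rough sketch of how these features could be derived with pandas is shown below. It assumes a long-format frame with `geo`, `TIME_PERIOD`, and `OBS_VALUE` columns, and it assumes the $\pm 15\%$ threshold is applied to `pct_change_3`; neither detail is confirmed by this README. The helper names and sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def window_slope(window: np.ndarray) -> float:
    """Least-squares slope of the window values against positions 0..n-1."""
    x = np.arange(len(window))
    return float(np.polyfit(x, window, 1)[0])


def add_features(df: pd.DataFrame, threshold: float = 15.0) -> pd.DataFrame:
    """Add rolling statistics, trend features and a categorical label per country."""
    df = df.sort_values(["geo", "TIME_PERIOD"]).copy()
    grouped = df.groupby("geo")["OBS_VALUE"]

    # Rolling mean and standard deviation over 3-year and 5-year windows.
    for w in (3, 5):
        df[f"rolling_mean_{w}"] = grouped.transform(lambda s, w=w: s.rolling(w).mean())
        df[f"rolling_std_{w}"] = grouped.transform(lambda s, w=w: s.rolling(w).std())

    # 3-year percentage change and local regression slope.
    df["pct_change_3"] = grouped.transform(lambda s: (s / s.shift(3) - 1) * 100)
    df["slope_3"] = grouped.transform(
        lambda s: s.rolling(3).apply(window_slope, raw=True)
    )

    # Label (assumption: thresholded on pct_change_3). Rows without enough
    # history have NaN here and fall through to "Stable" in this sketch.
    df["label_variation"] = np.select(
        [df["pct_change_3"] > threshold, df["pct_change_3"] < -threshold],
        ["Increase", "Decrease"],
        default="Stable",
    )
    return df


# Made-up sample: one rising series and one flat series.
sample = pd.DataFrame(
    {
        "geo": ["AA"] * 4 + ["BB"] * 4,
        "TIME_PERIOD": [2020, 2021, 2022, 2023] * 2,
        "OBS_VALUE": [100.0, 110.0, 120.0, 130.0, 100.0, 100.0, 100.0, 100.0],
    }
)
features = add_features(sample)
```

Grouping by country before rolling keeps one country's history from leaking into another's window, which matters once the rows of all countries are stacked in a single table.
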
Goal: Identify groups of countries with similar economic trajectories.
- Dimensionality Reduction: PCA (Principal Component Analysis).
- Algorithm: K-Means ($k=2$ selected via Silhouette Score of 0.53).
- Outcome: Successfully separated stable economies (e.g., Germany, France) from volatile/emerging markets (e.g., Baltic states).
- Note: DBSCAN was tested but discarded due to sensitivity to density variations in normalized data.
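
The PCA plus K-Means procedure, with $k$ chosen by Silhouette Score, might look something like the sketch below. It runs on synthetic data standing in for the per-country feature matrix; the number of components, the candidate range for $k$, and the standardization step are all assumptions rather than the project's documented settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 30 "countries" x 6 features,
# drawn from two well-separated groups (purely illustrative).
rng = np.random.default_rng(0)
stable = rng.normal(loc=0.0, scale=0.5, size=(15, 6))
volatile = rng.normal(loc=4.0, scale=0.5, size=(15, 6))
X = np.vstack([stable, volatile])

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Fit K-Means for several k and keep the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

# The k with the highest silhouette score is selected.
best_k = max(scores, key=scores.get)
```

On real data the silhouette curve is usually much flatter than in this toy case, so it is worth inspecting all of `scores` rather than only `best_k`.
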
- Regression (Forecasting):
  - Model: Random Forest Regressor (Tuned).
  - Performance: RMSE: 2.5, MAE: 1.5.
  - Explained Variance: The model explains approx. 65% of the variance ($R^2 \approx 0.65$), outperforming the Decision Tree baseline.
- Classification (Trend Direction):
  - Model: Random Forest Classifier.
  - Imbalance Handling: Applied SMOTE (Synthetic Minority Oversampling Technique) to the training set only.
  - Performance: F1-weighted Score: 0.74.
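
As an illustration of the regression comparison, the sketch below fits a Decision Tree baseline and a Random Forest on synthetic data and reports RMSE, MAE, and $R^2$ for each. The data, the random split, and the hyperparameters are assumptions made for the example; the project's tuned settings and validation split are not documented here, and the SMOTE step for the classifier is not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the engineered features.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)


def evaluate(model) -> dict:
    """Fit on the training split and score on the held-out split."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }


results = {
    "decision_tree": evaluate(DecisionTreeRegressor(random_state=42)),
    # n_estimators=200 is an arbitrary illustrative choice, not the tuned value.
    "random_forest": evaluate(
        RandomForestRegressor(n_estimators=200, random_state=42)
    ),
}
```

For panel data like this, a chronological split (train on earlier years, test on later ones) is often a safer choice than a random split, since it avoids scoring the model on years it has effectively already seen.
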
To ensure robustness, models were validated using 200 bootstrap iterations:
- Regression Stability: 95% CI for RMSE = `[2.44, 2.73]`.
- Classification Stability: 95% CI for F1-Score = `[0.672, 0.765]`.
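
One common way to obtain intervals of this kind is a percentile bootstrap over the test-set predictions, sketched below. Whether the project resamples predictions or refits the model in each iteration is not stated, so this version (resampling only) is an assumption, and `bootstrap_ci` and `rmse` are hypothetical helper names.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def bootstrap_ci(y_true, y_pred, metric, n_iter=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric.

    Resamples (y_true, y_pred) pairs with replacement n_iter times and
    returns the alpha/2 and 1 - alpha/2 percentiles of the metric values.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    stats = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.integers(0, n, size=n)  # indices sampled with replacement
        stats[i] = metric(y_true[idx], y_pred[idx])
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)


# Synthetic targets and noisy predictions, for illustration only.
rng = np.random.default_rng(1)
y_true = rng.normal(loc=0.0, scale=5.0, size=300)
y_pred = y_true + rng.normal(loc=0.0, scale=2.5, size=300)

ci_low, ci_high = bootstrap_ci(y_true, y_pred, rmse, n_iter=200)
```

The same function works for the classifier by passing a weighted F1 function as `metric`, since it only requires a callable that takes the true and predicted arrays.
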
The most influential predictors for construction costs were:
- `pct_change_3`: Medium-term momentum.
- `rolling_std_3`: Economic volatility/instability.
- `OBS_VALUE`: Current cost index.
- Python 3.x
- See `requirements.txt` for dependencies.
```bash
git clone https://github.com/saccolucax2/DA_Project.git
cd DA_Project
pip install -r requirements.txt
```

To execute the full analysis pipeline (from raw data to model output):

```bash
python main.py
# Or launch the Jupyter Notebook for interactive analysis
jupyter notebook notebooks/analysis.ipynb
```

- Integration of exogenous variables (energy prices, raw material indices).
- Implementation of advanced time-series models (LSTM, Prophet).
- Development of an interactive Streamlit dashboard.