
Exploring how demographic data can be used to predict household income using machine learning


HrStamm/IncomePrediction_ML


💡 Predicting Household Income with Machine Learning

This repository contains the code and reports for Projects 1 and 2 in 02450 Machine Learning at DTU (Spring 2025).

Author:

  • Valdemar Stamm Kristensen (s244742)

Study line: Artificial Intelligence and Data


📌 What's this project about?

Across two connected projects, we explored how machine learning can be applied to predict annual household income using demographic data.

  • Project 1: Focused on data preprocessing and exploration – handling missing values, standardizing ordinal features, one-hot encoding categorical variables, and visualizing feature relationships (incl. PCA for structure).
  • Project 2: Built on this foundation with supervised learning models for both regression and classification, combined with robust evaluation and statistical testing.
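
The three preprocessing steps named for Project 1 can be sketched without any libraries so that each transformation is visible. This is a minimal illustration, not the project's code: the column names and values are hypothetical, and in practice scikit-learn's `SimpleImputer(strategy="most_frequent")`, `StandardScaler`, and `OneHotEncoder` cover the same ground.

```python
from collections import Counter
from statistics import mean, pstdev


def impute_mode(column):
    """Replace None with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def zscore(column):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def one_hot(column):
    """Return the sorted category list and one 0/1 row per observation."""
    categories = sorted(set(column))
    rows = [[1 if v == c else 0 for c in categories] for v in column]
    return categories, rows


# Hypothetical toy columns, not taken from the project's dataset.
education = [1, 3, None, 3, 5]  # ordinal, one missing value
marital = ["single", "married", None, "married", "married"]  # categorical

education_z = zscore(impute_mode(education))
marital_categories, marital_onehot = one_hot(impute_mode(marital))
```

Ordinal features keep their order and are only rescaled, while categorical features are expanded into one indicator column per category so that no artificial ordering is implied.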

⚠️ Dataset limitations

A key limitation of the dataset is that the target variable, annual household income, was only provided in ordinal categories rather than as continuous values.

To make regression possible, each bracket was mapped to a representative numeric value. While this made model training feasible, it also introduced constraints:

  • The target became artificially continuous, based on our mapping rather than observed income values.
  • Predictions and coefficients cannot be interpreted as exact economic relationships.
  • At the higher end, the mapping caused a ceiling effect, collapsing very different income levels into one value.

This transformation was necessary, but it reduces the interpretability and generalizability of the results.
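
A small sketch makes the ceiling effect concrete. The bracket bounds below are hypothetical (this README does not list the real ones), and using the midpoint as the representative value is an assumption made for illustration.

```python
# Hypothetical income brackets in thousands; None marks an open-ended bound.
BRACKETS = {
    1: (0, 10),
    2: (10, 20),
    3: (20, 40),
    4: (40, 75),
    5: (75, None),  # top bracket has no upper bound
}
TOP_BRACKET_VALUE = 75  # assumed stand-in for the open-ended bracket


def representative_value(code):
    """Map a bracket code to a single number (midpoint where one exists)."""
    low, high = BRACKETS[code]
    if high is None:
        return float(TOP_BRACKET_VALUE)
    return (low + high) / 2


codes = [1, 3, 5, 5]
targets = [representative_value(c) for c in codes]
# Both households in bracket 5 receive the same target, whatever they
# actually earn above the lower bound. That is the ceiling effect.
```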


🧠 How we did it

  • Preprocessing: Mode imputation, z-score normalization of ordinal features, one-hot encoding of categorical features, and mapping of income brackets.
  • Exploration (Project 1): Feature analysis, PCA, covariance and correlation inspection.
  • Modeling (Project 2):
    • Regression: Linear Regression, Ridge Regression, Artificial Neural Network (ANN).
    • Classification: Logistic Regression, K-Nearest Neighbors (KNN), and baseline models.
  • Evaluation: Nested 10-fold cross-validation for hyperparameter tuning and performance estimation.
  • Statistical testing: Paired t-tests to check whether performance differences between models were statistically significant.
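
The nested cross-validation step can be sketched as follows. To keep the example library-free it uses a one-feature ridge regression without an intercept on synthetic data; the function names, the lambda grid, and the data are all assumptions for illustration, not the project's implementation.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(xs, ys, w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_folds(indices, k):
    """Split a list of indices into k (train, test) pairs."""
    folds = [indices[i::k] for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]


def nested_cv(xs, ys, lambdas, k_outer=10, k_inner=10, seed=0):
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    outer_errors = []
    for train, test in k_folds(indices, k_outer):
        # Inner loop: choose lambda using the outer training split only.
        def inner_error(lam):
            errs = []
            for tr, val in k_folds(train, k_inner):
                w = fit_ridge_1d([xs[i] for i in tr], [ys[i] for i in tr], lam)
                errs.append(mse([xs[i] for i in val], [ys[i] for i in val], w))
            return sum(errs) / len(errs)

        best_lam = min(lambdas, key=inner_error)
        # Outer loop: refit on the whole outer training split, then score
        # on data that played no part in choosing lambda.
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], best_lam)
        outer_errors.append(mse([xs[i] for i in test], [ys[i] for i in test], w))
    return outer_errors


# Synthetic data for illustration only: y = 2x + noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

fold_mse = nested_cv(xs, ys, lambdas=[0.01, 0.1, 1.0, 10.0])
estimate = sum(fold_mse) / len(fold_mse)
```

The point of nesting is that the outer test folds are never used for tuning, so the averaged outer error is not biased by the hyperparameter search.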

📊 What we found

  • Regression:
    • ANN achieved the lowest average MSE (≈0.53).
    • Ridge Regression followed closely (≈0.55).
    • Both clearly outperformed the baseline (≈1.0).
  • Classification:
    • Logistic Regression had the lowest error rate (≈0.66).
    • KNN was close (≈0.68).
    • Both clearly outperformed the baseline (≈0.81).
  • Statistical tests: Paired t-tests confirmed the orderings ANN > Ridge > Baseline for regression and Logistic Regression > KNN > Baseline for classification, where ">" means lower error.
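
As a rough illustration of such a comparison, the sketch below computes a paired t statistic from per-fold errors of two models. The fold values are hypothetical and are not the project's results. The critical value is hard-coded for 10 folds; `scipy.stats.ttest_rel` would return a p-value directly. Because cross-validation folds share training data, the independence assumption behind the test holds only approximately.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """t statistic for the mean of the per-fold differences a - b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical per-fold MSE values, not the project's actual results.
ridge_mse = [0.56, 0.54, 0.57, 0.53, 0.55, 0.58, 0.54, 0.56, 0.55, 0.52]
ann_mse = [0.53, 0.53, 0.54, 0.52, 0.53, 0.55, 0.52, 0.54, 0.53, 0.51]

t = paired_t_statistic(ridge_mse, ann_mse)
T_CRIT = 2.262  # two-sided 5% critical value, Student t, 9 degrees of freedom
significant = abs(t) > T_CRIT
```

A positive `t` above the critical value indicates that the first model's error is higher than the second's by more than fold-to-fold variation would explain.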


📂 What's inside the repo

  • Code for both Projects.ipynb → Full notebook (covers Project 1 preprocessing + Project 2 modeling, results, and plots).
  • Project 2 - code.py → Standalone Python script of Project 2 code.
  • corrected_data1.csv → Dataset after preprocessing.
  • marketing.info.txt → Information about the dataset used.

🔮 Reflections & Future Ideas

  • Regression: ANN performed best, but Ridge offered stable results.
  • Classification: Logistic Regression was most effective.
  • The dataset's main limitation, an ordinal target mapped to numbers, is the biggest bottleneck.
  • A larger dataset with income recorded as continuous values would likely improve results.
  • Future work could include ensemble methods, ordinal regression techniques, and validation on external datasets.

📖 References

  • Introduction to Statistics at DTU (Brockhoff et al., 2024)
  • Introduction to Machine Learning and Data Mining (Herlau, Schmidt & Mørup, DTU, 2023)
  • scikit-learn documentation
