🎵 Music Streaming Churn Prediction

Authors: Me & Sila (https://github.com/silabou)
Context: Python for Data Science class @ École Polytechnique

🚀 Project Overview

This repository contains a complete Machine Learning pipeline designed to predict user churn for a music streaming service. Using raw user activity logs (listening history, page visits, errors, etc.), we engineered time-series features to classify whether a user is likely to cancel their subscription within a 10-day window.

The goal was to build a robust model capable of handling class imbalance and maximizing the F1-Score on the leaderboard.

🛠 Key Features

  • Advanced Feature Engineering (a sketch follows this list):

    • Multi-Scale Rolling Windows: Calculated 3d, 7d, 14d, and 30d moving statistics (sums, means) to capture short-term vs. long-term behavior.
    • Velocity & Trend Features: Engineered ratio features (e.g., activity in the last 3 days / activity in the last 14 days) to explicitly model "slowing down" behavior.
    • Interaction Ratios: Created "Frustration" (errors per hour) and "Engagement" (songs per session) indices.
  • Data Augmentation via "Snapshot Stacking":

    • Instead of a single row per user, we implemented a Sliding Window Snapshot strategy.
    • We generated training samples every 2 days (from Oct 7 to Nov 1). This multiplied our training set size and allowed the model to learn the evolution of a user's journey, effectively turning a static classification problem into a temporal one.
  • Robust Modeling Pipeline:

    • Feature Selection: Used SelectFromModel with a base XGBoost estimator to prune noise and retain only the top predictive features (sketched after this list).
    • Algorithm: A single XGBoost classifier per fold, optimized with the hist tree method for speed.
    • Grouped Cross-Validation: Trained a 5-fold ensemble (grouped by UserID to prevent leakage) and averaged the predictions to reduce variance (see the ensemble sketch in the modeling-strategy section below).
  • Automated Tuning: Used Optuna for Bayesian optimization of the XGBoost hyperparameters (learning rate, depth, L1/L2 regularization), optimizing specifically for AUC. A minimal tuning sketch is included below.
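The raw log schema isn't documented in this README, so the column names in this sketch (user_id, ts, song_played, error, session_id) and the exact ratio denominators are illustrative assumptions rather than the project's actual code:

```python
import pandas as pd

def build_features(logs: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Per-user activity stats over trailing windows ending at `snapshot`."""
    feats = {}
    for days in (3, 7, 14, 30):
        window = logs[(logs["ts"] > snapshot - pd.Timedelta(days=days))
                      & (logs["ts"] <= snapshot)]
        grp = window.groupby("user_id")
        feats[f"songs_{days}d"] = grp["song_played"].sum()
        feats[f"errors_{days}d"] = grp["error"].sum()
        feats[f"sessions_{days}d"] = grp["session_id"].nunique()
    X = pd.DataFrame(feats).fillna(0)

    # Velocity: short-term vs. long-term activity ratio ("slowing down").
    X["velocity_3_14"] = X["songs_3d"] / (X["songs_14d"] + 1)
    # Frustration: errors relative to listening volume (+1 for smoothing).
    X["frustration_30d"] = X["errors_30d"] / (X["songs_30d"] + 1)
    # Engagement: songs per session.
    X["engagement_30d"] = X["songs_30d"] / (X["sessions_30d"] + 1)
    return X
```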
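Feature selection with SelectFromModel could then look like the following; the threshold="median" cutoff and the base model's settings are assumptions, since the README only says the top predictive features were kept (X and y are the engineered feature matrix and labels):

```python
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel

# Fit a base model, then keep only features whose importance clears the cutoff.
base = xgb.XGBClassifier(tree_method="hist", n_estimators=200)
selector = SelectFromModel(base, threshold="median").fit(X, y)

X_selected = selector.transform(X)
kept_features = X.columns[selector.get_support()]  # names of retained columns
```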
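Finally, a minimal Optuna study over the hyperparameters named above, again with assumed search ranges and trial count (groups holds each row's user ID):

```python
import optuna
import xgboost as xgb
from sklearn.model_selection import GroupKFold, cross_val_score

def objective(trial):
    params = {
        "tree_method": "hist",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),    # L1
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),  # L2
    }
    model = xgb.XGBClassifier(**params, n_estimators=500)
    # Group by user so snapshots of one user never span train and validation.
    scores = cross_val_score(model, X, y, groups=groups,
                             cv=GroupKFold(n_splits=5), scoring="roc_auc")
    return scores.mean()

study = optuna.create_study(direction="maximize")  # maximize AUC
study.optimize(objective, n_trials=50)
```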

📉 Iterative Modeling Strategy

We approached this problem iteratively to improve performance and stability:

  1. Baseline & Feature Engineering:

    • We started by aggregating logs per user. However, static aggregates failed to capture how quickly a user's activity was declining.
    • Pivot: We introduced "Velocity" features (ratios of short-term vs. long-term windows) to detect sudden drops in activity.
  2. Addressing Data Scarcity (The "Snapshot" Shift):

    • Problem: With a limited number of unique users, a simple "one row per user" model was overfitting and lacked sufficient training examples.
    • Solution: We moved to a Stacked Snapshot Dataset. By taking a snapshot of every user every 2 days and predicting churn over the following 10 days, we increased our dataset size by ~13x (sketched after this list). This helped the model distinguish between a "safe" period and a "risk" period for the same user.
  3. Leakage Prevention:

    • Strictly separated feature calculation (history) from the target window.
    • Used GroupKFold during validation so that the same user (appearing in multiple snapshots) never landed in both the Train and Validation sets simultaneously (see the ensemble sketch after this list).
  4. Final Calibration (Dynamic Thresholding):

    • Instead of a standard 0.5 threshold, we implemented a Target-Rate Calibration (sketched after this list).
    • The final submission dynamically selects the probability threshold that forces the predicted churn rate to match the expected population churn rate (~40%), ensuring the predictions are aligned with the business reality.
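A sketch of the snapshot stacking from step 2, reusing the hypothetical build_features helper from above; cancel_dates (a per-user Series of cancellation timestamps) and the calendar year are illustrative:

```python
import pandas as pd

snapshot_dates = pd.date_range("2018-10-07", "2018-11-01", freq="2D")  # 13 snapshots

frames = []
for snap in snapshot_dates:
    # Features use only history up to the snapshot -- never the target window.
    X = build_features(logs[logs["ts"] <= snap], snap)
    # Label: does the user cancel within the next 10 days?
    horizon_end = snap + pd.Timedelta(days=10)
    churn_in_window = (cancel_dates > snap) & (cancel_dates <= horizon_end)
    X["label"] = churn_in_window.reindex(X.index, fill_value=False).astype(int)
    X["snapshot"] = snap
    frames.append(X.reset_index())

train = pd.concat(frames, ignore_index=True)  # ~13x more rows than unique users
```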
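The grouped 5-fold ensemble from step 3 might look as follows; n_estimators and the early-stopping setting are placeholders, and X, y, user_ids, and X_test are assumed to be prepared from the stacked snapshots:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupKFold

test_preds = np.zeros(len(X_test))
cv = GroupKFold(n_splits=5)

# Split on user_id so every snapshot of a given user stays in one fold.
for train_idx, valid_idx in cv.split(X, y, groups=user_ids):
    # xgboost >= 1.6 accepts early_stopping_rounds in the constructor.
    model = xgb.XGBClassifier(tree_method="hist", n_estimators=1000,
                              early_stopping_rounds=50, eval_metric="auc")
    model.fit(X.iloc[train_idx], y.iloc[train_idx],
              eval_set=[(X.iloc[valid_idx], y.iloc[valid_idx])], verbose=False)
    # Average the five fold models' probabilities to reduce variance.
    test_preds += model.predict_proba(X_test)[:, 1] / cv.get_n_splits()
```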
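And the target-rate calibration from step 4 reduces to a quantile: the (1 − target_rate) quantile of the ensemble's probabilities leaves roughly the desired share of users above the cutoff (test_preds comes from the fold loop above):

```python
import numpy as np

def calibrate_threshold(probs: np.ndarray, target_rate: float = 0.40) -> float:
    """Cutoff such that roughly `target_rate` of predictions fall above it."""
    # The (1 - target_rate) quantile leaves ~target_rate of the scores above it.
    return float(np.quantile(probs, 1.0 - target_rate))

threshold = calibrate_threshold(test_preds)
binary_preds = (test_preds >= threshold).astype(int)  # ~40% predicted churners
```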

📊 Results

The pipeline generates comprehensive performance visualizations:

  1. ROC & Precision-Recall Curves: To evaluate the trade-off between true positives and false positives (and, for the PR curve, between precision and recall under class imbalance).
  2. Cumulative Gains Curve (the "Banana" plot): An explicit visualization of how much better the model is than random guessing (e.g., "the top 40% of predictions capture X% of churners"). A plotting sketch follows this list.
  3. Submission Output: A probability file (submission_final_binary.csv) generated by the 5-fold ensemble.
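A compact recipe for the gains curve mentioned in item 2, assuming held-out fold labels y_valid and scores valid_preds:

```python
import numpy as np
import matplotlib.pyplot as plt

def cumulative_gains(y_true, y_score):
    """Fraction of all churners captured among the top-k% highest-scoring users."""
    order = np.argsort(y_score)[::-1]  # rank users by predicted risk, descending
    gains = np.cumsum(np.asarray(y_true)[order]) / np.sum(y_true)
    pct_targeted = np.arange(1, len(y_true) + 1) / len(y_true)
    return pct_targeted, gains

pct, gains = cumulative_gains(y_valid, valid_preds)
plt.plot(pct, gains, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random")
plt.xlabel("Fraction of users targeted")
plt.ylabel("Fraction of churners captured")
plt.legend()
plt.show()
```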

This project is part of the academic coursework at École Polytechnique.
