
Structural Break Detection

Disclaimer: For academic and educational use only; this is not financial or trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.

Note: These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.

A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO.


Documentation: For detailed documentation with interactive navigation, visit the GitHub Pages site.

Key Finding: Single-Dataset Benchmarks Are Misleading

We evaluated all 25 detectors on two independent datasets. The results reveal a critical insight:

Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.

This demonstrates that single-dataset performance is unreliable. Our evaluation emphasizes cross-dataset generalization using stability metrics.

Cross-Dataset Results Summary

Top 10 Models by Robust Score

The Stability Score measures cross-dataset consistency:

Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)

The Robust Score combines worst-case performance with stability:

Robust Score = Min(AUC_A, AUC_B) × Stability Score

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |

The Overfitting Problem

These models showed strong Dataset A performance but failed to generalize:

| Model | Dataset A Rank | Dataset B Rank | AUC Drop | Stability |
|-------|----------------|----------------|----------|-----------|
| gradient_boost_comprehensive | #1 (0.7930) | #6 (0.6533) | -17.6% | 82.4% |
| meta_stacking_7models | #2 (0.7662) | #10 (0.6422) | -16.2% | 83.8% |
| knn_spectral_fft | #15 (0.5793) | #23 (0.4808) | -17.0% | 83.0% |
| hypothesis_testing_pure | #20 (0.5394) | #25 (0.4118) | -23.7% | 76.3% |

Lesson: High single-dataset AUC does not guarantee real-world performance. Always validate on multiple datasets.

Stability Score Methodology

The Stability Score measures how consistently a model performs across datasets:

Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
  • 100%: Identical performance on both datasets
  • < 85%: Significant overfitting concern
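
As a quick illustration, here is a minimal Python sketch of how these two scores can be computed; the function names are illustrative and not taken from the repository:

def stability_score(auc_a, auc_b):
    # Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
    return 1.0 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a, auc_b):
    # Robust Score = min(AUC_A, AUC_B) * Stability
    return min(auc_a, auc_b) * stability_score(auc_a, auc_b)

# Reproduces the reported numbers for xgb_tuned_regularization:
# stability is about 0.963 and the robust score is about 0.715
print(stability_score(0.7423, 0.7705), robust_score(0.7423, 0.7705))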

Most Stable Models

| Model | Stability | Interpretation |
|-------|-----------|----------------|
| xgb_selective_spectral | 99.7% | Near-identical performance |
| weighted_dynamic_ensemble | 98.4% | Excellent generalization |
| quad_model_ensemble | 98.0% | Excellent generalization |
| xgb_core_7features | 98.0% | Excellent generalization |
| xgb_70_statistical | 97.1% | Strong generalization |
| xgb_tuned_regularization | 96.3% | Strong generalization |

Least Stable Models (Overfitters)

| Model | Stability | Warning |
|-------|-----------|---------|
| hypothesis_testing_pure | 76.3% | Severe instability |
| welch_ttest | 71.9% | Severe instability |
| gradient_boost_comprehensive | 82.4% | Overfit to Dataset A |
| knn_spectral_fft | 83.0% | Overfit to Dataset A |
| meta_stacking_7models | 83.8% | Overfit to Dataset A |

Top Performer in Local Benchmarks: xgb_tuned_regularization

xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:

| Metric | xgb_tuned_regularization | gradient_boost (former #1) | meta_stacking (former #2) |
|--------|--------------------------|----------------------------|---------------------------|
| Dataset A AUC | 0.7423 | 0.7930 | 0.7662 |
| Dataset B AUC | 0.7705 | 0.6533 | 0.6422 |
| Min AUC | 0.7423 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
| Train Time | 60-185s | 179-451s | 332-32,030s |

Why xgb_tuned_regularization Is the Top Performer

  1. Best Robust Score (0.715) — Highest combined performance and stability
  2. Actually improved on Dataset B — 0.7423 → 0.7705 (+3.8%)
  3. High Stability (96.3%) — Consistent across different data
  4. Fast Training — 60-185s vs hours for complex ensembles
  5. Strong Regularization — L1/L2 prevents overfitting

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=5,           # Shallow trees
    learning_rate=0.05,
    reg_alpha=0.5,         # Strong L1 regularization
    reg_lambda=2.0,        # Strong L2 regularization
    min_child_weight=10    # Larger minimum leaf weight
)
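
For context, the following sketch shows how a classifier with this configuration is typically trained and scored. The synthetic features below are only a stand-in for the engineered segment features; the repository's actual training pipeline lives in each detector's main.py.

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered segment-difference features (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model.fit(X_train, y_train)                # model configured as in the block above
scores = model.predict_proba(X_val)[:, 1]  # probability of a structural break
print("Validation ROC AUC:", roc_auc_score(y_val, scores))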

Key Insights

What Works

  1. Regularization is Critical: Models with aggressive regularization generalize better. xgb_tuned_regularization uses strong L1/L2 penalties.

  2. Simpler Models Generalize Better: xgb_core_7features (7 features) has 98.0% stability vs meta_stacking_7models (339 features) at 83.8%.

  3. Feature Engineering Matters: Statistical features (KS statistic, Cohen's d, t-test) consistently outperform raw time series inputs.

  4. Ensemble Diversity: Combining different model types (trees + neural nets) works, but keep it simple.
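
As a hedged illustration of such a simple blend (in the spirit of mlp_xgb_simple_blend, though the repository's actual weights, features, and hyperparameters may differ), a soft-voting combination of an XGBoost and an MLP classifier could look like this:

from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Hypothetical two-model blend: average the predicted break probabilities
blend = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.05)),
        ("mlp", make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500))),
    ],
    voting="soft",   # average probabilities instead of hard votes
)
# blend.fit(X_features, y_labels) on engineered segment features (placeholder names),
# then blend.predict_proba(X_new)[:, 1] gives the break probability.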

What Doesn't Work

  1. Complex Ensembles Overfit: meta_stacking_7models (7 models, 32,000s training) dropped from #2 to #10.

  2. Deep Learning Fails on Univariate Data: Transformer (0.49-0.54 AUC) and LSTM (0.50-0.52 AUC) remain near-random on BOTH datasets.

  3. RL Approaches Underperform: Q-learning and DQN (0.46-0.55 AUC) don't compete with supervised learning.

  4. Pure Statistical Tests Are Unstable: hypothesis_testing_pure dropped from 0.54 to 0.41 AUC.

Top Performers by Category

| Category | Model | Notes |
|----------|-------|-------|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |

Full Results

All 25 Detectors (Sorted by Robust Score)

| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|------|----------|-----------|-----------|---------|-----------|--------------|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
| 11 | segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 |
| 12 | meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 |
| 13 | gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 |
| 14 | bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 |
| 15 | wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 |
| 16 | qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 |
| 17 | dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 |
| 18 | kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 |
| 19 | qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 |
| 20 | hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 |
| 21 | qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 |
| 22 | knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 |
| 23 | knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 |
| 24 | welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 |
| 25 | hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 |

Class Imbalance & Cost Considerations

The Rare Event Problem

Structural breaks are inherently rare events. This creates fundamental model bias:

  • Models can achieve ~70% accuracy by predicting "no break" for everything
  • Several models (hierarchical_transformer, wavelet_lstm, welch_ttest) exhibit this behavior with 0% recall

Cost Asymmetry

Not all errors are equal in structural break detection:

| Error Type | Description | Cost |
|------------|-------------|------|
| False Negative (FN) | Missing a real break | Moderate — missed opportunity to act on a regime change |
| False Positive (FP) | Predicting a break when none exists | Severe — triggers unnecessary position changes, transaction fees, slippage |

This asymmetry means we should prioritize precision alongside recall, making F1 score a critical metric.
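
To make the precision/recall trade-off concrete, here is a small, self-contained sketch (with made-up labels and scores) showing how raising the decision threshold trades recall for precision and reduces costly false positives:

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative predictions on an imbalanced validation set (placeholder values)
y_true   = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.2, 0.1, 0.4, 0.8, 0.3, 0.2])

threshold = 0.5                                # raise to favor precision over recall
y_pred = (y_scores >= threshold).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))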

Why Deep Learning & RL Models Underperformed

The hierarchical_transformer (0.49-0.54 AUC), wavelet_lstm (0.50-0.52 AUC), and RL models (0.46-0.55 AUC) all underperformed tree-based ensembles on BOTH datasets.

The Core Problem: Univariate Features Are Insufficient

These architectures are designed to learn relationships between multiple input variables. With only a univariate time series and its derived features, they lack the rich input space needed to learn meaningful patterns.

What Would Help: Adding exogenous variables (correlated assets, macroeconomic indicators, sentiment data, volume) would provide the multi-dimensional context these architectures need.

Key Insight: Tree-based ensembles excel at learning from handcrafted statistical features that explicitly encode distributional differences. Deep learning needs raw, multi-dimensional input to learn such representations.

Feature Engineering Methodology

Segment-Based Features

Each time series is split at a boundary into pre-segment and post-segment. Features capture differences between these segments:

Pre-segment   |  Post-segment
--------------+---------------
 values[0:T]  |  values[T:end]

Feature Categories

| Category | Description | Example Features |
|----------|-------------|------------------|
| Moments | Statistical moments per segment | mean_diff, std_ratio, skew_diff, kurt_diff |
| Effect Sizes | Standardized differences | Cohen's d, Glass's delta, Hedges' g |
| Distribution Tests | Hypothesis test statistics | t_statistic, ks_statistic, mann_whitney_u |
| Quantiles | Percentile comparisons | median_diff, iqr_ratio, q25_diff, q75_diff |
| Spectral | Frequency domain | dominant_freq_diff, spectral_centroid_diff |
| Wavelet | Multi-scale decomposition | dwt_energy_ratio, wavelet_entropy_diff |
| Temporal | Time-dependent patterns | acf_diff, trend_diff, volatility_ratio |

Most Discriminative Features (by importance)

  1. ks_statistic - Kolmogorov-Smirnov test statistic
  2. mean_diff_normalized - Normalized mean difference
  3. std_ratio - Standard deviation ratio
  4. cohens_d - Cohen's effect size
  5. mann_whitney_u - Mann-Whitney U statistic
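
For illustration, here is a minimal, self-contained sketch of how a few of these segment-difference features can be computed with NumPy and SciPy. The exact definitions used in each detector's features.py may differ:

import numpy as np
from scipy import stats

def segment_features(values, boundary):
    # Compare the pre-segment values[:boundary] with the post-segment values[boundary:]
    pre, post = values[:boundary], values[boundary:]
    pooled_std = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    return {
        "mean_diff_normalized": (post.mean() - pre.mean()) / (values.std(ddof=1) + 1e-12),
        "std_ratio": post.std(ddof=1) / (pre.std(ddof=1) + 1e-12),
        "cohens_d": (post.mean() - pre.mean()) / (pooled_std + 1e-12),
        "ks_statistic": stats.ks_2samp(pre, post).statistic,
        "mann_whitney_u": stats.mannwhitneyu(pre, post).statistic,
        "t_statistic": stats.ttest_ind(pre, post, equal_var=False).statistic,  # Welch's t-test
    }

# Toy series with a mean and variance shift after t = 500
rng = np.random.default_rng(42)
series = np.concatenate([rng.normal(0, 1, 500), rng.normal(0.5, 2, 500)])
print(segment_features(series, boundary=500))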

Repository Structure

structural_break_detection/
├── README.md                          # This file
├── requirements.txt                   # Dependencies
├── run_all_experiments.py             # Full experiment runner
├── quick_benchmark.py                 # Fast benchmarking
│
├── results_dataset_a.csv              # Dataset A results
├── results_dataset_a.md               # Dataset A results (markdown)
├── results_dataset_b.csv              # Dataset B results
├── results_dataset_b.md               # Dataset B results (markdown)
│
├── xgb_tuned_regularization/          # Top performer in local benchmarks
├── weighted_dynamic_ensemble/         # #2 by robust score
├── quad_model_ensemble/               # #3 by robust score
├── mlp_ensemble_deep_features/        # #4 by robust score
│
├── gradient_boost_comprehensive/      # Former #1, overfits
├── meta_stacking_7models/             # Former #2, overfits
│
├── wavelet_lstm/                      # Deep learning (underperforms)
├── hierarchical_transformer/          # Deep learning (underperforms)
│
├── qlearning_rolling_stats/           # RL approach
├── qlearning_bayesian_cpd/            # RL + Bayesian
├── dqn_base_model_selector/           # DQN approach
│
└── ... (25 detectors total)

Each detector folder contains:

  • features.py - Feature extraction class
  • model.py - Detector model class
  • main.py - Training and inference scripts

Usage

Installation

pip install -r requirements.txt

Training the Top Performer

cd xgb_tuned_regularization
python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib

Running Inference

python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib

Benchmarking All Detectors

# Quick test (top 5 detectors)
python quick_benchmark.py --data-dir /path/to/data

# Full benchmark (all 25 detectors)
python run_all_experiments.py --data-dir /path/to/data --output results.csv

Evaluation Metrics

| Metric | Description |
|--------|-------------|
| ROC AUC | Area Under the ROC Curve (discrimination ability) |
| Stability Score | Cross-dataset consistency (higher = better generalization) |
| Robust Score | Min AUC × Stability (overall reliability) |
| F1 Score | Harmonic mean of precision and recall |
| Recall | True positive rate (sensitivity) |

Lessons Learned

  1. Single-dataset benchmarks lie: The #1 model on Dataset A dropped to #6 on Dataset B.

  2. Regularization prevents overfitting: xgb_tuned_regularization uses strong L1/L2 penalties and generalizes well.

  3. Complexity ≠ Robustness: meta_stacking_7models (7 models, 32,000s) is less stable than xgb_tuned_regularization (1 model, 185s).

  4. Stability Score matters: Always evaluate on multiple datasets and measure consistency.

  5. Deep learning needs multivariate input: Transformer and LSTM fail on univariate time series.

  6. Simple features work: Statistical tests (KS, t-test) as features outperform complex architectures.

Dependencies

  • Python 3.8+
  • scikit-learn
  • xgboost
  • lightgbm
  • PyTorch (for neural network models)
  • PyWavelets
  • scipy
  • pandas
  • numpy

References

Financial Machine Learning

  • Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

Statistical Methods

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
  • Welch, B. L. (1947). "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved." Biometrika, 34(1/2), 28–35. Link
  • Mann, H. B., & Whitney, D. R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other." The Annals of Mathematical Statistics, 18(1), 50–60. Link

Change Point Detection

  • Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115. Link
  • Adams, R. P., & MacKay, D. J. C. (2007). "Bayesian Online Changepoint Detection." arXiv:0710.3742. Link
  • Sharifi, A., Sun, W., & Seco, L. A. (2025). "Detecting Structural Breaks in Dynamic Environments Using Reinforcement Learning and Bayesian Change Point Models." SSRN. Link

Machine Learning

  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD, 785–794. Link

Deep Learning & Transformers

  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS, 30. Link
  • Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780.
  • Wang, Y. et al. (2024). "TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables." arXiv:2402.19072. Link
  • Liu, Y. et al. (2024). "ExoTST: Exogenous-Aware Temporal Sequence Transformer." arXiv:2410.12184. Link

Signal Processing

  • Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
  • Song, J. H., Lopez de Prado, M., Simon, H., & Wu, K. (2014). "Exploring Irregular Time Series Through Non-Uniform Fast Fourier Transform." Proceedings of the International Conference for High Performance Computing, IEEE. Link

Acknowledgments

This project was developed for the ADIA Lab Structural Break Challenge, a machine learning competition hosted by CrunchDAO in partnership with ADIA Lab. The challenge focused on detecting structural breaks in univariate time series data.

License

MIT License
