Disclaimer: For academic and educational use only—neither financial nor trading advice. This repository demonstrates a well-documented phenomenon: backtest overfitting. Models that achieved strong in-sample performance (Dataset A) showed degraded out-of-sample results (Dataset B), consistent with findings that backtested metrics offer limited predictive value for live performance. Do not use these models for real trading decisions.
Note: These results are based on local validation sets provided during the competition phase and do not represent final official leaderboard standings.
A comprehensive collection of 25 structural break detection methods for univariate time series, developed for the ADIA Lab Structural Break Challenge hosted by CrunchDAO.
Documentation: For detailed documentation with interactive navigation, visit the GitHub Pages site.
We evaluated all 25 detectors on two independent datasets. The results reveal a critical insight:
Models that ranked #1 and #2 on Dataset A dropped to #6 and #10 on Dataset B.
This demonstrates that single-dataset performance is unreliable. Our evaluation emphasizes cross-dataset generalization using stability metrics.
The Stability Score measures cross-dataset consistency:
Stability Score = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
The Robust Score combines worst-case performance with stability:
Robust Score = Min(AUC_A, AUC_B) × Stability Score
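For reference, a minimal Python sketch of both formulas (the function names are ours; the example values come from row 1 of the leaderboard below):

```python
def stability_score(auc_a: float, auc_b: float) -> float:
    """Cross-dataset consistency: 1.0 means identical AUC on both datasets."""
    return 1 - abs(auc_a - auc_b) / max(auc_a, auc_b)

def robust_score(auc_a: float, auc_b: float) -> float:
    """Worst-case AUC weighted by cross-dataset stability."""
    return min(auc_a, auc_b) * stability_score(auc_a, auc_b)

# Example: xgb_tuned_regularization
print(f"{stability_score(0.7423, 0.7705):.3f}")  # 0.963
print(f"{robust_score(0.7423, 0.7705):.3f}")     # 0.715
```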
| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|---|---|---|---|---|---|---|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
These models showed strong Dataset A performance but failed to generalize:
| Model | Dataset A Rank | Dataset B Rank | AUC Drop | Stability |
|---|---|---|---|---|
| gradient_boost_comprehensive | #1 (0.7930) | #6 (0.6533) | -17.6% | 82.4% |
| meta_stacking_7models | #2 (0.7662) | #10 (0.6422) | -16.2% | 83.8% |
| knn_spectral_fft | #15 (0.5793) | #23 (0.4808) | -17.0% | 83.0% |
| hypothesis_testing_pure | #20 (0.5394) | #25 (0.4118) | -23.7% | 76.3% |
Lesson: High single-dataset AUC does not guarantee real-world performance. Always validate on multiple datasets.
The Stability Score measures how consistently a model performs across datasets:
Stability = 1 - |AUC_A - AUC_B| / max(AUC_A, AUC_B)
- 100%: Identical performance on both datasets
- < 85%: Significant overfitting concern
| Model | Stability | Interpretation |
|---|---|---|
| xgb_selective_spectral | 99.7% | Near-identical performance |
| weighted_dynamic_ensemble | 98.4% | Excellent generalization |
| quad_model_ensemble | 98.0% | Excellent generalization |
| xgb_core_7features | 98.0% | Excellent generalization |
| xgb_70_statistical | 97.1% | Strong generalization |
| xgb_tuned_regularization | 96.3% | Strong generalization |
| Model | Stability | Warning |
|---|---|---|
| hypothesis_testing_pure | 76.3% | Severe instability |
| welch_ttest | 71.9% | Severe instability |
| gradient_boost_comprehensive | 82.4% | Overfit to Dataset A |
| knn_spectral_fft | 83.0% | Overfit to Dataset A |
| meta_stacking_7models | 83.8% | Overfit to Dataset A |
xgb_tuned_regularization is the top performer when considering cross-dataset performance in local validation:
| Metric | xgb_tuned_regularization | gradient_boost (former #1) | meta_stacking (former #2) |
|---|---|---|---|
| Dataset A AUC | 0.7423 | 0.7930 | 0.7662 |
| Dataset B AUC | 0.7705 | 0.6533 | 0.6422 |
| Min AUC | 0.7423 | 0.6533 | 0.6422 |
| Stability | 96.3% | 82.4% | 83.8% |
| Robust Score | 0.715 | 0.538 | 0.538 |
| Train Time | 60-185s | 179-451s | 332-32,030s |
- Best Robust Score (0.715) — Highest combined performance and stability
- Actually improved on Dataset B — 0.7423 → 0.7705 (+3.8%)
- High Stability (96.3%) — Consistent across different data
- Fast Training — 60-185s vs hours for complex ensembles
- Strong Regularization — L1/L2 prevents overfitting
```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=5,           # Shallow trees
    learning_rate=0.05,
    reg_alpha=0.5,         # Strong L1 regularization
    reg_lambda=2.0,        # Strong L2 regularization
    min_child_weight=10,   # Larger minimum leaf weight
)
```
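A hedged usage sketch with synthetic placeholder data (in the repository, the real feature matrix comes from each detector's `features.py`):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder features/labels standing in for the extracted segment features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_tr, y_tr)  # `model` is the regularized XGBClassifier defined above
print("Validation AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```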
Key lessons from the benchmark:

- Regularization is Critical: Models with aggressive regularization generalize better. `xgb_tuned_regularization` uses strong L1/L2 penalties.
- Simpler Models Generalize Better: `xgb_core_7features` (7 features) has 98.0% stability vs `meta_stacking_7models` (339 features) at 83.8%.
- Feature Engineering Matters: Statistical features (KS statistic, Cohen's d, t-test) consistently outperform raw time series inputs.
- Ensemble Diversity: Combining different model types (trees + neural nets) works, but keep it simple.
- Complex Ensembles Overfit: `meta_stacking_7models` (7 models, 32,000s training) dropped from #2 to #10.
- Deep Learning Fails on Univariate Data: Transformer (0.49-0.54 AUC) and LSTM (0.50-0.52 AUC) remain near-random on BOTH datasets.
- RL Approaches Underperform: Q-learning and DQN (0.46-0.55 AUC) don't compete with supervised learning.
- Pure Statistical Tests Are Unstable: `hypothesis_testing_pure` dropped from 0.54 to 0.41 AUC.
| Category | Model | Notes |
|---|---|---|
| Best Robust Score | xgb_tuned_regularization | 0.715 robust, 96.3% stable |
| Fastest Training | xgb_core_7features | 7 features, 40s training, 98% stable |
| Highest Stability | xgb_selective_spectral | 99.7% stability |
| No ML Required | segment_statistics_only | Statistical only, 95.4% stable |
| Rank | Detector | Dataset A | Dataset B | Min AUC | Stability | Robust Score |
|---|---|---|---|---|---|---|
| 1 | xgb_tuned_regularization | 0.7423 | 0.7705 | 0.7423 | 96.3% | 0.715 |
| 2 | weighted_dynamic_ensemble | 0.6742 | 0.6849 | 0.6742 | 98.4% | 0.664 |
| 3 | quad_model_ensemble | 0.6756 | 0.6622 | 0.6622 | 98.0% | 0.649 |
| 4 | mlp_ensemble_deep_features | 0.7122 | 0.6787 | 0.6787 | 95.3% | 0.647 |
| 5 | xgb_selective_spectral | 0.6451 | 0.6471 | 0.6451 | 99.7% | 0.643 |
| 6 | xgb_70_statistical | 0.6685 | 0.6493 | 0.6493 | 97.1% | 0.631 |
| 7 | mlp_xgb_simple_blend | 0.6746 | 0.6399 | 0.6399 | 94.9% | 0.607 |
| 8 | xgb_core_7features | 0.6188 | 0.6315 | 0.6188 | 98.0% | 0.606 |
| 9 | xgb_30f_fast_inference | 0.6282 | 0.6622 | 0.6282 | 94.9% | 0.596 |
| 10 | xgb_importance_top15 | 0.6723 | 0.6266 | 0.6266 | 93.2% | 0.584 |
| 11 | segment_statistics_only | 0.6249 | 0.5963 | 0.5963 | 95.4% | 0.569 |
| 12 | meta_stacking_7models | 0.7662 | 0.6422 | 0.6422 | 83.8% | 0.538 |
| 13 | gradient_boost_comprehensive | 0.7930 | 0.6533 | 0.6533 | 82.4% | 0.538 |
| 14 | bayesian_bocpd_fused_lasso | 0.5005 | 0.4884 | 0.4884 | 97.6% | 0.477 |
| 15 | wavelet_lstm | 0.5249 | 0.5000 | 0.5000 | 95.3% | 0.476 |
| 16 | qlearning_rolling_stats | 0.5488 | 0.5078 | 0.5078 | 92.5% | 0.470 |
| 17 | dqn_base_model_selector | 0.5474 | 0.5067 | 0.5067 | 92.6% | 0.469 |
| 18 | kolmogorov_smirnov_xgb | 0.4939 | 0.5205 | 0.4939 | 94.9% | 0.469 |
| 19 | qlearning_bayesian_cpd | 0.5540 | 0.5067 | 0.5067 | 91.5% | 0.463 |
| 20 | hierarchical_transformer | 0.5439 | 0.4862 | 0.4862 | 89.4% | 0.435 |
| 21 | qlearning_memory_tabular | 0.4986 | 0.4559 | 0.4559 | 91.4% | 0.417 |
| 22 | knn_wavelet | 0.5812 | 0.4898 | 0.4898 | 84.3% | 0.413 |
| 23 | knn_spectral_fft | 0.5793 | 0.4808 | 0.4808 | 83.0% | 0.399 |
| 24 | welch_ttest | 0.4634 | 0.6444 | 0.4634 | 71.9% | 0.333 |
| 25 | hypothesis_testing_pure | 0.5394 | 0.4118 | 0.4118 | 76.3% | 0.314 |
Structural breaks are inherently rare events. This creates fundamental model bias:
- Models can achieve ~70% accuracy by predicting "no break" for everything
- Several models (hierarchical_transformer, wavelet_lstm, welch_ttest) exhibit this behavior with 0% recall
Not all errors are equal in structural break detection:
| Error Type | Description | Cost |
|---|---|---|
| False Negative (FN) | Missing a real break | Moderate — Missed opportunity to act on regime change |
| False Positive (FP) | Predicting break when none exists | Severe — Triggers unnecessary position changes, transaction fees, slippage |
This asymmetry means we should prioritize precision alongside recall, making F1 score a critical metric.
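To make the imbalance and asymmetry concrete, a toy example (the 70/30 split and the hypothetical detector's predictions are illustrative only, not repository data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([0] * 70 + [1] * 30)   # ~30% of series contain a break
y_naive = np.zeros_like(y_true)          # always predict "no break"
print(accuracy_score(y_true, y_naive))   # 0.70 -- looks respectable
print(recall_score(y_true, y_naive))     # 0.00 -- misses every break

# A hypothetical detector: 5 false positives, 20 of 30 breaks found.
y_pred = np.array([0] * 65 + [1] * 5 + [0] * 10 + [1] * 20)
print(precision_score(y_true, y_pred))   # 0.80
print(f1_score(y_true, y_pred))          # ~0.73
```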
The hierarchical_transformer (0.49-0.54 AUC), wavelet_lstm (0.50-0.52 AUC), and RL models (0.46-0.55 AUC) all underperformed tree-based ensembles on BOTH datasets.
The Core Problem: Univariate Features Are Insufficient
These architectures are designed to learn relationships between multiple input variables. With only a univariate time series and its derived features, they lack the rich input space needed to learn meaningful patterns.
What Would Help: Adding exogenous variables (correlated assets, macroeconomic indicators, sentiment data, volume) would provide the multi-dimensional context these architectures need.
Key Insight: Tree-based ensembles excel at learning from handcrafted statistical features that explicitly encode distributional differences. Deep learning needs raw, multi-dimensional input to learn such representations.
Each time series is split at a boundary into pre-segment and post-segment. Features capture differences between these segments:
```
Pre-segment   | Post-segment
--------------+---------------
values[0:T]   | values[T:end]
```
| Category | Description | Example Features |
|---|---|---|
| Moments | Statistical moments per segment | mean_diff, std_ratio, skew_diff, kurt_diff |
| Effect Sizes | Standardized differences | Cohen's d, Glass's delta, Hedges' g |
| Distribution Tests | Hypothesis test statistics | t_statistic, ks_statistic, mann_whitney_u |
| Quantiles | Percentile comparisons | median_diff, iqr_ratio, q25_diff, q75_diff |
| Spectral | Frequency domain | dominant_freq_diff, spectral_centroid_diff |
| Wavelet | Multi-scale decomposition | dwt_energy_ratio, wavelet_entropy_diff |
| Temporal | Time-dependent patterns | acf_diff, trend_diff, volatility_ratio |
- `ks_statistic` - Kolmogorov-Smirnov test statistic
- `mean_diff_normalized` - Normalized mean difference
- `std_ratio` - Standard deviation ratio
- `cohens_d` - Cohen's effect size
- `mann_whitney_u` - Mann-Whitney U statistic
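As an illustration, a hedged sketch of how such segment-difference features can be computed with `scipy` (the exact definitions and normalizations in each detector's `features.py` may differ):

```python
import numpy as np
from scipy import stats

def extract_features(values: np.ndarray, T: int) -> dict:
    """Compare the distributions before and after the boundary T."""
    pre, post = values[:T], values[T:]
    pooled_std = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)
    return {
        "ks_statistic": stats.ks_2samp(pre, post).statistic,
        # Normalizing by the full-series std is our assumption, not the repo's choice.
        "mean_diff_normalized": (post.mean() - pre.mean()) / (values.std(ddof=1) + 1e-12),
        "std_ratio": post.std(ddof=1) / (pre.std(ddof=1) + 1e-12),
        "cohens_d": (post.mean() - pre.mean()) / (pooled_std + 1e-12),
        "mann_whitney_u": stats.mannwhitneyu(pre, post).statistic,
    }

# Example: a mean shift at T=500 should yield a large ks_statistic and cohens_d.
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])
print(extract_features(series, T=500))
```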
structural_break_detection/
├── README.md # This file
├── requirements.txt # Dependencies
├── run_all_experiments.py # Full experiment runner
├── quick_benchmark.py # Fast benchmarking
│
├── results_dataset_a.csv # Dataset A results
├── results_dataset_a.md # Dataset A results (markdown)
├── results_dataset_b.csv # Dataset B results
├── results_dataset_b.md # Dataset B results (markdown)
│
├── xgb_tuned_regularization/ # Top performer in local benchmarks
├── weighted_dynamic_ensemble/ # #2 by robust score
├── quad_model_ensemble/ # #3 by robust score
├── mlp_ensemble_deep_features/ # #4 by robust score
│
├── gradient_boost_comprehensive/ # Former #1, overfits
├── meta_stacking_7models/ # Former #2, overfits
│
├── wavelet_lstm/ # Deep learning (underperforms)
├── hierarchical_transformer/ # Deep learning (underperforms)
│
├── qlearning_rolling_stats/ # RL approach
├── qlearning_bayesian_cpd/ # RL + Bayesian
├── dqn_base_model_selector/ # DQN approach
│
└── ... (25 detectors total)
Each detector folder contains:
- `features.py` - Feature extraction class
- `model.py` - Detector model class
- `main.py` - Training and inference scripts
```bash
pip install -r requirements.txt
```

```bash
cd xgb_tuned_regularization
python main.py --mode train --data-dir /path/to/data --model-path ./model.joblib
python main.py --mode infer --data-dir /path/to/data --model-path ./model.joblib
```

```bash
# Quick test (top 5 detectors)
python quick_benchmark.py --data-dir /path/to/data

# Full benchmark (all 25 detectors)
python run_all_experiments.py --data-dir /path/to/data --output results.csv
```

| Metric | Description |
|---|---|
| ROC AUC | Area Under ROC Curve (discrimination ability) |
| Stability Score | Cross-dataset consistency (higher = better generalization) |
| Robust Score | Min AUC × Stability (overall reliability) |
| F1 Score | Harmonic mean of precision and recall |
| Recall | True positive rate (sensitivity) |
- Single-dataset benchmarks lie: The #1 model on Dataset A dropped to #6 on Dataset B.
- Regularization prevents overfitting: `xgb_tuned_regularization` uses strong L1/L2 penalties and generalizes well.
- Complexity ≠ Robustness: `meta_stacking_7models` (7 models, 32,000s) is less stable than `xgb_tuned_regularization` (1 model, 185s).
- Stability Score matters: Always evaluate on multiple datasets and measure consistency.
- Deep learning needs multivariate input: Transformer and LSTM fail on univariate time series.
- Simple features work: Statistical tests (KS, t-test) as features outperform complex architectures.
- Python 3.8+
- scikit-learn
- xgboost
- lightgbm
- PyTorch (for neural network models)
- PyWavelets
- scipy
- pandas
- numpy
- Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates.
- Welch, B. L. (1947). "The Generalization of 'Student's' Problem when Several Different Population Variances are Involved." Biometrika, 34(1/2), 28–35. Link
- Mann, H. B., & Whitney, D. R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other." The Annals of Mathematical Statistics, 18(1), 50–60. Link
- Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100–115. Link
- Adams, R. P., & MacKay, D. J. C. (2007). "Bayesian Online Changepoint Detection." arXiv:0710.3742. Link
- Sharifi, A., Sun, W., & Seco, L. A. (2025). "Detecting Structural Breaks in Dynamic Environments Using Reinforcement Learning and Bayesian Change Point Models." SSRN. Link
- Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." KDD, 785–794. Link
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS, 30. Link
- Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780.
- Wang, Y. et al. (2024). "TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables." arXiv:2402.19072. Link
- Liu, Y. et al. (2024). "ExoTST: Exogenous-Aware Temporal Sequence Transformer." arXiv:2410.12184. Link
- Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
- Song, J. H., Lopez de Prado, M., Simon, H., & Wu, K. (2014). "Exploring Irregular Time Series Through Non-Uniform Fast Fourier Transform." Proceedings of the International Conference for High Performance Computing, IEEE. Link
This project was developed for the ADIA Lab Structural Break Challenge, a machine learning competition hosted by CrunchDAO in partnership with ADIA Lab. The challenge focused on detecting structural breaks in univariate time series data.
MIT License