Machine-Learning Exploration on Soft Commodity Futures

Quantitative Trading Tool – Work in Progress

1. Objective and Data

This project investigates whether simple, interpretable machine-learning techniques can extract useful structure from soft commodity futures (sugar, coffee, cocoa). The focus is on building a realistic research pipeline that respects time-series constraints and is easy to explain in an interview, rather than on optimizing trading performance.

Universe (Yahoo Finance tickers):

Sugar #11 front contract: SB=F
Coffee front contract: KC=F
Cocoa front contract: CC=F

Data:

Daily prices from 2010 to today, downloaded via yfinance.
Returns and features computed at daily frequency.

The workflow is: raw prices → cleaned prices and returns → features (spreads, volatility, seasonality) → regimes (clustering) → simple ML signal.

2. Pipeline Overview

The overall research pipeline is:

graph TD;
   A[Raw Yahoo Finance prices<br>- SB=F, KC=F, CC=F -] --> B[Data washing: align calendars, fill gaps, clip outliers];
   B --> C[Cleaned prices];
   C --> D[Log-returns];
   C --> E[Spreads and z-scores];
   D --> F[Rolling stats:<br>volatility and momentum];
   C --> G[Seasonality features:<br>day-of-year sine/cosine, month];
   E --> H[Feature matrix X];
   F --> H;
   G --> H;
   D --> I[Target y<br>- next-day sugar log-return / sign -];
   H --> J[Clustering - PCA + KMeans -:<br>market regimes];
   H --> K[Logistic regression with time-series CV];
   J --> L[Regime interpretation];
   K --> M[Directional signal and simple trading rule];

This structure is implemented in:

src/data.py: data download, cleaning, returns.
src/features.py: spreads, z-scores, rolling stats, seasonality, target.
notebooks/01_exploration.ipynb: EDA, prices/returns, correlations.
notebooks/02_modeling_and_backtest.ipynb: features, regimes, logistic regression.

3. Data Cleaning and Exploratory Analysis

3.1 Data cleaning

Steps applied to the raw Yahoo data:

Align trading days across sugar, coffee, cocoa.
Fill small gaps with forward- and back-filling.
Clip extreme returns to reduce the influence of obvious bad ticks.
Rebuild a cleaned price series from the clipped log-returns.

The result is a smooth but realistic price history that preserves overall structure while avoiding single-point artifacts that would dominate ML training.
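The cleaning steps above can be sketched as follows. This is a minimal illustration rather than the code in src/data.py: the function name `clean_prices`, the gap limit `max_gap`, the clipping threshold `clip` and the sample prices are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_prices(raw: pd.DataFrame, max_gap: int = 3, clip: float = 0.10) -> pd.DataFrame:
    """Fill short gaps, clip extreme log-returns, rebuild prices (illustrative values)."""
    # Short gaps are filled forward first, then backward for leading NaNs;
    # rows that still contain NaNs are dropped so all contracts share one calendar.
    prices = raw.sort_index().ffill(limit=max_gap).bfill(limit=max_gap).dropna()

    # Clip daily log-returns to limit the influence of obvious bad ticks.
    log_ret = np.log(prices).diff().clip(lower=-clip, upper=clip)

    # Rebuild each price path from its first price and the clipped returns.
    growth = np.exp(log_ret.fillna(0.0).cumsum())
    return growth.mul(prices.iloc[0], axis="columns")


# Made-up prices with one gap per contract and one bad tick in sugar.
idx = pd.bdate_range("2024-01-01", periods=6)
raw = pd.DataFrame(
    {
        "SB=F": [20.0, 20.2, np.nan, 20.4, 40.0, 20.5],
        "KC=F": [180.0, 181.0, 182.0, np.nan, 183.0, 184.0],
    },
    index=idx,
)
cleaned = clean_prices(raw)
```

One consequence of rebuilding prices from clipped returns is that a genuine move larger than the threshold would shift the rebuilt level permanently, so the threshold should be loose enough to catch only bad ticks.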

3.2 Prices vs returns (Figure 1)

Figure 1 (report/figures/prices_and_returns.png, produced in 01_exploration.ipynb) shows three panels of price and log-return per commodity:

Prices for sugar, coffee and cocoa exhibit long-term trends and occasional sharp moves (non-stationary behaviour).
Log-returns fluctuate around zero with time-varying volatility (closer to stationarity).

Working with log-returns

$$ r_t = \log(P_t) - \log(P_{t-1}) $$

is standard in quantitative finance because log-returns are additive over time and much closer to stationary than price levels, a property that most statistical and ML methods rely on.
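As a small worked example of the additivity property, the snippet below computes daily log-returns on a handful of made-up prices and checks that they sum to the log-return over the whole window.

```python
import numpy as np
import pandas as pd

# Made-up sugar prices, for illustration only.
prices = pd.Series([20.0, 20.5, 19.8, 20.1], name="SB=F")

# r_t = log(P_t) - log(P_{t-1}); the first value is undefined and dropped.
log_ret = np.log(prices).diff().dropna()

# The multi-day log-return is the sum of the daily log-returns.
total = np.log(prices.iloc[-1] / prices.iloc[0])
additive = bool(np.isclose(log_ret.sum(), total))
```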

3.3 Return distributions and correlations (Figures 2–3)

Figure 2 (histogram of next-day sugar log-returns from 02_modeling_and_backtest.ipynb) shows a distribution centered close to zero with a few larger moves in the tails. Over 2010–2025:

Average daily log-return is near zero (about $-0.008\%$ for cocoa, $+0.002\%$ for coffee, $-0.035\%$ for sugar).
Daily volatility is around 2% for all three contracts.

Figure 3 (report/figures/correlation_heatmap.png) shows a correlation heatmap of daily returns. Correlations are positive but imperfect between the three commodities. This supports the use of cross-asset features (spreads and relative value) rather than treating each contract in isolation.

4. Features and Target Definition

4.1 Spread and z-score features

We focus on simple, interpretable spreads:

Sugar–Coffee: spread_sb_kc = SB - KC
Sugar–Cocoa: spread_sb_cc = SB - CC

For each spread, we compute a rolling 60-day mean $\mu_t$ and standard deviation $\sigma_t$, and define a z-score:

$$ z_t = \frac{\text{spread}_t - \mu_t}{\sigma_t}. $$

Interpretation:

$z_t \approx 0$: spread is near its recent average, "normal" regime.
$z_t \gg 0$ or $z_t \ll 0$: spread is unusually wide or tight, potential mean-reversion opportunity.
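A rolling z-score of this kind might be computed as in the sketch below. The helper name `rolling_zscore` and the synthetic random-walk prices are assumptions for illustration; only the 60-day window and the spread definition come from the text.

```python
import numpy as np
import pandas as pd


def rolling_zscore(spread: pd.Series, window: int = 60) -> pd.Series:
    """z_t = (spread_t - mu_t) / sigma_t over a trailing window that ends at t."""
    mu = spread.rolling(window).mean()
    sigma = spread.rolling(window).std()  # sample standard deviation (ddof=1)
    return (spread - mu) / sigma


# Synthetic random-walk prices standing in for SB=F and KC=F.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2023-01-02", periods=250)
sb = pd.Series(20 + rng.normal(0, 0.3, 250).cumsum(), index=idx)
kc = pd.Series(180 + rng.normal(0, 2.0, 250).cumsum(), index=idx)

spread_sb_kc = sb - kc
spread_sb_kc_z = rolling_zscore(spread_sb_kc)
```

Because the window is trailing, each z-score uses only data available up to day t. The first 59 values are undefined and would need to be dropped before modeling.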

4.2 Return-based features

From daily log-returns of each contract, we compute rolling statistics:

5-day and 20-day rolling mean: short- and medium-term momentum.
5-day and 20-day rolling volatility: local risk level.

These features capture:

Recent trends (e.g., sustained up-move in sugar),
Volatility clustering (higher volatility tends to persist),
Differences across commodities (one may be in a quiet regime while another is stressed).
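The rolling statistics above could be built along these lines. The function name and the `_ret_mean_` column suffix are assumptions; the `_ret_vol_` suffix follows the naming used in the clustering section, and the returns are synthetic.

```python
import numpy as np
import pandas as pd


def rolling_return_features(log_ret: pd.DataFrame, windows=(5, 20)) -> pd.DataFrame:
    """Trailing mean (momentum) and standard deviation (volatility) per contract."""
    feats = {}
    for col in log_ret.columns:
        for w in windows:
            feats[f"{col}_ret_mean_{w}"] = log_ret[col].rolling(w).mean()
            feats[f"{col}_ret_vol_{w}"] = log_ret[col].rolling(w).std()
    return pd.DataFrame(feats)


# Synthetic daily log-returns with roughly 2% daily volatility.
rng = np.random.default_rng(1)
idx = pd.bdate_range("2023-01-02", periods=100)
log_ret = pd.DataFrame(rng.normal(0, 0.02, size=(100, 3)), index=idx, columns=["sb", "kc", "cc"])

feats = rolling_return_features(log_ret)
```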

4.3 Seasonality features

To model the strong seasonality of agricultural markets, we add:

Day-of-year encoded as sine and cosine:

$$ \mathrm{doy\_sin}_t = \sin\left(2\pi \cdot \text{doy}_t / 365\right),\quad \mathrm{doy\_cos}_t = \cos\left(2\pi \cdot \text{doy}_t / 365\right) $$
Month as an integer (1–12).

This allows the model to learn cyclic patterns (harvest, planting, holidays) without discontinuity at year-end.
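A minimal sketch of the encoding, assuming a hypothetical helper `seasonality_features` that takes the date index of the feature matrix:

```python
import numpy as np
import pandas as pd


def seasonality_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Day-of-year as a point on the unit circle, plus the calendar month."""
    doy = index.dayofyear.to_numpy()
    angle = 2 * np.pi * doy / 365
    return pd.DataFrame(
        {
            "doy_sin": np.sin(angle),
            "doy_cos": np.cos(angle),
            "month": index.month.to_numpy(),
        },
        index=index,
    )


index = pd.date_range("2023-01-01", "2023-12-31", freq="D")
season = seasonality_features(index)

# Distance between 31 December and 1 January in (sin, cos) space.
first, last = season.iloc[0], season.iloc[-1]
year_end_gap = float(np.hypot(first["doy_sin"] - last["doy_sin"], first["doy_cos"] - last["doy_cos"]))
```

In this encoding 31 December and 1 January sit next to each other on the circle, whereas the integer month jumps from 12 to 1.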

4.4 Target variable

We consider two related targets derived from sugar (SB=F):

Regression target: next-day log return $r_{t+1}$.
Classification target: sign of next-day return:
- $y_t = 1$ if $r_{t+1} > 0$,
- $y_t = 0$ otherwise.

For modeling and interpretability, we primarily use the classification version.
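The alignment of features at day t with the return at day t+1 can be sketched as below. The function name and the column names `y_reg` and `y_clf` are assumptions made for the example.

```python
import pandas as pd


def make_targets(sugar_log_ret: pd.Series) -> pd.DataFrame:
    """Align r_{t+1} on row t and derive the up/down label."""
    next_ret = sugar_log_ret.shift(-1)  # row t now holds r_{t+1}
    out = pd.DataFrame({"y_reg": next_ret, "y_clf": (next_ret > 0).astype(int)})
    # The last row has no next-day return, so it cannot be labeled.
    return out.dropna(subset=["y_reg"])


# Made-up sugar log-returns.
r = pd.Series([0.01, -0.02, 0.0, 0.03])
targets = make_targets(r)
```

Shifting the return backwards by one row, rather than shifting the features forwards, keeps each feature row paired only with information that arrives after it.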

5. Regime Clustering (Unsupervised Learning, Figures 4–5)

To identify market regimes, we:

Select features summarizing risk, relative value, and seasonality:
- 20-day volatilities for all three commodities (*_ret_vol_20),
- Spread z-scores (spread_sb_kc_z, spread_sb_cc_z),
- doy_sin, doy_cos.
Standardize features (zero mean, unit variance).
Apply PCA and keep the first two components for visualization.
Run KMeans with $k = 3$ clusters on the standardized features.
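The steps above map onto scikit-learn roughly as follows. This is a sketch on random placeholder data, not the notebook code: the seven columns stand in for the three volatilities, two z-scores and two seasonality terms, and `n_init` and `random_state` are assumed settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: 300 days x 7 regime features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))

# Standardize so that no feature dominates the distance metric.
X_std = StandardScaler().fit_transform(X)

# Cluster in the full standardized feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# Project to two components for the scatter plot only.
coords = PCA(n_components=2).fit_transform(X_std)
```

Clustering on the standardized features and using PCA only for plotting means the regimes are not restricted to the variance captured by the first two components.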

Figure 4 (report/figures/market_regimes_pca_clusters.png) is the PCA scatter plot colored by cluster and shows three well-separated regimes. Cluster-level averages (Figure 5, report/figures/regime_cluster_summary.png) reveal:

A high-volatility, large |z-score| regime (stressed markets, large spread dislocations),
A low-volatility, near-zero spread regime (calm, "normal" conditions),
An intermediate regime.

These regimes align with intuitive phases of soft commodity markets, where weather shocks, geopolitical events, and policy changes periodically push spreads far away from their typical ranges.

6. Supervised Model: Predicting Next-Day Direction

6.1 Model choice

We use a logistic regression model to predict the sign of next-day sugar return:

$$ P(y_t = 1 \mid X_t) = \sigma(\beta_0 + \beta^\top X_t), $$

where $\sigma$ is the logistic function and $X_t$ are the features described above.

Reasons for this choice:

Interpretability: coefficients tell us which features push the probability of an up-move up or down.
Robustness on relatively small, noisy financial data.
Baseline against which more complex models (random forests, gradient boosting) can be compared later.

6.2 Time-series cross-validation (Figure 6)

To avoid look-ahead bias, we use TimeSeriesSplit instead of standard K-fold:

Each fold trains on a contiguous block of past data and tests on a later block.
This mimics how a real strategy would be deployed over time.

The notebook reports:

Fold-by-fold accuracy,
Mean and standard deviation of time-series CV accuracy,
A line plot of accuracy vs fold with a 50% horizontal line (random baseline).

Figure 6 (report/figures/time_series_cv_accuracy.png) shows time-series CV accuracy by fold. In our runs, accuracy is about 50.0% ± 1.2%, which is indistinguishable from the 50% random baseline. This suggests that, with a simple daily feature set on liquid futures, next-day directional predictability is extremely weak, consistent with efficient-market intuition.
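The evaluation loop can be sketched as below, using placeholder data in which the labels are pure noise. The number of splits, the scaling step and `max_iter` are assumptions, not settings taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: rows are assumed to be ordered by date.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = rng.integers(0, 2, size=500)  # random up/down labels

# The scaler is refit inside each fold, so test data never leaks into it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every training observation precedes every test observation.
    assert train_idx.max() < test_idx.min()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

mean_acc, std_acc = float(np.mean(scores)), float(np.std(scores))
```

Note that 50% is the right baseline only when up and down days are roughly balanced; otherwise the share of the majority class is the number to beat.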

7. Interpretation and Context

7.1 Behaviour of returns and spreads (Figures 1–5)

Across 2010–2025, the descriptive statistics and the actual plots saved from the notebooks point to a realistic picture of liquid futures markets:

Daily log-returns for all three contracts have means close to zero and standard deviations around 2%, with occasional larger moves but no obvious drift (Figures 1–3).
Sugar–coffee and sugar–cocoa spreads have large negative levels (reflecting different price scales) but their z-scores are centered near 0 with standard deviation around 1.3 and rare excursions beyond ±4σ.
Figure 5 (bar chart of average 20-day volatility and spread z-scores per cluster from regime_cluster_summary.png) shows that the “stress” regime has both higher volatility and larger absolute z-scores, while the “calm” regime is characterized by lower volatility and spreads near zero.

This combination of volatility clustering, occasional spread dislocations and seasonality is exactly what one would expect for soft commodities affected by weather, crop cycles and macro shocks.

7.2 Connection to current markets

Recent years (2023–2025) have been marked by:

Weather-related supply concerns (droughts, crop diseases),
Macro volatility (rates, FX, geopolitical risk),
Increased attention to food security and the sugar–ethanol link.

In such an environment, regime identification (calm vs stress) and monitoring of spread dislocations are particularly relevant for commodity traders. The clustering and spread z-score analysis in this project provides a structured way to track these dynamics.

8. Possible Extensions

This work is intentionally simple and transparent. Natural extensions for a longer internship or research project include:

Trading rule and backtest
- Convert the logistic regression probabilities into long/short/flat positions on sugar.
- Run a walk-forward backtest aligned with the TimeSeriesSplit splits.
- Compute P&L, Sharpe ratio, drawdowns and turnover, and assess robustness under transaction cost assumptions.
Richer models
- Compare logistic regression with tree-based methods (random forests, gradient boosting) and potentially simple recurrent networks.
- Use the same time-series CV framework to check whether any performance gains are robust rather than artefacts of overfitting.
Extended feature set and universe
- Add curve-shape features (e.g., difference between nearby maturities), macro and FX indicators, or implied volatility when available.
- Extend the universe to related commodities or indices to test whether relative-value information across a broader set improves predictive power.
Regime-aware modeling
- Train separate models per regime (calm vs stress) or include regime indicators as additional features.
- Evaluate whether the signal is more informative in certain regimes and design trading rules that are active only when the model has higher predictive power.

These steps would build directly on the current pipeline (data → features → regimes → baseline model) and move the project closer to a production-grade quantitative research framework while preserving interpretability.
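As a starting point for the trading-rule extension, the sketch below turns predicted probabilities into positions and computes a cost-adjusted return series. It is not part of the current pipeline: the thresholds, the cost model and all function names are assumptions for illustration.

```python
import numpy as np


def positions_from_proba(p_up, upper=0.55, lower=0.45):
    """Long above `upper`, short below `lower`, flat in between."""
    p = np.asarray(p_up, dtype=float)
    return np.where(p > upper, 1.0, np.where(p < lower, -1.0, 0.0))


def strategy_returns(positions, next_ret, cost=0.0):
    """The position held at t earns r_{t+1}; `cost` is charged per unit of turnover."""
    positions = np.asarray(positions, dtype=float)
    turnover = np.abs(np.diff(positions, prepend=0.0))
    return positions * np.asarray(next_ret, dtype=float) - cost * turnover


def annualized_sharpe(daily_ret, periods=252):
    """Mean over standard deviation of daily returns, scaled by sqrt(periods)."""
    daily_ret = np.asarray(daily_ret, dtype=float)
    sd = daily_ret.std(ddof=1)
    return float("nan") if sd == 0 else float(np.sqrt(periods) * daily_ret.mean() / sd)


# Made-up probabilities and next-day log-returns.
p_up = [0.60, 0.50, 0.40, 0.58]
next_ret = [0.01, 0.02, -0.01, -0.02]

pos = positions_from_proba(p_up)
pnl = strategy_returns(pos, next_ret)
sharpe = annualized_sharpe(pnl)
```

In a walk-forward backtest the probabilities would come only from the out-of-sample test block of each split, so that no position depends on data from its own future.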

In an interview, you can present this as a work in progress that already demonstrates:

Solid handling of financial time series (cleaning, stationarity, cross-asset structure),
Use of unsupervised learning to detect regimes,
A first supervised ML signal evaluated with proper time-series methodology,
A clear roadmap for turning this into a more complete trading-research framework.
