Add Q-Q analysis metric for extreme value evaluation #1734

jehangirawan wants to merge 5 commits into ecmwf:develop
Conversation
jehangirawan force-pushed from a5f491d to 312301e
quantile_levels = ds_ref["quantile_levels"].values

# Find extreme regions (typically below 5% and above 95%)
lower_extreme_idx = quantile_levels < 0.05
can we make the thresholds arguments of the plot instead of hardcoding them?
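For example, a minimal sketch (the function and the lower_q / upper_q parameter names are placeholders, not from this PR):

import numpy as np

def extreme_masks(quantile_levels, lower_q=0.05, upper_q=0.95):
    # Tail thresholds become caller-configurable instead of hardcoded 0.05 / 0.95.
    quantile_levels = np.asarray(quantile_levels)
    return quantile_levels < lower_q, quantile_levels > upper_q

# With the defaults, only 0.01 and 0.99 fall in the tails here.
low_mask, high_mask = extreme_masks([0.01, 0.25, 0.5, 0.75, 0.99])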
qq_deviation = np.abs(p_quantiles - gt_quantiles)

# Calculate normalized deviation (relative to interquartile range of ground truth)
gt_q25 = gt_flat.quantile(0.25, dim="_agg_points")
can we have configurable thresholds instead of hardcoded values here? e.g. gt_q_low and gt_q_high
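A possible shape for that (sketch only, assuming gt_flat and qq_deviation are xarray DataArrays as in the diff; the function name is mine, gt_q_low / gt_q_high as suggested):

def normalized_qq_deviation(qq_deviation, gt_flat, gt_q_low=0.25, gt_q_high=0.75):
    # Normalize by a configurable inter-quantile range of the ground truth;
    # the defaults reproduce the hardcoded IQR behaviour.
    iqr = gt_flat.quantile(gt_q_high, dim="_agg_points") - gt_flat.quantile(gt_q_low, dim="_agg_points")
    return qq_deviation / iqr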
qq_data_coord_array = np.empty(overall_qq_score.shape, dtype=object)

# Iterate over all positions and create individual JSON strings
for idx in np.ndindex(overall_qq_score.shape):
this is extremely redundant. We should already have the to_list definition somewhere, but here is an example of how I'd make it shorter:
def _to_list(arr, idx):
    return (
        arr.values[(...,) + idx].tolist()
        if arr.ndim > 1
        else arr.values.tolist()
    )

def _to_float(arr, idx):
    return float(arr.values[idx]) if arr.ndim > 0 else float(arr.values)

qq_full_data = {
    "quantile_levels": quantile_levels.tolist(),
    "p_quantiles": _to_list(p_quantiles, idx),
    "gt_quantiles": _to_list(gt_quantiles, idx),
    "qq_deviation": _to_list(qq_deviation, idx),
    "qq_deviation_normalized": _to_list(qq_deviation_normalized, idx),
    "extreme_low_mse": _to_float(extreme_low_mse, idx),
    "extreme_high_mse": _to_float(extreme_high_mse, idx),
}
qq_data_coord_array[idx] = json.dumps(qq_full_data)
In general you can always ask Claude or Copilot (or ChatGPT) to restructure your code in a more modular and compact syntax ;)
Good catch! Definitely cleaner this way. I've refactored the logic to be more modular as suggested. Thanks for the tip!
combined_metrics = xr.concat(
    valid_scores,
    dim="metric",
    coords="different",
are you sure that this doesn't mess up the other scores?
Modified and tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
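For reference, a toy illustration (not the PR's code) of how coords="different" treats coordinates that differ between metrics:

import xarray as xr

a = xr.DataArray([0.5], dims=["metric"], coords={"metric": ["rmse"]}).assign_coords(units="K")
b = xr.DataArray([0.1], dims=["metric"], coords={"metric": ["qq_analysis"]}).assign_coords(units="1")
c = xr.concat([a, b], dim="metric", coords="different")
# Coordinates that differ between the inputs are stacked along "metric"
# rather than dropped, so the other scores keep their values:
print(c.coords["units"].values)  # ['K' '1']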
metric_stream.loc[criteria] = combined_metrics

# Preserve metric-specific coordinates that loc assignment may drop
please avoid adding metric-specific ifs to this function. This should stay as general as possible.
I know that the problem here is the additional dimension that you cannot stack with the others. Maybe we can have a meeting and brainstorm with @SavvasMel about how to do it properly, as in any case we will need it for the rank_histogram as well.
jehangirawan force-pushed from aa97395 to a48ddd7
# Restore metric-specific coordinates that were dropped by coords="minimal"
# (e.g., quantiles, extreme_percentiles for qq_analysis)
for coord_name in combined_metrics.coords:
I’ve updated the logic to be fully generic by checking dimension compatibility dynamically. This now handles the extra dimensions for qq_analysis and will work for other metrics like rank_histogram as well without any hardcoding.
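Roughly, the dimension-compatibility check could look like this (a sketch of the idea, assuming xarray objects; restore_dropped_coords is a placeholder name, not the PR's function):

import xarray as xr

def restore_dropped_coords(target: xr.DataArray, source: xr.DataArray) -> xr.DataArray:
    # Re-attach coordinates that an assignment dropped, but only when all of
    # their dimensions exist on the target, keeping the logic metric-agnostic.
    for name, coord in source.coords.items():
        if name not in target.coords and all(d in target.dims for d in coord.dims):
            target = target.assign_coords({name: coord})
    return target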
All comments addressed. Tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
Ready for re-review! @iluise
- Rename methods, add helper functions, fix naming conventions
- Make percentile thresholds configurable
- Create QuantilePlots class, implement generic coordinate handling
- All reviewer comments addressed
jehangirawan force-pushed from e8b91c7 to dcdc75b

Issue Number
closes #1674
Description
Implements a quantile-quantile (Q-Q) analysis metric for evaluating weather forecast model performance across the full data distribution, with emphasis on extreme values in the tails.
Key additions:

- qq_analysis metric in the evaluation pipeline

Motivation: Traditional metrics (RMSE, MAE) focus on central tendencies and may miss distributional biases in extremes. Q-Q analysis directly compares predicted vs. observed quantiles, making it ideal for detecting systematic over/under-prediction of extreme weather events.
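The core idea in a few lines (toy illustration with synthetic data, not the PR's implementation):

import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(0.0, 1.2, 10_000)  # toy forecast with inflated spread
obs = rng.normal(0.0, 1.0, 10_000)   # toy observations

levels = np.linspace(0.01, 0.99, 99)
pred_q = np.quantile(pred, levels)
obs_q = np.quantile(obs, levels)

# For a perfect forecast distribution pred_q == obs_q; here the deviation
# |pred_q - obs_q| grows toward the tails, flagging extreme-value bias.
deviation = np.abs(pred_q - obs_q)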
Usage: add the metric in config/evaluate/eval_config.yml:

evaluation:
  metrics: ["qq_analysis"]
Checklist before asking for review

- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- [x] I have documented my code and I have updated the docstrings.
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60