Summary
The current scoring approach for the bibliographic_data benchmark matches predicted entries to ground-truth entries by array position (entries[0] vs entries[0], entries[1] vs entries[1]). When a model inserts an extra entry or skips one, every subsequent entry is compared to the wrong ground truth, and all downstream field scores collapse. This makes scores on affected pages reflect alignment errors rather than extraction quality.
How the issue arises
scoring_helper.py's get_all_keys() generates paths with array indices (e.g., entries[6].title, entries[7].author[0].family). The scoring then retrieves the value at that same indexed path from the prediction. There is no step that matches predicted entries to ground-truth entries by identity before comparing fields.
This works well for benchmarks like Library Cards (one record per image — nothing to misalign). But for Bibliographic Data, where each page has 14–20 entries, a single insertion or skip at the start shifts every subsequent comparison.
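For concreteness, here is a minimal sketch of the positional scoring described above. It does not reproduce the actual get_all_keys() or score_request_answer() from scoring_helper.py; the helper names get_value_at_path() and fuzzy_score(), and the use of difflib as the fuzzy matcher, are illustrative assumptions.

```python
# Illustrative sketch of positional scoring -- not the actual scoring_helper.py code.
import re
from difflib import SequenceMatcher
from typing import Any


def get_all_keys(obj: Any, prefix: str = "") -> list[str]:
    """Flatten nested JSON into indexed key paths, e.g. 'entries[6].title'."""
    if isinstance(obj, dict):
        return [p for k, v in obj.items()
                for p in get_all_keys(v, f"{prefix}.{k}" if prefix else k)]
    if isinstance(obj, list):
        return [p for i, v in enumerate(obj)
                for p in get_all_keys(v, f"{prefix}[{i}]")]
    return [prefix]


def get_value_at_path(obj: Any, path: str) -> Any:
    """Follow an indexed key path like 'entries[1].title' into nested JSON."""
    for key, idx in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        if idx:
            obj = obj[int(idx)] if isinstance(obj, list) and int(idx) < len(obj) else None
        else:
            obj = obj.get(key) if isinstance(obj, dict) else None
        if obj is None:
            return None
    return obj


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in fuzzy string similarity (the benchmark's metric may differ)."""
    return SequenceMatcher(None, a, b).ratio()


def score_by_position(ground_truth: dict, prediction: dict) -> dict[str, float]:
    """Score each ground-truth path against the value at the SAME indexed path
    in the prediction: no identity matching of entries, so a single inserted
    or skipped entry shifts every later comparison."""
    return {
        path: fuzzy_score(str(get_value_at_path(ground_truth, path)),
                          str(get_value_at_path(prediction, path) or ""))
        for path in get_all_keys(ground_truth)
    }
```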
Quantified impact
We ran leave-one-out MIPROv2 optimization (Gemini 2.0 Flash) across all 5 pages as part of a DSPy optimization project for RISE benchmarks. The results show a stark bimodal distribution:
| Image | Avg fuzzy score |
|---|---|
| page_5 | 0.9111 |
| page_2 | 0.8980 |
| page_4 | 0.8895 |
| page_10 | 0.3936 |
| page_3 | 0.3923 |
Pages 3 and 10 score ~0.39 not because of poor field-level extraction, but because of cascading alignment errors:
Page 3: The model predicts id "16" where the ground truth expects id "15" at position entries[1]. From that point on, each ground-truth entry's title, author, journal, volume, and page fields are compared to the prediction for the following entry. The field-level data is largely correct; it is simply shifted by one slot.
Example of the cascade (from the LOO experiment results):
| Position | GT entry | Predicted entry | Title match |
|---|---|---|---|
| entries[1] | #15 "The Open Society and Its Enemies" | #16 "The Open Society - a Comment" | 0.70 |
| entries[2] | #16 "The Open Society – a comment" | #17 "The Open Society - a Rejoinder" | 0.76 |
| entries[3] | #17 "The Open Society – a Rejoinder" | #18 "The Open Society: a Reconsideration" | 0.80 |
| entries[4] | #18 "The Open Society: a Reconsideration" | #19 "Has History a Meaning?" | 0.39 |
Each row shows a predicted entry that is correct in isolation, but scored against the wrong ground-truth entry due to the positional shift.
Page 10: The model flattens entries 146–149 (which the ground truth nests under entry 145's related field, see #91 point 3) into top-level entries, causing the same downstream shift for all subsequent entries.
Suggestion
Since every entry has a unique id field, a pre-alignment step could match predicted entries to ground-truth entries by id before comparing fields. A minimal change to score_request_answer could reorder predicted entries to match the ground-truth order before flattening to key paths, eliminating cascading penalties while still scoring each field's content accuracy.
Based on field-level data from our experiments, we estimate this would raise pages 3 and 10 from ~0.39 to ~0.70–0.85, and the aggregate from ~0.70 to ~0.80–0.85 — making scores more informative about actual extraction quality for all models on the leaderboard.
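For illustration, a minimal sketch of such a pre-alignment step, reusing the illustrative score_by_position() from the sketch above. The function names and the handling of unmatched predicted entries are assumptions for this sketch, not a description of the existing score_request_answer.

```python
# Sketch of id-based pre-alignment applied before positional scoring.
# Assumes every entry carries a unique "id" field.
def align_entries_by_id(gt_entries: list[dict], pred_entries: list[dict]) -> list[dict]:
    """Reorder predicted entries so position i holds the prediction whose id
    matches gt_entries[i]; a missing id leaves an empty dict at that slot."""
    remaining = {e.get("id"): e for e in pred_entries}
    aligned = [remaining.pop(gt.get("id"), {}) for gt in gt_entries]
    # Predicted entries with no ground-truth counterpart are appended at the
    # end; since score_by_position() iterates ground-truth paths only, they
    # are effectively ignored here (one possible policy, not a claim about
    # the benchmark's intended behaviour).
    return aligned + list(remaining.values())


def score_with_alignment(ground_truth: dict, prediction: dict) -> dict[str, float]:
    """Positional scoring (as sketched above) on id-aligned entries."""
    realigned = dict(prediction)
    realigned["entries"] = align_entries_by_id(
        ground_truth.get("entries", []), prediction.get("entries", [])
    )
    return score_by_position(ground_truth, realigned)
```

On page 3, for example, this would move the predicted entry with id "16" back to position entries[2] before field comparison, so its title, author, and journal fields are scored against their own ground truth.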
Context
Identified while running DSPy optimization experiments on the bibliographic_data benchmark. See the Bibliographic Data results section of the project README for the full experiment write-up. Related: #91 (annotation inconsistencies in page_10.json).