Summary
The current scoring approach for the bibliographic_data benchmark matches predicted entries to ground-truth entries by array position (entries[0] vs entries[0], entries[1] vs entries[1]). When a model inserts an extra entry or skips one, every subsequent entry is compared to the wrong ground truth, and all downstream field scores collapse. This makes scores on affected pages reflect alignment errors rather than extraction quality.
How the issue arises
scoring_helper.py's get_all_keys() generates paths with array indices (e.g., entries[6].title, entries[7].author[0].family). The scoring then retrieves the value at that same indexed path from the prediction. There is no step that matches predicted entries to ground-truth entries by identity before comparing fields.
This works well for benchmarks like Library Cards (one record per image — nothing to misalign). But for Bibliographic Data, where each page has 14–20 entries, a single insertion or skip at the start shifts every subsequent comparison.
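For concreteness, here is a minimal sketch of the positional scoring described above. It does not reproduce the actual get_all_keys() or score_request_answer() from scoring_helper.py; the helper names get_value_at_path() and fuzzy_score(), and the use of difflib as the fuzzy matcher, are illustrative assumptions.

```python
# Illustrative sketch of positional scoring -- not the actual scoring_helper.py code.
import re
from difflib import SequenceMatcher
from typing import Any


def get_all_keys(obj: Any, prefix: str = "") -> list[str]:
    """Flatten nested JSON into indexed key paths, e.g. 'entries[6].title'."""
    if isinstance(obj, dict):
        return [p for k, v in obj.items()
                for p in get_all_keys(v, f"{prefix}.{k}" if prefix else k)]
    if isinstance(obj, list):
        return [p for i, v in enumerate(obj)
                for p in get_all_keys(v, f"{prefix}[{i}]")]
    return [prefix]


def get_value_at_path(obj: Any, path: str) -> Any:
    """Follow an indexed key path like 'entries[1].title' into nested JSON."""
    for key, idx in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        if idx:
            obj = obj[int(idx)] if isinstance(obj, list) and int(idx) < len(obj) else None
        else:
            obj = obj.get(key) if isinstance(obj, dict) else None
        if obj is None:
            return None
    return obj


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in fuzzy string similarity (the benchmark's metric may differ)."""
    return SequenceMatcher(None, a, b).ratio()


def score_by_position(ground_truth: dict, prediction: dict) -> dict[str, float]:
    """Score each ground-truth path against the value at the SAME indexed path
    in the prediction: no identity matching of entries, so a single inserted
    or skipped entry shifts every later comparison."""
    return {
        path: fuzzy_score(str(get_value_at_path(ground_truth, path)),
                          str(get_value_at_path(prediction, path) or ""))
        for path in get_all_keys(ground_truth)
    }
```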
Quantified impact
We ran leave-one-out MIPROv2 optimization (Gemini 2.0 Flash) across all 5 pages as part of a DSPy optimization project for RISE benchmarks. The results show a stark bimodal distribution:
| Image | Avg fuzzy score |
|---|---|
| page_5 | 0.9111 |
| page_2 | 0.8980 |
| page_4 | 0.8895 |
| page_10 | 0.3936 |
| page_3 | 0.3923 |
Pages 3 and 10 score ~0.39 not because of poor field-level extraction, but because of cascading alignment errors:
Page 3: The model predicts id "16" where the ground truth expects id "15" at position entries[1]. From that point on, each ground-truth entry's title, author, journal, volume, and page fields are compared to the prediction for the following entry. The field-level data is largely correct; it is simply shifted by one slot.
Example of the cascade (from the LOO experiment results):
| Position | GT entry | Predicted entry | Title match |
|---|---|---|---|
| entries[1] | #15 "The Open Society and Its Enemies" | #16 "The Open Society - a Comment" | 0.70 |
| entries[2] | #16 "The Open Society – a comment" | #17 "The Open Society - a Rejoinder" | 0.76 |
| entries[3] | #17 "The Open Society – a Rejoinder" | #18 "The Open Society: a Reconsideration" | 0.80 |
| entries[4] | #18 "The Open Society: a Reconsideration" | #19 "Has History a Meaning?" | 0.39 |
Each row shows a predicted entry that is correct in isolation, but scored against the wrong ground-truth entry due to the positional shift.
Page 10: The model flattens entries 146–149 (which the ground truth nests under entry 145's related field, see #91 point 3) into top-level entries, causing the same downstream shift for all subsequent entries.
Suggestion
Since every entry has a unique id field, a pre-alignment step could match predicted entries to ground-truth entries by id before comparing fields. A minimal change to score_request_answer could reorder predicted entries to match the ground-truth order before flattening to key paths, eliminating cascading penalties while still scoring each field's content accuracy.
Based on field-level data from our experiments, we estimate this would raise pages 3 and 10 from ~0.39 to ~0.70–0.85, and the aggregate from ~0.70 to ~0.80–0.85 — making scores more informative about actual extraction quality for all models on the leaderboard.
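For illustration, a minimal sketch of such a pre-alignment step, reusing the illustrative score_by_position() from the sketch above. The function names and the handling of unmatched predicted entries are assumptions for this sketch, not a description of the existing score_request_answer.

```python
# Sketch of id-based pre-alignment applied before positional scoring.
# Assumes every entry carries a unique "id" field.
def align_entries_by_id(gt_entries: list[dict], pred_entries: list[dict]) -> list[dict]:
    """Reorder predicted entries so position i holds the prediction whose id
    matches gt_entries[i]; a missing id leaves an empty dict at that slot."""
    remaining = {e.get("id"): e for e in pred_entries}
    aligned = [remaining.pop(gt.get("id"), {}) for gt in gt_entries]
    # Predicted entries with no ground-truth counterpart are appended at the
    # end; since score_by_position() iterates ground-truth paths only, they
    # are effectively ignored here (one possible policy, not a claim about
    # the benchmark's intended behaviour).
    return aligned + list(remaining.values())


def score_with_alignment(ground_truth: dict, prediction: dict) -> dict[str, float]:
    """Positional scoring (as sketched above) on id-aligned entries."""
    realigned = dict(prediction)
    realigned["entries"] = align_entries_by_id(
        ground_truth.get("entries", []), prediction.get("entries", [])
    )
    return score_by_position(ground_truth, realigned)
```

On page 3, for example, this would move the predicted entry with id "16" back to position entries[2] before field comparison, so its title, author, and journal fields are scored against their own ground truth.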
Context
Identified while running DSPy optimization experiments on the bibliographic_data benchmark. See the Bibliographic Data results section of the project README for the full experiment write-up. Related: #91 (annotation inconsistencies in page_10.json).