
Bibliographic data: position-based scoring causes cascading alignment penalties #92

@kintopp

Description


Summary

The current scoring approach for the bibliographic_data benchmark matches predicted entries to ground-truth entries by array position (entries[0] vs entries[0], entries[1] vs entries[1]). When a model inserts an extra entry or skips one, every subsequent entry is compared to the wrong ground truth, and all downstream field scores collapse. This makes scores on affected pages reflect alignment errors rather than extraction quality.

How the issue arises

scoring_helper.py's get_all_keys() generates paths with array indices (e.g., entries[6].title, entries[7].author[0].family), and the scorer then retrieves the value at the same indexed path from the prediction. There is no step that matches predicted entries to ground-truth entries by identity before their fields are compared.

This works well for benchmarks like Library Cards (one record per image — nothing to misalign). But for Bibliographic Data, where each page has 14–20 entries, a single insertion or skip at the start shifts every subsequent comparison.
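To see the failure mode concretely, here is a minimal sketch using the titles from the page 3 example below. It is illustrative only: it mirrors the indexed-path comparison described above, not the actual code in scoring_helper.py.

```python
# Illustrative sketch of position-based scoring; not the real scorer.
from difflib import SequenceMatcher

ground_truth = [
    {"id": "15", "title": "The Open Society and Its Enemies"},
    {"id": "16", "title": "The Open Society - a Comment"},
    {"id": "17", "title": "The Open Society - a Rejoinder"},
]
# A prediction that skips entry 15: everything after the skip moves up a slot.
prediction = [
    {"id": "16", "title": "The Open Society - a Comment"},
    {"id": "17", "title": "The Open Society - a Rejoinder"},
]

for i, gt in enumerate(ground_truth):
    pred = prediction[i] if i < len(prediction) else {}
    score = SequenceMatcher(None, gt["title"], pred.get("title", "")).ratio()
    # Each entries[i].title is compared by index, so every predicted title
    # is scored against the wrong ground-truth entry, even though each one
    # is an exact copy of some ground-truth title.
    print(f"entries[{i}].title -> {score:.2f}")
```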

Quantified impact

We ran leave-one-out MIPROv2 optimization (Gemini 2.0 Flash) across all 5 pages as part of a DSPy optimization project for RISE benchmarks. The results show a stark bimodal distribution:

| Image   | Avg fuzzy |
|---------|-----------|
| page_5  | 0.9111    |
| page_2  | 0.8980    |
| page_4  | 0.8895    |
| page_10 | 0.3936    |
| page_3  | 0.3923    |

Pages 3 and 10 score ~0.39 not because of poor field-level extraction, but because of cascading alignment errors:

Page 3: The model predicts id "16" where the ground truth expects id "15" at position entries[1]. From that point on, every entry's title, author, journal, volume, and page are scored against the next entry's ground truth. The field-level data is largely correct — it is shifted by one slot.

Example of the cascade (from the LOO experiment results):

| Position   | GT entry                                  | Predicted entry                           | Title match |
|------------|-------------------------------------------|-------------------------------------------|-------------|
| entries[1] | #15 "The Open Society and Its Enemies"    | #16 "The Open Society - a Comment"        | 0.70        |
| entries[2] | #16 "The Open Society – a comment"        | #17 "The Open Society - a Rejoinder"      | 0.76        |
| entries[3] | #17 "The Open Society – a Rejoinder"      | #18 "The Open Society: a Reconsideration" | 0.80        |
| entries[4] | #18 "The Open Society: a Reconsideration" | #19 "Has History a Meaning?"              | 0.39        |

Each row shows a predicted entry that is correct in isolation, but scored against the wrong ground-truth entry due to the positional shift.

Page 10: The model flattens entries 146–149 into top-level entries, although the ground truth nests them under entry 145's related field (see #91, point 3); this causes the same downstream shift for every subsequent entry.
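Roughly, the shape mismatch looks like this (field values elided and shapes assumed for illustration; see #91 for the actual page_10.json annotations):

```python
# Assumed shapes, for illustration only.
ground_truth_entry = {
    "id": "145",
    "title": "...",
    "related": [                        # entries 146-149 are nested here
        {"id": "146", "title": "..."},
        {"id": "147", "title": "..."},
    ],
}
predicted_entries = [                   # the model emits them at top level,
    {"id": "145", "title": "..."},      # so every later positional
    {"id": "146", "title": "..."},      # comparison is shifted
    {"id": "147", "title": "..."},
]
```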

Suggestion

Since every entry has a unique id field, a pre-alignment step could match predicted entries to ground-truth entries by id before comparing fields. A minimal change to score_request_answer could reorder predicted entries to match the ground-truth order before flattening to key paths, eliminating cascading penalties while still scoring each field's content accuracy.
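A minimal sketch of such a pre-alignment step, assuming every entry carries a unique id (the helper name is hypothetical, and score_request_answer's internals may differ):

```python
def align_entries_by_id(gt_entries, pred_entries):
    """Reorder predicted entries to match the ground-truth order by id.

    Sketch only: a predicted entry whose id matches a ground-truth id is
    placed at that ground truth's position; unmatched positions get an
    empty dict, so genuine omissions still score zero on their fields
    without shifting their neighbours.
    """
    pred_by_id = {e.get("id"): e for e in pred_entries}
    return [pred_by_id.get(gt.get("id"), {}) for gt in gt_entries]
```

Called just before the existing flattening step, this would let get_all_keys() see position-aligned entries with no other change to the field-level comparison. Extra predicted entries whose ids appear nowhere in the ground truth simply drop out of the aligned list; whether to penalize those separately is a policy choice.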

Based on field-level data from our experiments, we estimate this would raise pages 3 and 10 from ~0.39 to ~0.70–0.85, and the aggregate from ~0.70 to ~0.80–0.85 — making scores more informative about actual extraction quality for all models on the leaderboard.

Context

Identified while running DSPy optimization experiments on the bibliographic_data benchmark. See the Bibliographic Data results section of the project README for the full experiment write-up. Related: #91 (annotation inconsistencies in page_10.json).
