Description
I noticed a few annotation inconsistencies in the ground truth files — mostly in page_10.json, which appears to follow CSL-JSON conventions while the other pages (and the Pydantic schema in dataclass.py) follow a different convention.
1. Hyphenated vs underscored keys (page_10)
page_10.json uses CSL-JSON hyphenated keys:
"publisher-place": "London",
"container-title": "..."
While pages 2–5 and the Entry model in dataclass.py use underscored keys:
"publisher_place": "London",
"container_title": "..."
2. Entry type values (page_10, page_2)
page_10.json uses CSL-JSON type values:
- "article-journal" instead of "journal-article" (used in pages 2–5)
- "chapter" (not present in the EntryType enum, which defines book, journal-article, other)
page_2.json also uses "review" as a type value, which is not in the EntryType enum either.
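The key and type mismatches from points 1 and 2 could be detected mechanically. The following is a minimal sketch, not code from this repository. `normalize_entry`, `ALLOWED_TYPES`, and `CSL_TYPE_ALIASES` are hypothetical names. The allowed type set is copied from the enum values listed above, and the alias mapping assumes "article-journal" is meant to be "journal-article". It only inspects top-level keys of a single entry.

```python
from __future__ import annotations

from typing import Any

# Type values this issue lists for the EntryType enum.
ALLOWED_TYPES = {"book", "journal-article", "other"}

# Assumed mapping from CSL-JSON type names to the enum's names.
# Only the pair named in this issue is included; "chapter" and "review"
# have no obvious counterpart, so they are reported rather than rewritten.
CSL_TYPE_ALIASES = {"article-journal": "journal-article"}


def normalize_entry(entry: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a copy with underscored top-level keys and a list of problems."""
    problems: list[str] = []
    normalized: dict[str, Any] = {}

    for key, value in entry.items():
        new_key = key.replace("-", "_")
        if new_key != key:
            problems.append(f"hyphenated key: {key!r}")
        normalized[new_key] = value

    entry_type = normalized.get("type")
    if entry_type in CSL_TYPE_ALIASES:
        problems.append(f"CSL-JSON type: {entry_type!r}")
        normalized["type"] = CSL_TYPE_ALIASES[entry_type]
    elif entry_type is not None and entry_type not in ALLOWED_TYPES:
        problems.append(f"type not in enum: {entry_type!r}")

    return normalized, problems
```

Running this over every entry in each page file would give a per-page list of deviations. How "chapter" and "review" should be handled (new enum members or a mapping to "other") is left open here.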
3. Nested entries inside related field (page_10, entry 145)
Entry 145 in page_10.json is a conference proceedings that contains three sub-entries (146–148) as full objects nested inside its related field:
{
"id": "145",
"related": [
{"id": "146", "type": "article-journal", "title": "Storia e materialismo storico", ...},
{"id": "147", "type": "article-journal", "title": "Critica del giudizio storico", ...},
{"id": "148", "type": "article-journal", "title": "Forza e spirito nella storia", ...}
]
}
This makes the ground truth structurally different from what models typically produce (a flat list of entries), and from how related is used in other pages (where it's either absent or contains simple ID references). It also means automated scoring with get_all_keys / get_nested_value traverses into these nested objects, creating key paths like entries[5].related[0].title that don't align with a flat prediction structure.
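One possible way to bring entry 145 in line with the other pages is to hoist the nested objects into the top-level list and leave only ID references behind. The sketch below is an illustration under that assumption, not a proposal for the repository's actual code. `flatten_related` is a hypothetical helper, and it assumes that each nested object carries its own "id" and that related should hold plain ID values.

```python
from __future__ import annotations

from typing import Any


def flatten_related(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hoist entries nested inside "related" into a flat list.

    Each nested object is replaced by its "id" in the parent's "related"
    list and emitted as its own entry right after the parent.
    The input list and its dicts are not modified.
    """
    flat: list[dict[str, Any]] = []
    for entry in entries:
        parent = dict(entry)  # shallow copy so the input stays untouched
        children: list[dict[str, Any]] = []
        related = parent.get("related")
        if isinstance(related, list):
            refs = []
            for item in related:
                if isinstance(item, dict):
                    children.append(item)
                    refs.append(item.get("id"))
                else:
                    refs.append(item)  # already a simple ID reference
            parent["related"] = refs
        flat.append(parent)
        # Recurse in case a nested entry has nested entries of its own.
        flat.extend(flatten_related(children))
    return flat
```

With the ground truth flattened this way, key paths would take the form entries[i].title for every entry, which matches a flat prediction structure. Whether the parent should keep the ID references or drop related entirely is a separate decision.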