Bibliographic data: annotation inconsistencies in page_10.json and page_2.json #91

@kintopp

I noticed a few annotation inconsistencies in the ground truth files, mostly in page_10.json. That file appears to follow CSL-JSON conventions (hyphenated keys, CSL type names), while the other pages and the Pydantic schema in dataclass.py follow the project's own convention (underscored keys, the EntryType enum values).

1. Hyphenated vs underscored keys (page_10)

page_10.json uses CSL-JSON hyphenated keys:

"publisher-place": "London",
"container-title": "..."

Pages 2–5 and the Entry model in dataclass.py use underscored keys instead:

"publisher_place": "London",
"container_title": "..."
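
If it helps, here is a minimal sketch of how the hyphenated keys could be normalized before scoring. The helper name normalize_keys is hypothetical and not part of the repository; it assumes the ground truth has already been loaded as plain dicts and lists.

```python
from typing import Any


def normalize_keys(obj: Any) -> Any:
    """Recursively replace hyphens with underscores in every dict key.

    Hypothetical helper, not part of the repository. Values are left untouched.
    """
    if isinstance(obj, dict):
        return {key.replace("-", "_"): normalize_keys(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(item) for item in obj]
    return obj


# The two keys quoted above from page_10.json (title value elided as in the file excerpt).
entry = {"publisher-place": "London", "container-title": "..."}
normalized = normalize_keys(entry)
print(normalized)
```

Fixing the keys directly in page_10.json is probably cleaner than normalizing at load time, but a check like this could also run in CI to catch future drift.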

2. Entry type values (page_10, page_2)

page_10.json uses CSL-JSON type values:

  • "article-journal" instead of "journal-article" (used in pages 2–5)
  • "chapter" (not present in the EntryType enum, which defines book, journal-article, other)

page_2.json also uses "review" as a type value, which is not in the EntryType enum either.
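
A quick way to surface these values is to check each type against the enum. The sketch below uses a stand-in EntryType with the three values listed above; the member names and the real definition in dataclass.py may differ. The alias table and the check_type helper are my own assumptions. Only the article-journal to journal-article mapping follows from the other pages; what to do with "chapter" and "review" is left open.

```python
from enum import Enum
from typing import Tuple


class EntryType(str, Enum):
    # Stand-in for the enum in dataclass.py; member names are assumed.
    BOOK = "book"
    JOURNAL_ARTICLE = "journal-article"
    OTHER = "other"


# CSL-JSON type name -> value used in pages 2-5 (only one mapping is known).
CSL_TYPE_ALIASES = {"article-journal": "journal-article"}


def check_type(value: str) -> Tuple[str, bool]:
    """Return the canonical type string and whether the enum accepts it."""
    canonical = CSL_TYPE_ALIASES.get(value, value)
    allowed = {member.value for member in EntryType}
    return canonical, canonical in allowed


for raw in ("article-journal", "chapter", "review"):
    print(raw, check_type(raw))
```

With this stand-in enum, "chapter" and "review" are reported as invalid rather than silently mapped, so the decision (extend the enum or relabel the entries) stays with the maintainers.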

3. Nested entries inside related field (page_10, entry 145)

Entry 145 in page_10.json is a conference proceedings that contains three sub-entries (146–148) as full objects nested inside its related field:

{
  "id": "145",
  "related": [
    {"id": "146", "type": "article-journal", "title": "Storia e materialismo storico", ...},
    {"id": "147", "type": "article-journal", "title": "Critica del giudizio storico", ...},
    {"id": "148", "type": "article-journal", "title": "Forza e spirito nella storia", ...}
  ]
}

This makes the ground truth structurally different from what models typically produce (a flat list of entries), and from how related is used in other pages (where it's either absent or contains simple ID references). It also means automated scoring with get_all_keys / get_nested_value traverses into these nested objects, creating key paths like entries[5].related[0].title that don't align with a flat prediction structure.
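
One possible fix is to hoist the nested objects to the top level and keep only their IDs in related. The sketch below illustrates the idea; flatten_related is a hypothetical helper, and it assumes related is a list whose items are either ID strings or full entry dicts with an "id" key. The sample data keeps only the fields shown above and omits the elided ones.

```python
from typing import Any, Dict, List


def flatten_related(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hoist entries nested inside `related` and replace them with ID references."""
    flat: List[Dict[str, Any]] = []
    for original in entries:
        entry = dict(original)  # shallow copy so the input is not modified
        refs: List[Any] = []
        hoisted: List[Dict[str, Any]] = []
        for item in entry.get("related") or []:
            if isinstance(item, dict):
                refs.append(item["id"])  # keep only the reference
                hoisted.append(item)     # the full object moves to the top level
            else:
                refs.append(item)        # already a simple ID reference
        if "related" in entry:
            entry["related"] = refs
        flat.append(entry)
        # Recurse in case a hoisted entry has nested related entries of its own.
        flat.extend(flatten_related(hoisted))
    return flat


nested = [
    {
        "id": "145",
        "related": [
            {"id": "146", "type": "article-journal", "title": "Storia e materialismo storico"},
            {"id": "147", "type": "article-journal", "title": "Critica del giudizio storico"},
            {"id": "148", "type": "article-journal", "title": "Forza e spirito nella storia"},
        ],
    }
]

flat = flatten_related(nested)
print([entry["id"] for entry in flat])
print(flat[0]["related"])
```

After flattening, the key paths for entries 146–148 would sit at the top level of the entries list, in line with a flat prediction structure.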
