feat: Integrate docling-hierarchical-pdf back into docling #2676

krrome · 2025-11-24T19:20:27Z

This is still a draft with limited functionality (and failing tests), opened to gauge whether my approach to the integration is in line with what the docling team expects. I will keep extending the PR to full functionality, but I would like to receive feedback on the integration as early as possible.

Changes:

The reading order model was extended to handle header hierarchies.
docling/models/header_hierarchy was added as the home for header-level inference.

Issue resolved by this Pull Request:
Resolves #2591, #652, #287, #1023, #2121 and maybe more.

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

github-actions · 2025-11-24T19:20:38Z

✅ DCO Check Passed

Thanks @krrome, all your commits are properly signed off. 🎉

mergify · 2025-11-24T19:21:01Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
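The title rule above can be checked locally before pushing. The following is a minimal sketch, assuming that the `~=` operator behaves like an ordinary regular-expression search. The helper name `is_conventional_title` is hypothetical and not part of mergify or docling.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper for illustration only.
    """
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    for title in (
        "feat: Integrate docling-hierarchical-pdf back into docling",
        "feat(models)!: change header levels",
        "Integrate docling-hierarchical-pdf back into docling",
    ):
        print(f"{is_conventional_title(title)!s:5}  {title}")
```

The optional `(scope)` and `!` parts are both accepted, but the type keyword must be followed directly by one of them or by the colon.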

add draft-style-based-inference attempt to fix metadata pytest failures Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix tests Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add missing docling/models/header_hierarchy/parsers.py Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add tests for hierarchical headers Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix integration of style based header hierarchy extend tests fix tests, finish first attempt to integrate hierarchical parsing Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

dosubot · 2025-12-05T10:31:42Z

Related Documentation

Checked 5 published document(s) in 1 knowledge base(s). No updates required.


krrome · 2025-12-05T10:40:29Z

tests/data/groundtruth/docling_v2/2305.03393v1.md

 Secondly, OTSL has more inherent structure and a significantly restricted vocabulary size. This allows autoregressive models to perform better in the TED metric, but especially with regards to prediction accuracy of the table-cell bounding boxes (see Table 2). As shown in Figure 5, we observe that the OTSL drastically reduces the drift for table cell bounding boxes at high row count and in sparse tables. This leads to more accurate predictions and a significant reduction in post-processing complexity, which is an undesired necessity in HTML-based Im2Seq models. Significant novelty lies in OTSL syntactical rules, which are few, simple and always backwards looking. Each new token can be validated only by analyzing the sequence of previous tokens, without requiring the entire sequence to detect mistakes. This in return allows to perform structural error detection and correction on-the-fly during sequence generation.

-## References
+### References


This should not be happening. Do you want me to fix it?

krrome · 2025-12-05T11:29:17Z

Hi all,

Thank you for reviewing. From my point of view this is no longer a draft. Unfortunately, the changes required are quite substantial.

I still have a few questions I would like to ask before you start reviewing in detail:

Is this PR too big for your taste to be merged? I ask because I have invested quite a lot of time in this integration, so if you think it is too much, I'll stop here and focus on something else :)
Some automated tests are failing on GitHub because the test job runs out of disk space. How can I fix that?
I haven't gotten around to adding metadata-TOC support for pdfium in my code.
Using metadata-based TOCs as input: when headings are all caps in the text but not in the TOC, they are not found. Do you want me to fix that?
When inferring the header hierarchy from numbered headers, errors might occur with bigger sections, like here: https://github.com/docling-project/docling/pull/2676/files#r2592219081
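
To make the last two questions concrete, here is a minimal sketch of both ideas: matching a heading against TOC titles regardless of case, and reading a level from a dotted number prefix. It is not the code in this PR. The names `normalize_title`, `find_in_toc` and `level_from_numbering` and the regular expression are assumptions made for illustration only.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: a leading dotted number such as "1", "2.3" or "4.1.2.",
# followed by whitespace and the heading text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def normalize_title(text: str) -> str:
    """Case-fold the text and collapse runs of whitespace."""
    return " ".join(text.casefold().split())


def find_in_toc(heading: str, toc_titles: Sequence[str]) -> Optional[int]:
    """Return the index of the first TOC title equal to the heading
    after normalization, or None if there is no such title."""
    wanted = normalize_title(heading)
    for index, title in enumerate(toc_titles):
        if normalize_title(title) == wanted:
            return index
    return None


def level_from_numbering(heading: str) -> Optional[int]:
    """Return the depth implied by a dotted number prefix
    ("2.3 Methods" gives 2), or None if there is no such prefix."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None
    return match.group(1).count(".") + 1


if __name__ == "__main__":
    toc = ["Abstract", "Introduction", "Related Work"]
    print(find_in_toc("INTRODUCTION", toc))
    print(level_from_numbering("4.1.2. Details"))
    print(level_from_numbering("References"))
```

In this sketch, any heading that merely starts with a number, such as "2025 Annual report", is read as a level 1 header, which is one way a purely numbering-based heuristic can misfire.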

Looking forward to your feedback.

Thanks,
Roman

codecov · 2025-12-05T12:02:44Z

Codecov Report

❌ Patch coverage is 90.24823% with 55 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/models/header_hierarchy/parsers.py	56.66%	26 Missing ⚠️
...g/models/header_hierarchy/style_based_hierarchy.py	93.07%	18 Missing ⚠️
...ling/models/header_hierarchy/metadata_hierarchy.py	96.55%	3 Missing ⚠️
docling/models/readingorder_model.py	94.82%	3 Missing ⚠️
docling/backend/pypdfium2_backend.py	33.33%	2 Missing ⚠️
...dels/header_hierarchy/types/hierarchical_header.py	96.15%	2 Missing ⚠️
...cling/models/header_hierarchy/hierarchy_builder.py	96.87%	1 Missing ⚠️
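
The figures in this report can be cross-checked with a few lines of arithmetic. This is a sketch that assumes patch coverage is defined as covered changed lines divided by all changed lines. The total number of changed lines is not stated in the report and is derived here from that assumption.

```python
# Missing-line counts per file, in the order of the table above.
missing_per_file = [26, 18, 3, 3, 2, 2, 1]
patch_coverage_percent = 90.24823  # as reported

total_missing = sum(missing_per_file)

# Assumption: coverage = covered / changed, hence changed = missing / (1 - coverage).
changed_lines = round(total_missing / (1 - patch_coverage_percent / 100))
covered_lines = changed_lines - total_missing
recomputed_percent = round(100 * covered_lines / changed_lines, 5)

if __name__ == "__main__":
    print(total_missing, changed_lines, covered_lines, recomputed_percent)
```

Under that assumption the per-file counts add up to the 55 missing lines in the headline, and the percentage corresponds to 509 covered lines out of 564 changed lines.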

📢 Thoughts on this report? Let us know!

PeterStaar-IBM requested review from PeterStaar-IBM, cau-git and dolfim-ibm November 25, 2025 04:35

krrome force-pushed the integrate-hierarchical-pdf branch from 8554b70 to e373e4c Compare December 5, 2025 10:27

krrome marked this pull request as ready for review December 5, 2025 10:30