@krrome krrome commented Nov 24, 2025

This is still a draft with limited functionality (and failing tests), opened to gauge whether my approach to the integration is in line with what the docling team has in mind. I will keep extending the PR to full functionality, but I would like to receive feedback on the integration as early as possible.

Changes:

  • The reading order model was extended to handle header hierarchies.
  • docling/models/header_hierarchy was added as the home for header-level inference.

Issue resolved by this Pull Request:
Resolves #2591, #652, #287, #1023, #2121 and maybe more.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

github-actions bot commented Nov 24, 2025

DCO Check Passed

Thanks @krrome, all your commits are properly signed off. 🎉

mergify bot commented Nov 24, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
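For anyone who wants to check a title locally before pushing, the rule above can be approximated with a few lines of Python. This is only a sketch: it assumes Python's `re` semantics are close enough to what Mergify applies, and the helper name `is_conventional` and the example titles are hypothetical.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    for title in ("feat: infer header hierarchy", "feat(pdf)!: change API", "add draft"):
        print(title, "->", is_conventional(title))
```

In this sketch a plain `type:` prefix, an optional `(scope)`, and an optional `!` are accepted; anything without one of the listed types at the start is rejected.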

add draft-style-based-inference

attempt to fix metadata pytest failures

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix tests

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add missing docling/models/header_hierarchy/parsers.py

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add tests for hierarchical headers

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix integration of style based header hierarchy

extend tests

fix tests, finish first attempt to integrate hierarchical parsing

Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
@krrome krrome force-pushed the integrate-hierarchical-pdf branch from 8554b70 to e373e4c Compare December 5, 2025 10:27
@krrome krrome marked this pull request as ready for review December 5, 2025 10:30

dosubot bot commented Dec 5, 2025

Related Documentation

Checked 5 published document(s) in 1 knowledge base(s). No updates required.

Secondly, OTSL has more inherent structure and a significantly restricted vocabulary size. This allows autoregressive models to perform better in the TED metric, but especially with regards to prediction accuracy of the table-cell bounding boxes (see Table 2). As shown in Figure 5, we observe that the OTSL drastically reduces the drift for table cell bounding boxes at high row count and in sparse tables. This leads to more accurate predictions and a significant reduction in post-processing complexity, which is an undesired necessity in HTML-based Im2Seq models. Significant novelty lies in OTSL syntactical rules, which are few, simple and always backwards looking. Each new token can be validated only by analyzing the sequence of previous tokens, without requiring the entire sequence to detect mistakes. This in return allows to perform structural error detection and correction on-the-fly during sequence generation.

## References
### References
Contributor Author

This should not be happening. Do you want me to fix it?

krrome commented Dec 5, 2025

Hi all,

Thank you for reviewing. From my point of view this is no longer a draft. Unfortunately, the changes required are quite substantial.

I still have a few questions which I would like to ask before you are off reviewing in detail:

  1. Is this PR too big for your taste to be merged? Reason: I have invested quite a lot of time in this integration, so if you think it is too much then I'll stop the efforts here and focus on something else :)
  2. Some automated tests are failing on GitHub because the test job runs out of disk space. How can I fix that?
  3. I haven't gotten around to adding metadata-TOC support for pdfium in my code.
  4. Using metadata-based TOCs as input: when headings are all caps in the text but not in the TOC, they are not found. Do you want me to fix that?
  5. When inferring the header hierarchy from numbered headers, errors might occur in bigger sections, like here: https://github.com/docling-project/docling/pull/2676/files#r2592219081
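To make points 4 and 5 concrete, here is a minimal, self-contained sketch of the two ideas. It is not the code in this PR: the function names, the numbering pattern, and the `toc` mapping (heading title to level) are all hypothetical and chosen only for illustration.

```python
import re
from typing import Mapping, Optional

# Hypothetical pattern: a leading dotted number such as "1", "2.3" or "4.1.2.",
# followed by whitespace and some heading text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def level_from_numbering(heading: str) -> Optional[int]:
    """Guess the header level from its numbering: "2.3 Methods" -> 2."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None
    # One more level than the number of dots inside the numbering.
    return match.group(1).count(".") + 1


def normalize(title: str) -> str:
    """Case-fold and collapse whitespace so "REFERENCES" equals "References"."""
    return " ".join(title.casefold().split())


def find_toc_level(heading: str, toc: Mapping[str, int]) -> Optional[int]:
    """Look up a heading in a TOC mapping, ignoring case and spacing."""
    lookup = {normalize(title): level for title, level in toc.items()}
    return lookup.get(normalize(heading))
```

A purely numbering-based heuristic like this can misfire: in this sketch, a heading such as "2025 Outlook" would be read as a level-1 numbered header. Whether case-insensitive matching is an acceptable fix for point 4 is a design choice I would leave to the maintainers.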

Looking forward to your feedback.

Thanks,
Roman

codecov bot commented Dec 5, 2025
