feat: Integrate docling-hierarchical-pdf back into docling #2676
base: main
✅ DCO Check Passed
Thanks @krrome, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
add draft-style-based-inference attempt to fix metadata pytest failures
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix tests
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add missing docling/models/header_hierarchy/parsers.py
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

add tests for hierarchical headers
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>

fix integration of style based header hierarchy extend tests fix tests, finish first attempt to integrate hierarchical parsing
Signed-off-by: Roman Kayan BAZG <roman.kayan@bazg.admin.ch>
8554b70 to e373e4c
Secondly, OTSL has more inherent structure and a significantly restricted vocabulary size. This allows autoregressive models to perform better in the TED metric, but especially with regards to prediction accuracy of the table-cell bounding boxes (see Table 2). As shown in Figure 5, we observe that the OTSL drastically reduces the drift for table cell bounding boxes at high row count and in sparse tables. This leads to more accurate predictions and a significant reduction in post-processing complexity, which is an undesired necessity in HTML-based Im2Seq models. Significant novelty lies in OTSL syntactical rules, which are few, simple and always backwards looking. Each new token can be validated only by analyzing the sequence of previous tokens, without requiring the entire sequence to detect mistakes. This in return allows to perform structural error detection and correction on-the-fly during sequence generation.

## References
### References
This should not be happening, do you want me to fix it?
Hi all,
Thank you for reviewing. From my point of view this is no longer a draft. Unfortunately the changes required are quite substantial. I still have a few questions which I would like to ask before you start reviewing in detail:
Looking forward to your feedback.
Thanks,
This is still a draft with limited functionality (and failing tests), opened to gauge whether my approach to the integration is in line with what the docling team expects. I will keep extending the PR to full functionality, but I would like to receive feedback on the integration as early as possible.
Changes:
docling/models/header_hierarchy was added as a home for header level inference

Issue resolved by this Pull Request:
Resolves #2591, #652, #287, #1023, #2121 and maybe more.
Checklist: