@emerose
Contributor

@emerose emerose commented Jan 29, 2026

Summary

The PDF specification (sec. 7.7.3.4, p. 80) allows /MediaBox to be inherited from any ancestor in the page tree, not just the immediate parent. The previous fix in v4.7.1 checked only the immediate parent, so it failed for PDFs where /MediaBox is defined at the grandparent level or higher (common in multi-level page trees).

Problem

Certain valid PDFs fail with:

RuntimeError: could not find the page-dimensions: {"/Type": "/Page"}

These PDFs define /MediaBox higher up in the page tree (e.g., on the root Pages node, which is the page's grandparent in the example below), not on individual pages or their immediate parents.

Example PDF Structure (EU-lex document)

Page (xref 1)
  └─ Parent: Pages node (no /MediaBox)
       └─ Parent: Root Pages node (/MediaBox [0 0 595 842]) ← Here!

Solution

Walk the parent chain iteratively, one ancestor at a time, until /MediaBox is found or the root is reached. The walk is capped at 10 levels to prevent infinite loops in malformed PDFs.

for(int depth = 0; depth < 10 && current.hasKey("/Parent"); depth++)
{
    QPDFObjectHandle parent = current.getKey("/Parent");
    if(parent.hasKey("/MediaBox"))
    {
        // Extract MediaBox from this ancestor
        ...
    }
    current = parent;
}
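
To make the bounded walk easier to reason about without a QPDF build, here is a minimal self-contained sketch of the same idea. `Node` and `find_inherited_media_box` are hypothetical names invented for illustration; they are not part of this PR or of QPDF, and the real code works on `QPDFObjectHandle` as shown above.

```cpp
#include <array>
#include <cassert>
#include <optional>

// Hypothetical stand-in for a page-tree node (the real code uses QPDFObjectHandle).
struct Node {
    const Node* parent = nullptr;                    // models the /Parent entry
    std::optional<std::array<double, 4>> media_box;  // models the /MediaBox entry
};

// Returns the nearest /MediaBox: the page's own box if present, otherwise the
// first one found among at most max_depth ancestors. The depth cap keeps the
// loop finite even if /Parent links form a cycle.
std::optional<std::array<double, 4>> find_inherited_media_box(const Node& page,
                                                              int max_depth = 10) {
    assert(max_depth >= 0);
    if (page.media_box) {
        return page.media_box;
    }
    const Node* current = &page;
    for (int depth = 0; depth < max_depth && current->parent != nullptr; ++depth) {
        const Node* parent = current->parent;
        if (parent->media_box) {
            return parent->media_box;  // stop at the nearest ancestor that defines it
        }
        current = parent;
    }
    return std::nullopt;  // nothing found within the depth limit
}
```

With a page whose grandparent carries `[0 0 595 842]`, the sketch returns that box; with `max_depth = 1` it gives up after the immediate parent, which mirrors the one-level check described for v4.7.1.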

Testing

  • All existing tests pass (16/16)
  • Added test case: deep-mediabox-inheritance.pdf (EU-lex document with MediaBox at grandparent level)
  • Tested with multiple real-world PDFs that previously failed

Fixes

Fixes #175

Signed-off-by: Sam Quigley sq@emerose.com

PDF specification (sec 7.7.3.4, p 80) allows MediaBox to be inherited
from any ancestor in the page tree, not just the immediate parent.
The previous fix in v4.7.1 only checked one level of parent, which
failed for PDFs where MediaBox is defined at the grandparent level
or higher.

This change:
- Walks the parent chain iteratively (up to 10 levels) to find
  inherited MediaBox
- Adds logging when MediaBox is found from an ancestor
- Updates the "defaulting to media-box" check to verify media_bbox
  has valid non-zero values
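
One way to read the "valid non-zero values" check is sketched below. The helper name `has_usable_extent` and the exact criterion (all coordinates finite, non-zero width and height) are assumptions made for illustration; the commit message does not spell out the actual condition.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical helper: treats a box [x0 y0 x1 y1] as usable only if every
// coordinate is finite and both the width and the height are non-zero.
bool has_usable_extent(const std::array<double, 4>& bbox) {
    for (double v : bbox) {
        if (!std::isfinite(v)) {
            return false;
        }
    }
    const double width = std::abs(bbox[2] - bbox[0]);
    const double height = std::abs(bbox[3] - bbox[1]);
    return width > 0.0 && height > 0.0;
}
```

Under this reading, an all-zero (never populated) box is rejected, while an A4-sized box such as `[0 0 595 842]` is accepted.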

Test case: deep-mediabox-inheritance.pdf (EU-lex document with
MediaBox at grandparent level)

Fixes docling-project#175

Signed-off-by: Sam Quigley <quigley@emerose.com>
@github-actions
Contributor

DCO Check Passed

Thanks @emerose, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Jan 29, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
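
For anyone who wants to check a title locally before pushing, the rule above can be approximated with `std::regex`. This is a sketch that assumes the `~=` operator matches with search semantics; the function name `is_conventional_title` and the sample titles mentioned below are made up for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Applies the pattern from the merge-protection rule to a PR title.
// Assumption: the rule only needs to match at the start of the title,
// so regex_search with the leading ^ anchor is used.
bool is_conventional_title(const std::string& title) {
    static const std::regex pattern(
        R"re(^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:)re");
    return std::regex_search(title, pattern);
}
```

A title such as `fix: handle inherited MediaBox` or `feat(parser)!: new option` passes, while `Fix mediabox` (wrong case, no colon) does not.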

@PeterStaar-IBM
Member

@emerose Love it! Thanks for the code + the test-case.

You also need to add the ground-truth files (otherwise they get created).

Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

Member

@PeterStaar-IBM PeterStaar-IBM left a comment

wonderful!

@PeterStaar-IBM PeterStaar-IBM merged commit bb0b4ef into docling-project:main Jan 30, 2026
31 checks passed
Development

Successfully merging this pull request may close these issues.

RuntimeError: could not find the page-dimensions on a PDF

3 participants