Skip to content

Conversation

@Ra5hidIslam
Copy link

@Ra5hidIslam Ra5hidIslam commented Dec 13, 2025

fix(backend): improve Excel table bounds detection and flatten merged cells

Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.

Key changes:

  • Region Growing Algorithm: Replaced the previous explicit boundary finding logic with a region-growing strategy that uses a
  • GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.
  • Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
    All cells are now forced to row_span=1 and col_span=1.
    Merged cell bodies (non-head cells) are explicitly filled with empty strings.
    This change is intended to prevent text duplication issues in downstream Markdown exports.
  • Refactoring: Removed the _find_table_bottom and _find_table_right helper methods as their logic is now integrated into the region expansion loop.

Issue resolved by this Pull Request: Resolves #834

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.
    msexcel_backend.py

@github-actions
Copy link
Contributor

github-actions bot commented Dec 13, 2025

DCO Check Passed

Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉

@mergify
Copy link

mergify bot commented Dec 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc

Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc

Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
@Ra5hidIslam Ra5hidIslam changed the title Made changes to the _find_table_bounds function. fix(backend): improve Excel table bounds detection and flatten merged cells Dec 13, 2025
@Michele-Zhu
Copy link

Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach.

I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables.

Here is the test case that shouldn't work (I haven't run your code, so I cannot guarantee it)
edge_cases.xlsx

@Ra5hidIslam
Copy link
Author

Ra5hidIslam commented Dec 15, 2025

Hi @Michele-Zhu I have run my code for that excel file and this is the output:
Screenshot 2025-12-15 at 11 34 44

Does it look fine or wrong to you?

@codecov
Copy link

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/backend/msexcel_backend.py 0.00% 43 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ceberam ceberam self-assigned this Dec 15, 2025
@Michele-Zhu
Copy link

@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets.
Since the table-bound scan starts from the topmost left cell of a table, you'll also need to grow the region on the left side.

P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option treat_singleton_as_text.

@Ra5hidIslam
Copy link
Author

Hi @Michele-Zhu ,

I have a few thoughts on the feedback:
Edge Case: I feel the edge case regarding connecting two sheets by text might be heading in the wrong direction. Is using attached_left for this type of connection considered an industry standard? It doesn't seem to align with typical use cases.

Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

msexcel_backend.py doesn’t parse complex Excel tables properly.

3 participants