fix(backend): improve Excel table bounds detection and flatten merged cells #2778

Ra5hidIslam · 2025-12-13T08:19:44Z

fix(backend): improve Excel table bounds detection and flatten merged cells

Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.

Key changes:

Region Growing Algorithm: Replaced the previous explicit boundary finding logic with a region-growing strategy that uses a GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.
Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.
Refactoring: Removed the _find_table_bottom and _find_table_right helper methods as their logic is now integrated into the region expansion loop.
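
The two ideas above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the names `grow_bounds` and `flatten_merged`, the `(row, col)` cell sets, and the merged-range tuples are all hypothetical and are not the docling or openpyxl API. It also assumes that the region grows right and down from the anchor, and that "gap tolerance" means a filled cell may sit at most `GAP_TOLERANCE` rows or columns beyond the current edge.

```python
from typing import Dict, List, Set, Tuple

GAP_TOLERANCE = 3  # value quoted in the PR description

Cell = Tuple[int, int]               # (row, col), zero-based (hypothetical layout)
Bounds = Tuple[int, int, int, int]   # (min_row, min_col, max_row, max_col)


def grow_bounds(filled: Set[Cell], anchor: Cell, gap: int = GAP_TOLERANCE) -> Bounds:
    """Grow a bounding box right and down from `anchor` until nothing new joins."""
    min_row, min_col = anchor
    max_row, max_col = anchor
    changed = True
    while changed:
        changed = False
        for row, col in filled:
            # A filled cell joins if it is at most `gap` rows below or
            # `gap` columns to the right of the current box.
            in_rows = min_row <= row <= max_row + gap
            in_cols = min_col <= col <= max_col + gap
            if in_rows and in_cols and (row > max_row or col > max_col):
                max_row = max(max_row, row)
                max_col = max(max_col, col)
                changed = True
    return min_row, min_col, max_row, max_col


def flatten_merged(
    values: Dict[Cell, str],
    merged: List[Bounds],
    bounds: Bounds,
) -> List[List[str]]:
    """Build a grid of 1x1 slots; only the head (top-left) cell of a merged range keeps its text."""
    min_row, min_col, max_row, max_col = bounds
    grid = [
        [values.get((r, c), "") for c in range(min_col, max_col + 1)]
        for r in range(min_row, max_row + 1)
    ]
    for r0, c0, r1, c1 in merged:
        for r in range(max(r0, min_row), min(r1, max_row) + 1):
            for c in range(max(c0, min_col), min(c1, max_col) + 1):
                if (r, c) != (r0, c0):
                    grid[r - min_row][c - min_col] = ""  # blank the merged body
    return grid


# Usage: the cell at row 4 is within the gap and joins; the one at row 9 does not.
filled = {(0, 0), (0, 1), (1, 0), (4, 1), (9, 0)}
bounds = grow_bounds(filled, anchor=(0, 0))

# Usage: a header merged across two columns keeps its text only in the head cell.
values = {(0, 0): "Header", (0, 1): "Header", (1, 0): "a", (1, 1): "b"}
grid = flatten_merged(values, merged=[(0, 0, 0, 1)], bounds=(0, 0, 1, 1))
```

With every slot reduced to a single cell and merged bodies blanked, a row-by-row Markdown export would emit the header text once instead of once per spanned column.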

Issue resolved by this Pull Request: Resolves #834

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.
msexcel_backend.py

github-actions · 2025-12-13T08:19:56Z

✅ DCO Check Passed

Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉

mergify · 2025-12-13T08:20:19Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
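
As a quick local check, the pattern above can be tried against a title with Python's `re` module. This is only a sketch: it assumes the rule behaves like an ordinary regular-expression search on the title, and the helper name `is_conventional` is made up for illustration. The second title is the original PR title recorded in this thread before it was renamed.

```python
import re

# Pattern copied verbatim from the merge-protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


new_title = "fix(backend): improve Excel table bounds detection and flatten merged cells"
old_title = "Made changes to the _find_table_bounds function."

print(is_conventional(new_title))  # True
print(is_conventional(old_title))  # False
```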

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

Michele-Zhu · 2025-12-14T22:04:33Z

Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach.

I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables.

Here is a test case that I expect to fail (I haven't run your code, so I cannot guarantee it):
edge_cases.xlsx

Ra5hidIslam · 2025-12-15T06:08:25Z

Hi @Michele-Zhu, I ran my code on that Excel file and this is the output:

Does it look fine or wrong to you?

codecov · 2025-12-15T08:13:14Z

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/msexcel_backend.py	0.00%	43 Missing ⚠️

📢 Thoughts on this report? Let us know!

Michele-Zhu · 2025-12-16T15:58:19Z

@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets.
Since the table-bound scan starts from the top-left cell of a table, you'll also need to grow the region on the left side.

P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option treat_singleton_as_text.
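
The point about growing to the left can be illustrated with a small sketch. It is not the PR's code: `grow_bounds_all_sides` and the `(row, col)` cell set are hypothetical, and it assumes the anchor is found by a row-major scan, so the anchor is the first filled cell of the top row and a lower row may start further to the left.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col), zero-based (hypothetical layout)


def grow_bounds_all_sides(
    filled: Set[Cell], anchor: Cell, gap: int = 3
) -> Tuple[int, int, int, int]:
    """Grow a bounding box in all four directions until no nearby cell is left out."""
    min_row, min_col = anchor
    max_row, max_col = anchor
    changed = True
    while changed:
        changed = False
        for row, col in filled:
            # "near" means within `gap` rows/columns of the box on any side.
            near = (min_row - gap <= row <= max_row + gap
                    and min_col - gap <= col <= max_col + gap)
            inside = (min_row <= row <= max_row
                      and min_col <= col <= max_col)
            if near and not inside:
                min_row, max_row = min(min_row, row), max(max_row, row)
                min_col, max_col = min(min_col, col), max(max_col, col)
                changed = True
    return min_row, min_col, max_row, max_col


# The top row starts at column 2, but the second row starts at column 0.
filled = {(0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)}
anchor = min(filled)  # row-major first filled cell: (0, 2)
bounds = grow_bounds_all_sides(filled, anchor)
print(bounds)  # (0, 0, 1, 3)
```

A box that only grows right and down from `(0, 2)` would keep `min_col == 2` and leave columns 0 and 1 of the second row outside the table.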

Ra5hidIslam · 2025-12-22T10:01:23Z

Hi @Michele-Zhu ,

I have a few thoughts on the feedback:
Edge Case: I feel the edge case regarding connecting two sheets by text might be heading in the wrong direction. Is using attached_left for this type of connection considered an industry standard? It doesn't seem to align with typical use cases.

Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is.

made changed to the _find_table_bounds function

f282abc

Ra5hidIslam added 2 commits December 13, 2025 13:51

DCO Remediation Commit for Rashidul Islam <rasidulislam71@gmail.com>

200e396

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

DCO Remediation Commit for Rashidul Islam <rasidulislam71@gmail.com>

27cd8a9

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

Ra5hidIslam changed the title ~~Made changes to the _find_table_bounds function.~~ fix(backend): improve Excel table bounds detection and flatten merged cells Dec 13, 2025

Michele-Zhu mentioned this pull request Dec 14, 2025

fix(excel): merge two separated tables #2626 #2741

Open

3 tasks

ceberam self-assigned this Dec 15, 2025
