-
Notifications
You must be signed in to change notification settings - Fork 3.3k
fix(backend): improve Excel table bounds detection and flatten merged cells #2778
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix(backend): improve Excel table bounds detection and flatten merged cells #2778
Conversation
|
✅ DCO Check Passed Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
|
Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach. I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables. Here is the test case that shouldn't work (I haven't run your code, so I cannot guarantee it) |
|
Hi @Michele-Zhu I have run my code for that excel file and this is the output: Does it look fine or wrong to you? |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets. P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option |
|
Hi @Michele-Zhu , I have a few thoughts on the feedback: Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is. |

fix(backend): improve Excel table bounds detection and flatten merged cells
Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.
Key changes:
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.
Issue resolved by this Pull Request: Resolves #834
Checklist:
msexcel_backend.py