@SeanClay10 commented Jan 26, 2026

This PR extends the PDF preprocessing pipeline to support automatic table extraction in addition to text extraction. Tables are detected and extracted using PyMuPDF by default, with a camelot-py fallback for cases where PyMuPDF fails to detect tables. Extracted tables are saved as structured JSON alongside the extracted text.
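
For reviewers who have not opened the diff, the flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names (`extract_with_pymupdf`, `extract_with_camelot`, `extract_tables`, `tables_to_json`) and the JSON layout are hypothetical, and it assumes a PyMuPDF version that provides `Page.find_tables()` (1.23 or later) and that "fails to detect tables" means the primary extractor returned an empty list.

```python
import json
from typing import Callable, List, Optional

# One table is a list of rows; each row is a list of cell values (cells may be None).
Table = List[List[Optional[str]]]
Extractor = Callable[[str], List[Table]]


def extract_with_pymupdf(pdf_path: str) -> List[Table]:
    """Primary extractor: PyMuPDF's built-in table finder."""
    import fitz  # PyMuPDF; imported lazily so the fallback logic is usable without it

    tables: List[Table] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                tables.append(table.extract())  # rows as lists of cell strings
    return tables


def extract_with_camelot(pdf_path: str) -> List[Table]:
    """Fallback extractor: camelot-py over all pages."""
    import camelot  # imported lazily for the same reason as above

    found = camelot.read_pdf(pdf_path, pages="all")
    return [t.df.values.tolist() for t in found]


def extract_tables(
    pdf_path: str,
    primary: Extractor = extract_with_pymupdf,
    fallback: Extractor = extract_with_camelot,
) -> List[Table]:
    """Try the primary extractor; use the fallback only if nothing was detected."""
    tables = primary(pdf_path)
    if not tables:
        tables = fallback(pdf_path)
    return tables


def tables_to_json(tables: List[Table]) -> str:
    """Serialize tables to JSON (hypothetical layout: one object per table)."""
    payload = [{"index": i, "rows": rows} for i, rows in enumerate(tables)]
    return json.dumps(payload, indent=2)
```

Passing the extractors as parameters keeps the fallback decision testable without a real PDF; the resulting string can be written next to the extracted text file with, for example, `Path(out_path).write_text(tables_to_json(tables), encoding="utf-8")`.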

@SeanClay10 self-assigned this Jan 26, 2026
@SeanClay10 changed the title from "WIP: PDF Processing Improvements - Table Extraction" to "PDF Processing Improvements - Table Extraction" Feb 2, 2026
@SeanClay10 marked this pull request as ready for review February 2, 2026 01:07
@SeanClay10 requested a review from raymondcen February 2, 2026 01:08

@raymondcen left a comment

Approved, looks good.
