PDF Processing Improvements - Table Extraction #32

SeanClay10 · 2026-01-26T00:23:14Z

This PR extends the PDF preprocessing pipeline to support automatic table extraction in addition to text extraction. Tables are detected and extracted with PyMuPDF by default, with a camelot-py fallback for cases where PyMuPDF detects no tables. Extracted tables are saved as structured JSON alongside the extracted text.
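For readers who have not opened the diff, the flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the JSON layout (`index` / `rows`), and the rule "fall back when the primary extractor returns no tables" are assumptions made for the example. The only library calls used are PyMuPDF's `page.find_tables()` and `camelot.read_pdf()`.

```python
import json
from pathlib import Path
from typing import Callable, List

Table = List[List[str]]                      # one table = rows of cell strings
Extractor = Callable[[str], List[Table]]     # pdf path -> list of tables


def extract_with_pymupdf(pdf_path: str) -> List[Table]:
    """Primary extractor (hypothetical helper) built on PyMuPDF's find_tables()."""
    import fitz  # PyMuPDF; imported lazily so the module loads without it

    tables: List[Table] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                # extract() yields rows of cell text; empty cells can be None
                rows = [[cell or "" for cell in row] for row in table.extract()]
                tables.append(rows)
    return tables


def extract_with_camelot(pdf_path: str) -> List[Table]:
    """Fallback extractor (hypothetical helper) built on camelot-py."""
    import camelot  # imported lazily for the same reason as above

    found = camelot.read_pdf(pdf_path, pages="all")
    # Each camelot table exposes a pandas DataFrame as .df
    return [t.df.astype(str).values.tolist() for t in found]


def extract_tables(
    pdf_path: str,
    primary: Extractor = extract_with_pymupdf,
    fallback: Extractor = extract_with_camelot,
) -> List[Table]:
    """Try the primary extractor; use the fallback only if it finds nothing."""
    tables = primary(pdf_path)
    if tables:
        return tables
    return fallback(pdf_path)


def save_tables(tables: List[Table], out_path: str) -> None:
    """Write tables as structured JSON (the layout here is an assumption)."""
    payload = [{"index": i, "rows": rows} for i, rows in enumerate(tables)]
    Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Passing the extractors in as parameters keeps the fallback logic testable without real PDFs, for example `extract_tables("doc.pdf", primary=lambda p: [], fallback=my_stub)`.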

raymondcen

Approved, looks good.

Feat: PDF processing improvements - structured tables

daa0b66

SeanClay10 self-assigned this Jan 26, 2026

SeanClay10 changed the title ~~WIP: PDF Processing Improvements - Table Extraction~~ PDF Processing Improvements - Table Extraction Feb 2, 2026

SeanClay10 marked this pull request as ready for review February 2, 2026 01:07

SeanClay10 requested a review from raymondcen February 2, 2026 01:08

raymondcen approved these changes Feb 2, 2026


4 participants
