Attachments: 978-3-030-64452-9_35.pdf, deepseek-ocr.pdf
Goal: Based on the attached readings, shortlist and test multiple table-extraction libraries, then recommend one baseline for our thesis pipeline (PDF → tables → keep only “review-summary tables” with a Reference/citation column).
Read (attached):
- Creating a Scholarly Knowledge Graph from Survey Article Tables (they used Tabula; workflow assumes a Reference column).
- DeepSeek-OCR (document parsing/OCR landscape; relevant for image-based / hard-to-extract tables).
Tasks
- Select 3–5 candidate libraries (from the landscape in the attached papers / your quick scan) and list them in this issue.
- Test at least 3 libraries on 2 PDFs (minimum):
- P1: a review paper with ≥1 review-summary table (refs/citations in a column)
- P2: a paper with tables but no review-summary table
- For each (PDF × library), save outputs (see the sketch after this list):
  - `outputs/<pdf_name>/<library>/extracted_tables/*` (CSV or equivalent)
  - `outputs/<pdf_name>/<library>/notes.md` (what worked / what broke)
- Write a short comparison + recommendation in this issue:
- Rank the tested libraries (1 = best)
- Recommend one baseline + one runner-up
- Max 8–12 bullets total; focus on: structure fidelity, multi-page tables, 2-column layout robustness, install friction, CPU/GPU needs, output quality (CSV/HTML/JSON), ease of Python integration.
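
To keep per-library runs comparable, something like the minimal sketch below could drive each test and write into the output layout above. Camelot is used here purely as an example candidate; the function name and file naming are assumptions, and each tested library would get an equivalent wrapper writing to the same structure.

```python
# Minimal sketch, assuming camelot-py (pip install "camelot-py[cv]", needs Ghostscript)
# is one of the tested libraries. Other libraries would get their own wrapper
# writing to the same outputs/<pdf_name>/<library>/ structure.
from pathlib import Path

import camelot


def run_camelot(pdf_path: str, out_root: str = "outputs") -> None:
    pdf_name = Path(pdf_path).stem
    out_dir = Path(out_root) / pdf_name / "camelot" / "extracted_tables"
    out_dir.mkdir(parents=True, exist_ok=True)

    # "lattice" targets ruled tables; "stream" is the fallback for
    # whitespace-separated tables -- record which flavor was used in notes.md.
    tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
    for i, table in enumerate(tables):
        table.df.to_csv(out_dir / f"table_{i:02d}.csv", index=False)


if __name__ == "__main__":
    # Hypothetical filenames for the two test PDFs (P1 and P2).
    run_camelot("P1_review_paper.pdf")
    run_camelot("P2_non_review_paper.pdf")
```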
Acceptance (done when)
- 3–5 candidates listed; ≥3 tested
- Outputs saved for P1 and P2 for each tested library
- Ranked recommendation posted (baseline + runner-up) with key gotchas / next steps
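
For the "keep only review-summary tables" step in the Goal, a rough filter over the extracted tables could look like the sketch below. The header keywords and citation regex are assumptions to refine once real outputs from P1 and P2 are available.

```python
# Minimal sketch of a review-summary-table filter, assuming tables arrive as
# pandas DataFrames (e.g. read back from the CSVs saved above). Heuristics only:
# a column whose header mentions references, or whose cells look like citation
# markers such as "[12]" or "Smith et al. (2020)".
import re

import pandas as pd

HEADER_KEYWORDS = ("ref", "reference", "citation", "study", "source")  # assumption
CITATION_PATTERN = re.compile(r"\[\d+\]|\bet al\.|\(\d{4}\)")  # assumption


def has_reference_column(df: pd.DataFrame, min_hits: float = 0.5) -> bool:
    for col in df.columns:
        header = str(col).strip().lower()
        if any(key in header for key in HEADER_KEYWORDS):
            return True
        cells = df[col].astype(str)
        # Treat the column as a citation column if enough cells match the pattern.
        if len(cells) and cells.str.contains(CITATION_PATTERN).mean() >= min_hits:
            return True
    return False
```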