
Choose the most promising table-extraction library (and validate) #1

@jd-coderepos

Description


978-3-030-64452-9_35.pdf
deepseek-ocr.pdf

Goal: Based on the attached readings, shortlist and test multiple table-extraction libraries, then recommend one baseline for our thesis pipeline (PDF → tables → keep only “review-summary tables” with a Reference/citation column).
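To make the target concrete, here is a minimal sketch of that filter, assuming Camelot as a stand-in extractor and a crude header heuristic (`REF_HINTS`, `is_review_summary`, and `paper.pdf` are all placeholders, not decided choices):

```python
import camelot  # pip install "camelot-py[cv]"; stand-in extractor, not the final pick

# Hypothetical heuristic: a review-summary table is one whose header row
# contains a reference/citation-like column name.
REF_HINTS = ("reference", "citation", "cite", "source")

def is_review_summary(df) -> bool:
    """True if any header cell looks like a Reference/citation column."""
    # Camelot returns headerless DataFrames; row 0 usually holds the header.
    header = [str(cell).strip().lower() for cell in df.iloc[0]]
    return any(hint in cell for cell in header for hint in REF_HINTS)

tables = camelot.read_pdf("paper.pdf", pages="all")
review_tables = [t for t in tables if is_review_summary(t.df)]
```

Substring matching on header text is deliberately loose; the testing below should show whether it is good enough or needs per-library tuning.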

Read (attached):

  • Creating a Scholarly Knowledge Graph from Survey Article Tables (uses Tabula; its workflow assumes a Reference column).
  • DeepSeek-OCR (document parsing/OCR landscape; relevant for image-based / hard-to-extract tables).

Tasks

  • Select 3–5 candidate libraries (from the landscape in the attached papers / your quick scan) and list them in this issue.
  • Test at least 3 libraries on at least 2 PDFs:
    • P1: a review paper with ≥1 review-summary table (refs/citations in a column)
    • P2: a paper with tables but no review-summary table
  • For each (PDF × library) pair, save outputs (see the harness sketch after this list):
    • outputs/<pdf_name>/<library>/extracted_tables/* (CSV or equivalent)
    • outputs/<pdf_name>/<library>/notes.md (what worked / what broke)
  • Write a short comparison + recommendation in this issue:
    • Rank the tested libraries (1 = best)
    • Recommend one baseline + one runner-up
    • Keep it to 8–12 bullets total; focus on structure fidelity, multi-page tables, 2-column layout robustness, install friction, CPU/GPU needs, output quality (CSV/HTML/JSON), and ease of Python integration.
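
A possible shape for the test harness implied by the output layout above, with tabula-py, Camelot, and pdfplumber as illustrative candidates (the filenames in `PDFS` are hypothetical, and the wrapper functions are assumptions, not a fixed shortlist):

```python
import pathlib

import camelot
import pandas as pd
import pdfplumber
import tabula  # tabula-py

def run_tabula(pdf):
    # tabula.read_pdf returns a list of DataFrames, one per detected table
    return tabula.read_pdf(pdf, pages="all", multiple_tables=True)

def run_camelot(pdf):
    return [t.df for t in camelot.read_pdf(pdf, pages="all")]

def run_pdfplumber(pdf):
    dfs = []
    with pdfplumber.open(pdf) as doc:
        for page in doc.pages:
            for raw in page.extract_tables():  # list of rows (lists of cells)
                dfs.append(pd.DataFrame(raw[1:], columns=raw[0]))
    return dfs

LIBRARIES = {"tabula": run_tabula, "camelot": run_camelot, "pdfplumber": run_pdfplumber}
PDFS = ["P1_review_paper.pdf", "P2_no_summary_table.pdf"]  # hypothetical filenames

for pdf in PDFS:
    for name, extract in LIBRARIES.items():
        out = pathlib.Path("outputs") / pathlib.Path(pdf).stem / name / "extracted_tables"
        out.mkdir(parents=True, exist_ok=True)
        try:
            for i, df in enumerate(extract(pdf)):
                df.to_csv(out / f"table_{i}.csv", index=False)
        except Exception as exc:  # record what broke in notes.md, per the layout above
            (out.parent / "notes.md").write_text(f"FAILED: {exc}\n")
```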

Acceptance (done when)

  • 3–5 candidates listed; ≥3 tested
  • Outputs saved for P1 and P2 for each tested library
  • Ranked recommendation posted (baseline + runner-up) with key gotchas / next steps
