
Choose the most promising table-extraction library (and validate) #1

@jd-coderepos

Description


978-3-030-64452-9_35.pdf
deepseek-ocr.pdf

Goal: Based on the attached readings, shortlist and test multiple table-extraction libraries, then recommend one baseline for our thesis pipeline (PDF → tables → keep only “review-summary tables” with a Reference/citation column).
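To make the target concrete, here is a minimal sketch of that filter, assuming Camelot as a stand-in extractor and a crude header heuristic (`REF_HINTS`, `is_review_summary`, and `paper.pdf` are all placeholders, not decided choices):

```python
import camelot  # pip install "camelot-py[cv]"; stand-in extractor, not the final pick

# Hypothetical heuristic: a review-summary table is one whose header row
# contains a reference/citation-like column name.
REF_HINTS = ("reference", "citation", "cite", "source")

def is_review_summary(df) -> bool:
    """True if any header cell looks like a Reference/citation column."""
    # Camelot returns headerless DataFrames; row 0 usually holds the header.
    header = [str(cell).strip().lower() for cell in df.iloc[0]]
    return any(hint in cell for cell in header for hint in REF_HINTS)

tables = camelot.read_pdf("paper.pdf", pages="all")
review_tables = [t for t in tables if is_review_summary(t.df)]
```

Substring matching on header text is deliberately loose; the testing below should show whether it is good enough or needs per-library tuning.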

Read (attached):

  • Creating a Scholarly Knowledge Graph from Survey Article Tables (uses Tabula; its workflow assumes a Reference column).
  • DeepSeek-OCR (document parsing/OCR landscape; relevant for image-based / hard-to-extract tables).

Tasks

  • Select 3–5 candidate libraries (from the landscape in the attached papers / your quick scan) and list them in this issue.
  • Test at least 3 libraries on at least 2 PDFs:
    • P1: a review paper with ≥1 review-summary table (refs/citations in a column)
    • P2: a paper with tables but no review-summary table
  • For each (PDF × library) pair, save outputs (see the harness sketch after this list):
    • outputs/<pdf_name>/<library>/extracted_tables/* (CSV or equivalent)
    • outputs/<pdf_name>/<library>/notes.md (what worked / what broke)
  • Write a short comparison + recommendation in this issue:
    • Rank the tested libraries (1 = best)
    • Recommend one baseline + one runner-up
    • Keep it to 8–12 bullets total; focus on structure fidelity, multi-page tables, 2-column layout robustness, install friction, CPU/GPU needs, output quality (CSV/HTML/JSON), and ease of Python integration.
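
A possible shape for the test harness implied by the output layout above, with tabula-py, Camelot, and pdfplumber as illustrative candidates (the filenames in `PDFS` are hypothetical, and the wrapper functions are assumptions, not a fixed shortlist):

```python
import pathlib

import camelot
import pandas as pd
import pdfplumber
import tabula  # tabula-py

def run_tabula(pdf):
    # tabula.read_pdf returns a list of DataFrames, one per detected table
    return tabula.read_pdf(pdf, pages="all", multiple_tables=True)

def run_camelot(pdf):
    return [t.df for t in camelot.read_pdf(pdf, pages="all")]

def run_pdfplumber(pdf):
    dfs = []
    with pdfplumber.open(pdf) as doc:
        for page in doc.pages:
            for raw in page.extract_tables():  # list of rows (lists of cells)
                dfs.append(pd.DataFrame(raw[1:], columns=raw[0]))
    return dfs

LIBRARIES = {"tabula": run_tabula, "camelot": run_camelot, "pdfplumber": run_pdfplumber}
PDFS = ["P1_review_paper.pdf", "P2_no_summary_table.pdf"]  # hypothetical filenames

for pdf in PDFS:
    for name, extract in LIBRARIES.items():
        out = pathlib.Path("outputs") / pathlib.Path(pdf).stem / name / "extracted_tables"
        out.mkdir(parents=True, exist_ok=True)
        try:
            for i, df in enumerate(extract(pdf)):
                df.to_csv(out / f"table_{i}.csv", index=False)
        except Exception as exc:  # record what broke in notes.md, per the layout above
            (out.parent / "notes.md").write_text(f"FAILED: {exc}\n")
```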

Acceptance (done when)

  • 3–5 candidates listed; ≥3 tested
  • Outputs saved for P1 and P2 for each tested library
  • Ranked recommendation posted (baseline + runner-up) with key gotchas / next steps
