Prerequisites

streamlit run src/app.py

Modules are the building blocks of the pipeline. Each module should, in principle, perform exactly one task. The following modules currently exist:

PdfConverter: Converts PDF to jpg pages
- Input
  - output_folder: str
- Output: jpg images in output_folder
TableRotator: Rotates jpg images
- Input
  - output_folder: str
- Output: rotated jpg images in output_folder
TatrExtractor: Extracts the table structure from jpg images
- Input
  - output_folder: str
- Output: extracted table images in output_folder
ColumnExtractor: Tries to extract all columns from table images
- Input
  - minFoundColumns: int: Minimum number of columns that must be found for a page to count as "eligible"
  - try_experimental_unify: bool: Experimental feature. Tries to map columns to each other based on their extracted widths
- Output: list[ColumnExtractorResult]
  - columns_rgb: list[np.ndarray]: Cut-out RGB columns
  - columns_gray: list[np.ndarray]: Cut-out grayscale columns
  - split_widths: list[float]: Column widths
RowExtractor: Tries to extract all rows inside a column
- Output: list[RowExtractorResult]
  - columns: list[list[np.ndarray]]: Cut-out grayscale cells, indexed by column and then by row
CellDenoiser: Denoises each cell
- Output: list[CellDenoiserResult]
  - columns: list[list[np.ndarray]]: Denoised grayscale cells, indexed by column and then by row
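The "one module, one task" idea can be illustrated with a minimal sketch. It does not reproduce the project's actual code: the `Module` interface, the `run` method, the `Pipeline` class and the two toy modules (`SplitColumns`, `StripCells`) are all hypothetical names chosen for illustration, and they work on plain strings instead of images so that the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class Module(Protocol):
    """Hypothetical interface: each module performs exactly one task."""

    def run(self, data: Any) -> Any: ...


@dataclass
class SplitColumns:
    """Toy stand-in for a column-extraction step (not the real ColumnExtractor)."""

    min_found_columns: int = 2

    def run(self, data: list[str]) -> list[list[str]]:
        # Each "page" is a string whose columns are separated by "|".
        pages = [page.split("|") for page in data]
        # Keep only pages with enough columns to count as "eligible".
        return [cols for cols in pages if len(cols) >= self.min_found_columns]


@dataclass
class StripCells:
    """Toy stand-in for a per-cell clean-up step (not the real CellDenoiser)."""

    def run(self, data: list[list[str]]) -> list[list[str]]:
        return [[cell.strip() for cell in cols] for cols in data]


@dataclass
class Pipeline:
    """Hypothetical runner that chains modules in order."""

    modules: list[Module] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        # Feed the output of each module into the next one.
        for module in self.modules:
            data = module.run(data)
        return data


pipeline = Pipeline([SplitColumns(min_found_columns=2), StripCells()])
result = pipeline.run(["a | b | c", "no columns here", " x|y "])
print(result)
```

Because every step only sees the output of the previous one, a module can be swapped or removed without touching the others, which is the main benefit of keeping each module to a single task.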
