Modify the algorithm to switch to full OCR if a page contains only an image #3

@joanfabregat

Description

The goal is to support the edge case where a page's main text cannot be read by the fast OCR pipeline, but the page still contains some elements (graphs, tables, etc.) that the fast OCR pipeline can read.

The indexing approach should be configurable through an ocr_mode setting with four options:

  • fast => only the fast OCR pipeline
  • hybrid => the current hybrid pipeline
  • aggressive => the new pipeline, where full OCR is used if the page contains only non-textual elements
  • full => the pipeline where full OCR is always used for all pages
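
As a rough illustration of how the four modes could be dispatched, here is a minimal Python sketch. Everything in it apart from the ocr_mode values is an assumption made for the example: the `OcrMode`, `Element`, `TEXTUAL_KINDS` and `needs_full_ocr` names are hypothetical, as is the set of element kinds counted as textual. The issue does not spell out what the current hybrid pipeline does, so the sketch assumes it falls back to full OCR only when the fast pipeline finds nothing at all on the page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class OcrMode(str, Enum):
    """The four ocr_mode values proposed in this issue."""
    FAST = "fast"
    HYBRID = "hybrid"
    AGGRESSIVE = "aggressive"
    FULL = "full"


# Assumption: element kinds that count as the page's "main text".
TEXTUAL_KINDS = frozenset({"paragraph", "heading", "list_item"})


@dataclass(frozen=True)
class Element:
    """Hypothetical element extracted from a page by the fast pipeline."""
    kind: str  # e.g. "paragraph", "table", "graph"
    text: str = ""


def needs_full_ocr(mode: OcrMode, elements: Sequence[Element]) -> bool:
    """Decide whether a page should be re-processed with full OCR."""
    if mode is OcrMode.FULL:
        return True   # full OCR for every page
    if mode is OcrMode.FAST:
        return False  # never leave the fast pipeline
    if not elements:
        # Assumed hybrid behaviour: the fast pipeline found nothing.
        # Aggressive mode covers this case as well.
        return True
    if mode is OcrMode.AGGRESSIVE:
        # New behaviour: only non-textual elements were found.
        return not any(e.kind in TEXTUAL_KINDS for e in elements)
    return False  # hybrid: something was read, keep the fast result


# A page where only a table and a graph were readable.
page = [Element("table", "Q1 | Q2"), Element("graph")]
print(needs_full_ocr(OcrMode.HYBRID, page))      # False
print(needs_full_ocr(OcrMode.AGGRESSIVE, page))  # True
```

Under these assumptions, the edge case described above is exactly the page where hybrid and aggressive disagree: the fast pipeline returns a table and a graph, hybrid keeps that result, and aggressive triggers full OCR because no textual element was found.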
