Fast and efficient document layout detection and text extraction using YOLO object detection and multiple OCR backends.
## Quick Start

Copy-paste these lines to get started instantly:

```bash
git clone <your-repo-url>
cd ocr-api
conda env create -f environment.yml
conda activate ocr
# Create .env file with your GEMINI_API_KEY
make dev
```

Now open http://localhost:8000/docs and you're in 🚀
## Table of Contents

- Features
- Prerequisites
- Installation
- Usage
- Make Help
- API Endpoints
- Models
- Configuration
- Development
- Performance Considerations
- Notes
## Features

- Document layout detection using DocLayout-YOLO
- Text extraction from detected bounding boxes
- Multiple OCR backend support (PaddleOCR, Gemini Vision)
- Concurrent processing for improved performance
- RESTful API endpoints with automatic documentation
- Thread-safe operations for CV2 processing
## Prerequisites

Ensure your development environment is ready:
- Python 3.13.2
- Conda package manager
- Gemini API key (for vision-based OCR)
## Installation

```bash
make envcreate
```

Or manually:

```bash
conda env create -f environment.yml
conda activate ocr
```

Create a `.env` file in the project root:

```
GEMINI_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
GEMINI_API_KEY=your_gemini_api_key_here
```

## Usage

The Makefile provides the following targets:
```bash
make dev
```

This starts the server with auto-reload on http://0.0.0.0:8000.

```bash
make run
```

This starts the server on http://0.0.0.0:8080.
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
## Make Help

You can list all available commands with:

```bash
make help
```

Available commands:

```
run        Run application
dev        Start FastAPI with uvicorn in development mode
envcreate  Create conda environment from environment.yml
envupdate  Update conda environment from environment.yml
envexport  Export clean environment.yml
lint       Check code formatting
clean      Remove __pycache__
```

To update the conda environment from environment.yml:

```bash
make envupdate
```

To export the current environment to environment.yml:

```bash
make envexport
```

To clean up build artifacts:

```bash
make clean
```

This removes:

- `__pycache__` directories
- `.pytest_cache` directories
- Build artifacts
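The clean target presumably runs something along these lines (a hypothetical sketch; check the Makefile for the exact recipe):

```shell
# Remove Python bytecode caches and pytest caches recursively.
# -prune stops find from descending into directories it just matched.
find . -type d -name "__pycache__" -prune -exec rm -rf {} +
find . -type d -name ".pytest_cache" -prune -exec rm -rf {} +
```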
## API Endpoints

### POST /predict/content
Extracts all text content from a single image.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: `file` (image file)
Response:

```json
{
  "text": "extracted text content"
}
```

Example:

```bash
curl -X POST "http://localhost:8000/predict/content" \
  -F "file=@/path/to/image.png"
```

### POST /predict
Detects document layout bounding boxes and extracts text from each region.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: `files` (one or more image files)
Response:

```json
[
  [
    {
      "image_id": 0,
      "bounding_box": {
        "x1": 100,
        "y1": 200,
        "x2": 300,
        "y2": 400
      },
      "content": "extracted text from region"
    }
  ]
]
```

Example:

```bash
curl -X POST "http://localhost:8000/predict" \
  -F "files=@/path/to/image1.png" \
  -F "files=@/path/to/image2.png"
```

## Models

The project uses DocLayout-YOLO for document layout detection, automatically downloaded from HuggingFace:
- Repository: `juliozhao/DocLayout-YOLO-DocStructBench`
- Model: `doclayout_yolo_docstructbench_imgsz1024.pt`
### PaddleOCR

- Fast and efficient OCR
- Supports multiple languages
- No API key required
- Best for general use cases
### Gemini Vision

- Advanced vision-language model
- Requires Gemini API key
- Better accuracy for complex layouts
- Supports Vietnamese text extraction
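Since `GEMINI_BASE_URL` points at Gemini's OpenAI-compatible endpoint, the backend presumably issues a chat-completion request with the image inlined as a base64 data URL. A minimal sketch using only the standard library; the model name, prompt, and helper name here are assumptions, not the project's actual values:

```python
import json
import urllib.request


def build_vision_request(base_url: str, api_key: str, image_b64: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat request asking
    Gemini to extract text from a base64-encoded PNG."""
    payload = {
        "model": "gemini-2.0-flash",  # assumed model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
```

Sending the request with `urllib.request.urlopen` (or an OpenAI client pointed at the same base URL) would return the extracted text in the usual chat-completion response shape.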
## Configuration

The PaddleOCR pipeline is configured with:

- Document orientation classification: Disabled
- Document unwarping: Disabled
- Text line orientation: Disabled

The layout detection model runs with:

- Image size: 1024x1024
- Confidence threshold: 0.2
- IoU threshold: 0.3
- Maximum detections: 1000
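These detection settings map naturally onto keyword arguments of a YOLO-style `predict` call; the parameter names below follow the Ultralytics convention and are an assumption about this codebase:

```python
# Hypothetical: the detection parameters from the list above, collected
# in one place so they can be passed straight to model.predict(...).
DETECT_KWARGS = {
    "imgsz": 1024,    # inference image size
    "conf": 0.2,      # confidence threshold
    "iou": 0.3,       # IoU threshold for non-max suppression
    "max_det": 1000,  # maximum detections per image
}
# Usage (sketch): results = model.predict("page.png", **DETECT_KWARGS)
```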
The service uses ThreadPoolExecutor with a default of 5 workers for parallel processing of bounding boxes.
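A minimal sketch of that pattern, assuming a hypothetical `ocr_region` worker and the `cv_lock` mentioned in the notes below:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

cv_lock = Lock()  # hypothetical name mirroring the cv_lock in the notes


def ocr_region(region):
    # Placeholder worker: real code would crop the region with OpenCV
    # under cv_lock and hand the crop to the OCR backend.
    with cv_lock:
        return f"text-from-{region}"


def ocr_all(regions, max_workers=5):
    """Run OCR over every detected bounding box concurrently;
    pool.map preserves the input order of the regions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_region, regions))
```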
## Development

This project uses:
- Black for code formatting (line length: 88)
- isort for import sorting
- flake8 for linting
- mypy for type checking
Run all formatters:

```bash
make lint
```

To add dependencies:

- Install the package with pip or conda
- Export the environment:

```bash
make envexport
```

## Performance Considerations

- The API includes timing information in console output
- Concurrent processing is used for multiple bounding boxes
- Thread-safe operations with lock mechanisms for CV2 operations
- Temporary files are automatically cleaned up after processing
- `max_workers=5` for ThreadPoolExecutor (adjustable in `predict_service.py`)
## Notes

- The service uses temporary files for processing uploaded images
- All temporary files are automatically cleaned up after processing
- Thread locks (`cv_lock`) ensure thread-safe operations for OpenCV
- The API includes a retry mechanism for Gemini API calls (3 retries)
- JSON responses are validated and cleaned before returning
- Error handling is comprehensive with detailed logging
### Error Handling

The API includes:
- Validation errors for missing environment variables
- Exception handling for file processing
- Retry mechanism for vision model API calls (3 retries)
- JSON parsing validation
- Detailed error messages in console output
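The 3-retry behavior for vision model calls could look something like this (a hypothetical helper; the actual implementation may differ):

```python
import time


def call_with_retries(fn, retries=3, delay=1.0):
    """Call fn(), retrying up to `retries` times with a simple
    linear backoff; re-raise the last error if every attempt fails."""
    last_err = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as err:  # real code would catch API errors only
            last_err = err
            time.sleep(delay * (attempt + 1))
    raise last_err
```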