Fast and efficient document layout detection and text extraction using YOLO object detection and multiple OCR backends.
## Quick Start

Copy-paste these lines to get started instantly:

```bash
git clone <your-repo-url>
cd ocr-api
conda env create -f environment.yml
conda activate ocr
# Create .env file with your GEMINI_API_KEY
make dev
```

Now open http://localhost:8000/docs and you're in 🚀
## Table of Contents

- Features
- Prerequisites
- Installation
- Usage
- Make Help
- API Endpoints
- Models
- Configuration
- Development
- Performance Considerations
- Notes
## Features

- Document layout detection using DocLayout-YOLO
- Text extraction from detected bounding boxes
- Multiple OCR backend support (PaddleOCR, Gemini Vision)
- Concurrent processing for improved performance
- RESTful API endpoints with automatic documentation
- Thread-safe operations for CV2 processing
## Prerequisites

Ensure your development environment is ready:
- Python 3.13.2
- Conda package manager
- Gemini API key (for vision-based OCR)
## Installation

```bash
make envcreate
```

Or manually:

```bash
conda env create -f environment.yml
conda activate ocr
```

Create a `.env` file in the project root:

```
GEMINI_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
GEMINI_API_KEY=your_gemini_api_key_here
```

## Usage

The Makefile provides the following targets:
```bash
make dev
```

This starts the server with auto-reload on http://0.0.0.0:8000.

```bash
make run
```

This starts the server on http://0.0.0.0:8080.
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
## Make Help

You can list all available commands with:

```bash
make help
```

Available commands:

```
run        Run application
dev        Start FastAPI with uvicorn in development mode
envcreate  Create conda environment from environment.yml
envupdate  Update conda environment from environment.yml
envexport  Export clean environment.yml
lint       Check code formatting
clean      Remove __pycache__
```

To update the conda environment from environment.yml:

```bash
make envupdate
```

To export the current environment to environment.yml:

```bash
make envexport
```

To clean up build artifacts:

```bash
make clean
```

This removes:

- `__pycache__` directories
- `.pytest_cache` directories
- Build artifacts
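The clean target presumably runs something along these lines (a hypothetical sketch; check the Makefile for the exact recipe):

```shell
# Remove Python bytecode caches and pytest caches recursively.
# -prune stops find from descending into directories it just matched.
find . -type d -name "__pycache__" -prune -exec rm -rf {} +
find . -type d -name ".pytest_cache" -prune -exec rm -rf {} +
```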
## API Endpoints

### POST /predict/content
Extracts all text content from a single image.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: `file` (image file)
Response:

```json
{
  "text": "extracted text content"
}
```

Example:

```bash
curl -X POST "http://localhost:8000/predict/content" \
  -F "file=@/path/to/image.png"
```

### POST /predict
Detects document layout bounding boxes and extracts text from each region.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: `files` (one or more image files)
Response:

```json
[
  [
    {
      "image_id": 0,
      "bounding_box": {
        "x1": 100,
        "y1": 200,
        "x2": 300,
        "y2": 400
      },
      "content": "extracted text from region"
    }
  ]
]
```

Example:

```bash
curl -X POST "http://localhost:8000/predict" \
  -F "files=@/path/to/image1.png" \
  -F "files=@/path/to/image2.png"
```

## Models

The project uses DocLayout-YOLO for document layout detection, automatically downloaded from HuggingFace:
- Repository: `juliozhao/DocLayout-YOLO-DocStructBench`
- Model: `doclayout_yolo_docstructbench_imgsz1024.pt`
### PaddleOCR

- Fast and efficient OCR
- Supports multiple languages
- No API key required
- Best for general use cases
### Gemini Vision

- Advanced vision-language model
- Requires Gemini API key
- Better accuracy for complex layouts
- Supports Vietnamese text extraction
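Since `GEMINI_BASE_URL` points at Gemini's OpenAI-compatible endpoint, the backend presumably issues a chat-completion request with the image inlined as a base64 data URL. A minimal sketch using only the standard library; the model name, prompt, and helper name here are assumptions, not the project's actual values:

```python
import json
import urllib.request


def build_vision_request(base_url: str, api_key: str, image_b64: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat request asking
    Gemini to extract text from a base64-encoded PNG."""
    payload = {
        "model": "gemini-2.0-flash",  # assumed model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text from this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
```

Sending the request with `urllib.request.urlopen` (or an OpenAI client pointed at the same base URL) would return the extracted text in the usual chat-completion response shape.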
## Configuration

The PaddleOCR pipeline is configured with:

- Document orientation classification: Disabled
- Document unwarping: Disabled
- Text line orientation: Disabled

The layout detection model runs with:

- Image size: 1024x1024
- Confidence threshold: 0.2
- IoU threshold: 0.3
- Maximum detections: 1000
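These detection settings map naturally onto keyword arguments of a YOLO-style `predict` call; the parameter names below follow the Ultralytics convention and are an assumption about this codebase:

```python
# Hypothetical: the detection parameters from the list above, collected
# in one place so they can be passed straight to model.predict(...).
DETECT_KWARGS = {
    "imgsz": 1024,    # inference image size
    "conf": 0.2,      # confidence threshold
    "iou": 0.3,       # IoU threshold for non-max suppression
    "max_det": 1000,  # maximum detections per image
}
# Usage (sketch): results = model.predict("page.png", **DETECT_KWARGS)
```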
The service uses ThreadPoolExecutor with a default of 5 workers for parallel processing of bounding boxes.
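A minimal sketch of that pattern, assuming a hypothetical `ocr_region` worker and the `cv_lock` mentioned in the notes below:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

cv_lock = Lock()  # hypothetical name mirroring the cv_lock in the notes


def ocr_region(region):
    # Placeholder worker: real code would crop the region with OpenCV
    # under cv_lock and hand the crop to the OCR backend.
    with cv_lock:
        return f"text-from-{region}"


def ocr_all(regions, max_workers=5):
    """Run OCR over every detected bounding box concurrently;
    pool.map preserves the input order of the regions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_region, regions))
```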
## Development

This project uses:
- Black for code formatting (line length: 88)
- isort for import sorting
- flake8 for linting
- mypy for type checking
Run all formatters:

```bash
make lint
```

To add dependencies:

- Install the package with pip or conda
- Export the environment:

```bash
make envexport
```

## Performance Considerations

- The API includes timing information in console output
- Concurrent processing is used for multiple bounding boxes
- Thread-safe operations with lock mechanisms for CV2 operations
- Temporary files are automatically cleaned up after processing
- `max_workers=5` for ThreadPoolExecutor (adjustable in `predict_service.py`)
## Notes

- The service uses temporary files for processing uploaded images
- All temporary files are automatically cleaned up after processing
- Thread locks (`cv_lock`) ensure thread-safe operations for OpenCV
- The API includes a retry mechanism for Gemini API calls (3 retries)
- JSON responses are validated and cleaned before returning
- Error handling is comprehensive with detailed logging
### Error Handling

The API includes:
- Validation errors for missing environment variables
- Exception handling for file processing
- Retry mechanism for vision model API calls (3 retries)
- JSON parsing validation
- Detailed error messages in console output
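The 3-retry behavior for vision model calls could look something like this (a hypothetical helper; the actual implementation may differ):

```python
import time


def call_with_retries(fn, retries=3, delay=1.0):
    """Call fn(), retrying up to `retries` times with a simple
    linear backoff; re-raise the last error if every attempt fails."""
    last_err = None
    for attempt in range(retries):
        try:
            return fn()
        except Exception as err:  # real code would catch API errors only
            last_err = err
            time.sleep(delay * (attempt + 1))
    raise last_err
```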