Async document processing backend built with FastAPI, SQLAlchemy, MySQL, Redis, and Celery. Supports OCR jobs, caching, rate limiting, pagination, structured logging, and async unit tests using pytest

ChiggyJain/PythonDocumentProcessorOCR

Document Processing & OCR Pipeline (Async Backend)

A production-style, fully async backend service built using open-source technologies only.
This project demonstrates how to design and implement a scalable document processing system with background jobs, caching, rate limiting, pagination, structured logging, and proper testing.

This repository is intentionally focused on backend engineering practices (no UI, no paid services).


🚀 Project Overview

The system allows users to:

  1. Upload documents (images / simple PDFs)
  2. Process them asynchronously using a background worker
  3. Extract text using OCR
  4. Track job status in real time
  5. Fetch results efficiently using Redis caching

All APIs are non-blocking and built using async Python.
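The end-to-end flow above can be modeled with a short stdlib-only sketch. The real service routes uploads through FastAPI and hands processing to a Celery worker; here plain dicts stand in for MySQL/Redis and `asyncio.create_task` stands in for the Celery enqueue, so only the shape of the non-blocking flow is shown. All names are illustrative.

```python
import asyncio
import uuid

# In the real service these stores are MySQL and Redis; plain dicts
# stand in here so the flow can run without external services.
JOBS: dict[str, str] = {}
RESULTS: dict[str, str] = {}


async def process_document(job_id: str, content: bytes) -> None:
    """Background OCR step (a Celery task in the real project)."""
    JOBS[job_id] = "processing"
    await asyncio.sleep(0)          # stand-in for the OCR call
    RESULTS[job_id] = f"extracted text ({len(content)} bytes)"
    JOBS[job_id] = "completed"


async def upload_document(content: bytes) -> str:
    """Accept an upload and schedule processing without blocking."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = "pending"
    asyncio.create_task(process_document(job_id, content))
    return job_id


async def main() -> None:
    job_id = await upload_document(b"%PDF- fake document")
    print(JOBS[job_id])             # still pending right after upload
    await asyncio.sleep(0.01)       # let the background step finish
    print(JOBS[job_id], RESULTS[job_id])


if __name__ == "__main__":
    asyncio.run(main())
```

The key point is that `upload_document` returns a `job_id` immediately; the caller polls status instead of waiting for OCR to finish.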


🧱 Architecture (High Level)

  • FastAPI (Async) — REST API layer
  • SQLAlchemy Async ORM — Database access
  • MySQL — Primary persistent storage
  • Celery + Redis — Background job processing
  • Redis — Caching + rate limiting
  • Tesseract OCR — Text extraction (open source)
  • Docker + Docker Compose — Local orchestration
  • pytest — Async unit testing

✨ Implemented Features

✅ Core Functionality

  • Document upload API
  • Background OCR processing
  • Job status tracking
  • OCR result storage
  • Async I/O across API and DB layer

✅ Performance & Scalability

  • Redis caching for job status and results
  • Token-bucket rate limiting using Redis (per client)
  • Pagination and filtering for document listing APIs

✅ Reliability

  • Structured JSON logging for API and worker
  • Error handling for failed OCR jobs
  • Separation of concerns (API, services, workers)

✅ Testing

  • Async unit tests using pytest and pytest-asyncio
  • API tests for upload and job status
  • Redis behavior validation
  • Celery task enqueue mocked during tests

🗂 Project Structure

document-processor/
├── app/
│   ├── api/
│   │   └── v1/
│   │       ├── upload.py
│   │       ├── jobs.py
│   │       └── documents.py
│   ├── core/
│   │   ├── config.py
│   │   ├── redis_client.py
│   │   ├── ratelimit.py
│   │   └── logging_config.py
│   ├── db/
│   │   ├── base.py
│   │   ├── session.py
│   │   └── models.py
│   ├── services/
│   │   ├── storage.py
│   │   └── ocr.py
│   ├── workers/
│   │   ├── celery_app.py
│   │   └── tasks.py
│   └── main.py
├── tests/
│   ├── conftest.py
│   ├── test_upload_api.py
│   └── test_job_status_api.py
├── Dockerfile
├── worker.Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

🔌 API Endpoints

Upload Document

POST /v1/documents/upload-document
  • Upload a document for OCR processing
  • Returns a job_id for tracking

Get Job Status

GET /v1/jobs/{job_id}
  • Returns job status (pending, processing, completed, failed)
  • Uses Redis cache for fast reads

List Documents (Paginated & Filtered)

GET /v1/documents/list-document

Query Parameters:

  • limit (default: 10)
  • offset (default: 0)
  • filename
  • content_type
  • date_from
  • date_to
  • sort (created_at_asc / created_at_desc)
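How these parameters could combine is sketched below over an in-memory list. The real endpoint builds the equivalent async SQLAlchemy query with `WHERE`/`ORDER BY`/`LIMIT`/`OFFSET`; the document dict shape here is assumed for illustration.

```python
def list_documents(docs, *, limit=10, offset=0, filename=None,
                   content_type=None, date_from=None, date_to=None,
                   sort="created_at_desc"):
    """Apply the list endpoint's filters, sort, and pagination."""
    rows = docs
    if filename:
        rows = [d for d in rows if filename in d["filename"]]
    if content_type:
        rows = [d for d in rows if d["content_type"] == content_type]
    if date_from:
        rows = [d for d in rows if d["created_at"] >= date_from]
    if date_to:
        rows = [d for d in rows if d["created_at"] <= date_to]
    rows = sorted(rows, key=lambda d: d["created_at"],
                  reverse=(sort == "created_at_desc"))
    return rows[offset:offset + limit]
```

Filters narrow the set first, then sorting and the `offset`/`limit` window are applied, mirroring the order of operations in the SQL query.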

⚡ Rate Limiting

  • Implemented using Redis Token Bucket algorithm
  • Per-client (IP-based) throttling
  • Configurable capacity and refill rate via environment variables
  • Returns HTTP 429 Too Many Requests when the limit is exceeded
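The token-bucket algorithm itself is small; a pure-Python sketch is below. In the real project the bucket state lives in Redis (keyed per client IP) so limits are shared across API instances, and `capacity`/`refill_per_sec` come from environment variables. A depleted bucket maps to the HTTP 429 response.

```python
import time


class TokenBucket:
    """Pure-Python token bucket; the Redis-backed version is analogous."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full
        self.clock = clock              # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Because refill is computed lazily from the elapsed time, no background timer is needed; each request pays for its own bookkeeping.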

📊 Logging

  • Structured JSON logs
  • Separate service identifiers for:
    • API (service: api)
    • Worker (service: worker)
  • Request-level logging middleware
  • Logs are ready for centralized log systems (ELK, Loki, etc.)
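A minimal JSON formatter with a per-service identifier could look like the sketch below. The field names (`service`, `level`, `message`, `timestamp`) are illustrative; the project's `logging_config.py` may choose different keys.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object, tagged with the service name."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,        # "api" or "worker"
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        })


def make_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

One-object-per-line JSON output is what lets ELK or Loki ingest the logs without any parsing rules.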

🧪 Testing

Run tests locally:

pytest -v

What is tested:

  • Upload API behavior
  • Job status API (Redis cache + DB fallback)
  • Celery task enqueue (mocked)
  • Redis isolation per test

🐳 Running Locally

1. Create environment file

cp .env.example .env

2. Build and start services

docker-compose up --build

3. API will be available at

http://localhost:8000

Swagger UI:

http://localhost:8000/docs

🧠 Key Backend Concepts Demonstrated

  • Async API design
  • Background job orchestration
  • Cache-first read strategy
  • Token-bucket rate limiting
  • Structured logging
  • Clean architecture separation
  • Realistic testing strategy

📌 Notes

  • This project uses only free and open-source technologies
  • Designed as a backend-focused portfolio project
  • Authentication, database migrations, and CI are intentionally out of scope
