Async document processing backend built with FastAPI, SQLAlchemy, MySQL, Redis, and Celery. Supports OCR jobs, caching, rate limiting, pagination, structured logging, and async unit tests using pytest

ChiggyJain/PythonDocumentProcessorOCR

Document Processing & OCR Pipeline (Async Backend)

A production-style, fully async backend service built using open-source technologies only.
This project demonstrates how to design and implement a scalable document processing system with background jobs, caching, rate limiting, pagination, structured logging, and proper testing.

This repository is intentionally focused on backend engineering practices (no UI, no paid services).


🚀 Project Overview

The system allows users to:

  1. Upload documents (images / simple PDFs)
  2. Process them asynchronously using a background worker
  3. Extract text using OCR
  4. Track job status in real time
  5. Fetch results efficiently using Redis caching

All APIs are non-blocking and built using async Python.
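The end-to-end flow above can be modeled with a short stdlib-only sketch. The real service routes uploads through FastAPI and hands processing to a Celery worker; here plain dicts stand in for MySQL/Redis and `asyncio.create_task` stands in for the Celery enqueue, so only the shape of the non-blocking flow is shown. All names are illustrative.

```python
import asyncio
import uuid

# In the real service these stores are MySQL and Redis; plain dicts
# stand in here so the flow can run without external services.
JOBS: dict[str, str] = {}
RESULTS: dict[str, str] = {}


async def process_document(job_id: str, content: bytes) -> None:
    """Background OCR step (a Celery task in the real project)."""
    JOBS[job_id] = "processing"
    await asyncio.sleep(0)          # stand-in for the OCR call
    RESULTS[job_id] = f"extracted text ({len(content)} bytes)"
    JOBS[job_id] = "completed"


async def upload_document(content: bytes) -> str:
    """Accept an upload and schedule processing without blocking."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = "pending"
    asyncio.create_task(process_document(job_id, content))
    return job_id


async def main() -> None:
    job_id = await upload_document(b"%PDF- fake document")
    print(JOBS[job_id])             # still pending right after upload
    await asyncio.sleep(0.01)       # let the background step finish
    print(JOBS[job_id], RESULTS[job_id])


if __name__ == "__main__":
    asyncio.run(main())
```

The key point is that `upload_document` returns a `job_id` immediately; the caller polls status instead of waiting for OCR to finish.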


🧱 Architecture (High Level)

  • FastAPI (Async) — REST API layer
  • SQLAlchemy Async ORM — Database access
  • MySQL — Primary persistent storage
  • Celery + Redis — Background job processing
  • Redis — Caching + rate limiting
  • Tesseract OCR — Text extraction (open source)
  • Docker + Docker Compose — Local orchestration
  • pytest — Async unit testing

✨ Implemented Features

✅ Core Functionality

  • Document upload API
  • Background OCR processing
  • Job status tracking
  • OCR result storage
  • Async I/O across API and DB layer

✅ Performance & Scalability

  • Redis caching for job status and results
  • Token-bucket rate limiting using Redis (per client)
  • Pagination and filtering for document listing APIs

✅ Reliability

  • Structured JSON logging for API and worker
  • Error handling for failed OCR jobs
  • Separation of concerns (API, services, workers)

✅ Testing

  • Async unit tests using pytest and pytest-asyncio
  • API tests for upload and job status
  • Redis behavior validation
  • Celery task enqueue mocked during tests

🗂 Project Structure

document-processor/
├── app/
│   ├── api/
│   │   └── v1/
│   │       ├── upload.py
│   │       ├── jobs.py
│   │       └── documents.py
│   ├── core/
│   │   ├── config.py
│   │   ├── redis_client.py
│   │   ├── ratelimit.py
│   │   └── logging_config.py
│   ├── db/
│   │   ├── base.py
│   │   ├── session.py
│   │   └── models.py
│   ├── services/
│   │   ├── storage.py
│   │   └── ocr.py
│   ├── workers/
│   │   ├── celery_app.py
│   │   └── tasks.py
│   └── main.py
├── tests/
│   ├── conftest.py
│   ├── test_upload_api.py
│   └── test_job_status_api.py
├── Dockerfile
├── worker.Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md

🔌 API Endpoints

Upload Document

POST /v1/documents/upload-document
  • Upload a document for OCR processing
  • Returns a job_id for tracking

Get Job Status

GET /v1/jobs/{job_id}
  • Returns job status (pending, processing, completed, failed)
  • Uses Redis cache for fast reads

List Documents (Paginated & Filtered)

GET /v1/documents/list-document

Query Parameters:

  • limit (default: 10)
  • offset (default: 0)
  • filename
  • content_type
  • date_from
  • date_to
  • sort (created_at_asc / created_at_desc)
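How these parameters could combine is sketched below over an in-memory list. The real endpoint builds the equivalent async SQLAlchemy query with `WHERE`/`ORDER BY`/`LIMIT`/`OFFSET`; the document dict shape here is assumed for illustration.

```python
def list_documents(docs, *, limit=10, offset=0, filename=None,
                   content_type=None, date_from=None, date_to=None,
                   sort="created_at_desc"):
    """Apply the list endpoint's filters, sort, and pagination."""
    rows = docs
    if filename:
        rows = [d for d in rows if filename in d["filename"]]
    if content_type:
        rows = [d for d in rows if d["content_type"] == content_type]
    if date_from:
        rows = [d for d in rows if d["created_at"] >= date_from]
    if date_to:
        rows = [d for d in rows if d["created_at"] <= date_to]
    rows = sorted(rows, key=lambda d: d["created_at"],
                  reverse=(sort == "created_at_desc"))
    return rows[offset:offset + limit]
```

Filters narrow the set first, then sorting and the `offset`/`limit` window are applied, mirroring the order of operations in the SQL query.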

⚡ Rate Limiting

  • Implemented using Redis Token Bucket algorithm
  • Per-client (IP-based) throttling
  • Configurable capacity and refill rate via environment variables
  • Returns HTTP 429 Too Many Requests when the limit is exceeded
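The token-bucket algorithm itself is small; a pure-Python sketch is below. In the real project the bucket state lives in Redis (keyed per client IP) so limits are shared across API instances, and `capacity`/`refill_per_sec` come from environment variables. A depleted bucket maps to the HTTP 429 response.

```python
import time


class TokenBucket:
    """Pure-Python token bucket; the Redis-backed version is analogous."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full
        self.clock = clock              # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Because refill is computed lazily from the elapsed time, no background timer is needed; each request pays for its own bookkeeping.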

📊 Logging

  • Structured JSON logs
  • Separate service identifiers for:
    • API (service: api)
    • Worker (service: worker)
  • Request-level logging middleware
  • Logs are ready for centralized log systems (ELK, Loki, etc.)
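A minimal JSON formatter with a per-service identifier could look like the sketch below. The field names (`service`, `level`, `message`, `timestamp`) are illustrative; the project's `logging_config.py` may choose different keys.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object, tagged with the service name."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,        # "api" or "worker"
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        })


def make_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

One-object-per-line JSON output is what lets ELK or Loki ingest the logs without any parsing rules.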

🧪 Testing

Run tests locally:

pytest -v

What is tested:

  • Upload API behavior
  • Job status API (Redis cache + DB fallback)
  • Celery task enqueue (mocked)
  • Redis isolation per test

🐳 Running Locally

1. Create environment file

cp .env.example .env

2. Build and start services

docker-compose up --build

3. API will be available at

http://localhost:8000

Swagger UI:

http://localhost:8000/docs

🧠 Key Backend Concepts Demonstrated

  • Async API design
  • Background job orchestration
  • Cache-first read strategy
  • Token-bucket rate limiting
  • Structured logging
  • Clean architecture separation
  • Realistic testing strategy

📌 Notes

  • This project uses only free and open-source technologies
  • Designed as a backend-focused portfolio project
  • Authentication, database migrations, and CI are intentionally out of scope
