A production-style, fully async backend service built using open-source technologies only.
This project demonstrates how to design and implement a scalable document processing system with background jobs, caching, rate limiting, pagination, structured logging, and proper testing.
This repository is intentionally focused on backend engineering practices (no UI, no paid services).
The system allows users to:
- Upload documents (images / simple PDFs)
- Process them asynchronously using a background worker
- Extract text using OCR
- Track job status in real time
- Fetch results efficiently using Redis caching
All APIs are non-blocking and built using async Python.
- FastAPI (Async) — REST API layer
- SQLAlchemy Async ORM — Database access
- MySQL — Primary persistent storage
- Celery + Redis — Background job processing
- Redis — Caching + rate limiting
- Tesseract OCR — Text extraction (open source)
- Docker + Docker Compose — Local orchestration
- pytest — Async unit testing
- Document upload API
- Background OCR processing
- Job status tracking
- OCR result storage
- Async I/O across API and DB layer
- Redis caching for job status and results
- Token-bucket rate limiting using Redis (per client)
- Pagination and filtering for document listing APIs
- Structured JSON logging for API and worker
- Error handling for failed OCR jobs
- Separation of concerns (API, services, workers)
- Async unit tests using `pytest` and `pytest-asyncio`
- API tests for upload and job status
- Redis behavior validation
- Celery task enqueue mocked during tests
```
document-processor/
├── app/
│   ├── api/
│   │   └── v1/
│   │       ├── upload.py
│   │       ├── jobs.py
│   │       └── documents.py
│   ├── core/
│   │   ├── config.py
│   │   ├── redis_client.py
│   │   ├── ratelimit.py
│   │   └── logging_config.py
│   ├── db/
│   │   ├── base.py
│   │   ├── session.py
│   │   └── models.py
│   ├── services/
│   │   ├── storage.py
│   │   └── ocr.py
│   ├── workers/
│   │   ├── celery_app.py
│   │   └── tasks.py
│   └── main.py
├── tests/
│   ├── conftest.py
│   ├── test_upload_api.py
│   └── test_job_status_api.py
├── Dockerfile
├── worker.Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .env.example
└── README.md
```
`POST /v1/documents/upload-document`
- Upload a document for OCR processing
- Returns a `job_id` for tracking
`GET /v1/jobs/{job_id}`
- Returns job status (`pending`, `processing`, `completed`, `failed`)
- Uses Redis cache for fast reads
`GET /v1/documents/list-document`
Query Parameters:
- `limit` (default: 10)
- `offset` (default: 0)
- `filename`
- `content_type`
- `date_from`
- `date_to`
- `sort` (`created_at_asc` / `created_at_desc`)
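The listing semantics (sort, then slice by `offset`/`limit`) can be illustrated with a small in-memory helper; in the repository this is done in the SQL query itself, so this is just a sketch of the behavior:

```python
def paginate(items: list[dict], limit: int = 10, offset: int = 0,
             sort: str = "created_at_desc") -> list[dict]:
    """Sort by created_at, then apply offset/limit (list-endpoint semantics)."""
    reverse = sort == "created_at_desc"
    ordered = sorted(items, key=lambda d: d["created_at"], reverse=reverse)
    return ordered[offset : offset + limit]

# 25 fake documents; page 2 of 10 in ascending order is ids 10..19.
docs = [{"id": i, "created_at": i} for i in range(25)]
page = paginate(docs, limit=10, offset=10, sort="created_at_asc")
```

Filtering by `filename`, `content_type`, and the date range works the same way: narrow the set first, then sort and slice.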
- Implemented using Redis Token Bucket algorithm
- Per-client (IP-based) throttling
- Configurable capacity and refill rate via environment variables
- Returns HTTP `429 Too Many Requests` when the limit is exceeded
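The refill math of the token bucket can be sketched in pure Python. The repository keeps this state in Redis keyed per client IP so it works across processes; the in-memory class below only illustrates the algorithm, and the names are hypothetical:

```python
import time

class TokenBucket:
    """In-memory token bucket; capacity and refill rate would come from env vars."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with 429 Too Many Requests
```

Storing `tokens` and `updated` in a Redis hash (updated atomically, e.g. via a Lua script) gives the same behavior shared across API workers.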
- Structured JSON logs
- Separate service identifiers for:
  - API (`service: api`)
  - Worker (`service: worker`)
- Request-level logging middleware
- Logs are ready for centralized log systems (ELK, Loki, etc.)
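A structured JSON formatter of this kind can be built on the standard library alone; the field names below are illustrative, not the project's exact schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, tagged with a service identifier."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # "api" or "worker"

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("document uploaded")  # emitted as a single JSON line
```

Because each line is self-describing JSON, shippers for ELK or Loki can ingest it without extra parsing rules.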
Run tests locally:

```
pytest -v
```

What is tested:
- Upload API behavior
- Job status API (Redis cache + DB fallback)
- Celery task enqueue (mocked)
- Redis isolation per test
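Mocking the Celery enqueue keeps the tests free of a running broker. A minimal sketch with `unittest.mock` (the function and task names here are hypothetical; the real task lives under `app/workers/tasks.py`):

```python
from unittest import mock

def enqueue_ocr(task, document_id: str) -> None:
    # Celery's .delay() schedules the task on the broker; in tests the
    # task object is replaced by a Mock so nothing is actually enqueued.
    task.delay(document_id)

# In a real test, mock.patch would swap the imported task in place.
fake_task = mock.Mock()
enqueue_ocr(fake_task, "doc-1")
fake_task.delay.assert_called_once_with("doc-1")
```

The same pattern is used with `mock.patch("app.workers.tasks....")` inside the upload-API tests so the endpoint can be exercised end to end without a worker.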
Copy the environment config, then build and start the stack:

```
cp .env.example .env
docker-compose up --build
```

The API is available at http://localhost:8000
Swagger UI:
http://localhost:8000/docs
- Async API design
- Background job orchestration
- Cache-first read strategy
- Token-bucket rate limiting
- Structured logging
- Clean architecture separation
- Realistic testing strategy
- This project uses only free and open-source technologies
- Designed as a backend-focused portfolio project
- Authentication, database migrations, and CI are intentionally out of scope