DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.
flowchart LR
A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
B --> C["DSPy Pipeline"]
C --> D["Validated JSON"]
style A fill:#B5A89A,stroke:#8a7e72,color:#fff
style B fill:#E87461,stroke:#c25a49,color:#fff
style C fill:#E87461,stroke:#c25a49,color:#fff
style D fill:#B5A89A,stroke:#8a7e72,color:#fff
MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module, so it can be optimized with labeled data for your specific use case.
Why MOSAICX? -- Fully local (no PHI leaves your machine), schema-driven (define exactly what to extract), dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (improve accuracy with your own labeled data). One CLI for radiology, pathology, de-identification, and summarization.
# Install MOSAICX
pip install mosaicx # core
pip install 'mosaicx[mcp]' # + MCP server for AI agents
pip install 'mosaicx[all]' # everything
# Start a local LLM (Apple Silicon via vLLM-MLX)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 --port 8000
# Point MOSAICX at it
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8
export MOSAICX_API_BASE=http://localhost:8000/v1
# Extract structured data from a report
mosaicx extract --document report.pdf --mode radiology
Tip
Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.
| Capability | Commands | Guide |
|---|---|---|
| Extract structured data from clinical documents | mosaicx extract, mosaicx batch | Pipelines |
| Create and manage templates for custom extraction targets | mosaicx template create / list / refine | Schemas & Templates |
| De-identify reports (LLM + regex belt-and-suspenders) | mosaicx deidentify | CLI Reference |
| Summarize patient timelines across multiple reports | mosaicx summarize | CLI Reference |
| Optimize pipelines with labeled data (DSPy) | mosaicx optimize, mosaicx eval | Optimization |
| Extend with custom pipelines, MCP server, Python SDK | mosaicx pipeline new, mosaicx mcp serve | Developer Guide |
Run any command with --help for full options. Complete reference: docs/cli-reference.md
# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology
# Template-driven extraction (define your own fields)
mosaicx template create --describe "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --template EchoReport
# Batch-process a folder of reports
mosaicx batch --input-dir ./reports --output-dir ./structured --mode radiology --format jsonl
# De-identify a clinical note
mosaicx deidentify --document note.txt
# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001
See the full CLI Reference for every flag and option.
Important
Data stays on your machine. MOSAICX runs against a local inference server by default -- no external API calls, no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.
MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware -- override with env vars.
| Backend | Port | Example |
|---|---|---|
| Ollama | 11434 | Works out-of-the-box, no config needed |
| llama.cpp | 8080 | llama-server -m model.gguf --port 8080 |
| vLLM | 8000 | vllm serve gpt-oss:120b |
| SGLang | 30000 | python -m sglang.launch_server --model-path gpt-oss:120b |
| vLLM-MLX | 8000 | vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 (Apple Silicon) |
export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1 # point at your server
export MOSAICX_API_KEY=dummy # or your real key for cloud APIs
SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
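Before running an extraction, it can help to confirm that the server behind MOSAICX_API_BASE is actually reachable. The sketch below is not part of MOSAICX: `models_url` and `check_backend` are hypothetical helper names, and it assumes the backend exposes the usual OpenAI-style `/models` route and that `curl` is installed.

```shell
# Hypothetical helper (not a MOSAICX command): derive the model-listing URL
# from MOSAICX_API_BASE. Aborts with a message if the variable is unset.
models_url() {
  base="${MOSAICX_API_BASE:?set MOSAICX_API_BASE first}"
  # Strip one trailing slash so ".../v1/" and ".../v1" give the same URL.
  printf '%s/models\n' "${base%/}"
}

# Hypothetical helper: ask the server which models it serves.
# -f fails on HTTP errors, -sS stays quiet but still prints errors,
# --max-time stops the request from hanging when nothing is listening.
check_backend() {
  curl -fsS --max-time 5 \
    -H "Authorization: Bearer ${MOSAICX_API_KEY:-dummy}" \
    "$(models_url)"
}
```

After starting the inference server, running `check_backend` should print a JSON list of models if the endpoint is reachable; a non-zero exit status suggests the server is down or the base URL is wrong.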
| Engine | Approach | Best for |
|---|---|---|
| Surya | Layout detection + recognition | Clean printed text, fast |
| Chandra | Vision-Language Model (Qwen3-VL 9B) | Handwriting, complex layouts, tables |
By default both engines run in parallel, score each page, and pick the best result. Override with MOSAICX_OCR_ENGINE=surya or chandra.
# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8 # model name
export MOSAICX_API_BASE=http://localhost:8000/v1 # server URL
export MOSAICX_API_KEY=dummy # or real key for cloud
# View active config
mosaicx config show
Full variable reference, .env file setup, and backend scenarios: docs/configuration.md
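One way to keep these variables out of your shell profile is to store them in a `.env` file and export them for the current session. This is a generic shell sketch using the values from the examples in this README; whether MOSAICX reads `.env` on its own is covered in docs/configuration.md and is not assumed here.

```shell
# Create .env with the essential variables, unless one already exists.
[ -e .env ] || cat > .env <<'EOF'
MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=dummy
EOF

# Export every assignment in the file to the current shell session.
set -a      # auto-export all variables assigned from here on
. ./.env
set +a      # stop auto-exporting

# Confirm the session now carries the values.
env | grep '^MOSAICX_'
```

Because the variables are exported, any `mosaicx` command started from the same shell inherits them.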
| Guide | Description |
|---|---|
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Pipelines | Pipeline inputs/outputs, JSONL formats |
| Schemas & Templates | Create and manage extraction schemas |
| Optimization | Improve accuracy with DSPy optimizers |
| Configuration | Env vars, backends, OCR, export formats |
| MCP Server | AI agent integration via MCP |
| Developer Guide | Custom pipelines, Python SDK |
| Architecture | System design, key decisions |
git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]" # or: uv sync --group dev
pytest tests/ -q
See Developer Guide for custom pipelines and the Python SDK.
@software{mosaicx2025,
title = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
author = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
year = {2025},
url = {https://github.com/DIGIT-X-Lab/MOSAICX},
doi = {10.5281/zenodo.17601890}
}
Apache 2.0 -- see LICENSE.
Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
