Voice Stack is the speech stack I built for my own homelab so I could keep ASR and TTS workloads close to the data I care about—mostly Bazarr in my media server and my OpenWebUI containers so I can literally talk to my AI. It now doubles as a practical portfolio piece: a pair of FastAPI microservices, tuned for Debian 12 on bare metal with NVIDIA GPUs, but still easy to iterate on from a laptop. This README is the canonical entry point—installation, container builds, development workflow, and deep links to every supporting document all live here.
- Install as Debian Service (systemd)
- Build the Production Container Image
- Development Environment and Workflow
- Configuration Overview
- Environment Variables (Services and Containers)
- API Surface (OpenAI-Compatible Highlights)
- Reference Materials
- Repository Structure
- Licensing
The project ships with scripted installers that provision isolated Hatch environments, accept the Coqui license, configure .env, and register long-running services. Use them whenever you want hands-free setup on Debian 11+/Ubuntu 20.04+.
- Python 3.10+ with `pip`
- `git`, `ffmpeg`, `libsndfile1`, `libportaudio2`, `build-essential`
- Optional but recommended: CUDA 12.1+, cuDNN, and NVIDIA drivers ≥ 525 for GPU workloads
Install the prerequisites with the helper script:

```bash
cd /opt
git clone https://github.com/vyscava/voice-stack.git
cd voice-stack
sudo ./scripts/install_system_deps.sh
```

Then run the service installers:

```bash
# Automatic ASR installation (system service)
sudo ./scripts/install-asr.sh

# Automatic TTS installation (system service)
sudo ./scripts/install-tts.sh

# To install as user services instead, omit sudo.
```

Each script will:
- Bootstrap Hatch if needed and create `.venv-asr` / `.venv-tts`
- Install PyTorch (CUDA-aware when GPUs are present)
- Render `.env` from the template and prompt for overrides when missing
- Register and start the systemd unit (`voice-stack-asr` or `voice-stack-tts`)
```bash
# Health
curl http://localhost:5001/health   # ASR
curl http://localhost:5002/health   # TTS

# Service lifecycle
sudo systemctl status voice-stack-asr
sudo systemctl restart voice-stack-tts

# Logs
sudo journalctl -u voice-stack-asr -f

# Updating after pulling new code
git pull --rebase
sudo ./scripts/update-asr.sh
sudo ./scripts/update-tts.sh
```

Troubleshooting, environment variable details, and uninstall steps are fully documented in scripts/SERVICE_INSTALLATION.md. Refer to that guide whenever you need to override the unit files, relocate model caches, or run behind a reverse proxy.
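As one concrete example of the unit-override pattern, here is a minimal sketch that relocates model caches via a systemd drop-in. The cache variable names (HF_HOME, XDG_CACHE_HOME) are assumptions about the underlying model libraries, not settings defined by this repository, so verify them against the installation guide first.

```bash
# Open (or create) a drop-in override for the ASR unit
sudo systemctl edit voice-stack-asr

# Example drop-in contents (cache locations are illustrative assumptions):
# [Service]
# Environment=HF_HOME=/data/model-cache
# Environment=XDG_CACHE_HOME=/data/model-cache

# Reload and restart so the override takes effect
sudo systemctl daemon-reload
sudo systemctl restart voice-stack-asr
```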
The repository maintains a single production image (Dockerfile) that switches between ASR and TTS via SERVICE_MODE. CUDA runtimes are detected automatically when the container is launched with --gpus.
```bash
docker build -t voice-stack:latest .
```

```bash
# ASR on GPU
docker run --rm -d \
  --gpus all \
  -p 5001:5001 \
  -e SERVICE_MODE=asr \
  -e ASR_DEVICE=cuda \
  voice-stack:latest

# TTS on GPU, mounting custom voices
docker run --rm -d \
  --gpus all \
  -p 5002:5002 \
  -e SERVICE_MODE=tts \
  -e TTS_DEVICE=cuda \
  -v $(pwd)/voices:/app/voices \
  voice-stack:latest
```

```bash
# Minimal ASR transcription
curl -X POST http://localhost:5001/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample.mp3" \
  -F "model=base"

# Minimal TTS generation
curl -X POST http://localhost:5002/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from Voice Stack", "voice": "speaker_en", "response_format": "mp3"}' \
  --output speech.mp3
```

More involved deployment patterns—Compose files, CI images, release automation, and troubleshooting—are captured in DOCKER_DEPLOYMENT.md.
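On hosts without a GPU the same image runs on CPU; here is a minimal sketch, assuming `ASR_DEVICE=cpu` is honored the same way the Compose defaults described later suggest:

```bash
# ASR on CPU only: omit --gpus and point the device at the CPU
docker run --rm -d \
  -p 5001:5001 \
  -e SERVICE_MODE=asr \
  -e ASR_DEVICE=cpu \
  voice-stack:latest
```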
Voice Stack was designed for repeatable local development with Hatch managing environments and Nox mirroring CI. The flow below keeps both ASR and TTS services runnable with hot reload while staying aligned with the automation that publishes releases.
| Platform | Requirements |
|---|---|
| macOS (dev) | Python 3.11+, brew install ffmpeg, pip install --upgrade hatch nox pre-commit, optional uv |
| Debian/Ubuntu (dev/prod) | Python 3.11+, apt install ffmpeg libsndfile1 libportaudio2, CUDA 12.1+ for GPU |
```bash
git clone https://github.com/vyscava/voice-stack.git
cd voice-stack
python -m pip install --upgrade hatch nox pre-commit
hatch env create   # creates the default dev environment
pre-commit install
```

Run the services in separate shells:

```bash
hatch run run_asr
hatch run run_tts
```

Both commands expose Swagger UI under /docs and reload automatically when code under src/ changes.
| Purpose | Command |
|---|---|
| Launch dev shell | hatch shell |
| Start ASR / TTS | hatch run run_asr / hatch run run_tts |
| Start with debugpy | hatch run run_asr_dbg / hatch run run_tts_dbg |
| Format code | hatch run fmt |
| Ruff + lint checks | hatch run lint |
| MyPy type checking | hatch run typecheck |
| Tests (unit mix) | hatch run test |
| Coverage suite | hatch run cov |
nox mirrors the CI layout and is the recommended way to validate changes against multiple Python interpreters:
- `nox -s fmt` — formatting (Black + Ruff import fixes)
- `nox -s lint` — Ruff linting plus style enforcement
- `nox -s typecheck` — MyPy using `pyproject.toml` settings
- `nox -s tests` — full pytest matrix across configured Python versions
- `nox -s ci` — convenience bundle for local CI parity
- Create or update feature branches normally (`git checkout -b feature/...`).
- Keep `.env` synced with `example.env` or the service installers if you need custom defaults.
- Run `hatch run fmt && hatch run lint` before committing; pre-commit hooks enforce the same pair.
- Execute the relevant `hatch run test-*` command (ASR, TTS, core, utils, integration) or just `nox -s tests` for wider coverage.
- When touching Docker or deployment files, rebuild locally (`docker-compose build`) so Compose + GPU flags remain valid.
- After staging or committing work, run `pre-commit run` so the configured Nox-backed hooks reformat and lint the files touched in your change set.
- If you need a repo-wide sweep—for example before a major release—use `pre-commit run --all-files` to execute the same Nox checkers across the entire tree.
- Push only after these hooks pass locally; the GitLab pipeline runs the exact same tooling, so keeping them green avoids MR churn.
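Taken together, a typical pre-push loop built only from the commands above looks like this:

```bash
# Format, lint, type-check, run the test matrix, then let pre-commit sweep the tree
hatch run fmt && hatch run lint
hatch run typecheck
nox -s tests
pre-commit run --all-files
```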
All runtime configuration is expressed through environment variables; both the installers and Docker entrypoint honor .env values. The most commonly tuned fields are:
| Category | Variables |
|---|---|
| General API | PROJECT_NAME, API_V1_STR, CORS_ORIGINS, LOG_LEVEL |
| ASR | ASR_DEVICE, ASR_MODEL, ASR_ENGINE, ASR_VAD_ENABLED, ASR_LANGUAGE |
| TTS | TTS_DEVICE, TTS_MODEL, TTS_VOICE_DIR, TTS_AUTO_LANG, TTS_SAMPLE_RATE, TTS_MAX_CHARS |
| Debug | DEBUGPY_ENABLE, DEBUGPY_HOST, DEBUGPY_PORT, DEBUGPY_WAIT_FOR_CLIENT |
The .env generated by the install scripts includes inline comments. For hand-crafted environments, consult scripts/SERVICE_INSTALLATION.md#configuration.
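For orientation, a minimal hand-written .env might look like the sketch below; the variable names come from the tables above, but the values are placeholders rather than recommended defaults:

```bash
# Illustrative .env sketch; real defaults live in the templates and installer output
LOG_LEVEL=INFO
CORS_ORIGINS=*
ASR_DEVICE=cuda
ASR_MODEL=base
TTS_DEVICE=cuda
```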
Both the systemd installers and Docker workflows ultimately rely on the same variable surface. Use scripts/.env.production.asr and scripts/.env.production.tts as the authoritative references when promoting changes across environments.
| Scope | Key Variables | Notes |
|---|---|---|
| Server/runtime | ENV, HOST, ASR_PORT/TTS_PORT, LOG_LEVEL, CORS_ORIGINS | Mirrors FastAPI settings regardless of service mode |
| Engines | ASR_ENGINE, ASR_MODEL, ASR_DEVICE, ASR_COMPUTE_TYPE, ASR_CPU_THREADS, ASR_NUM_OF_WORKERS | CPU/GPU tuning plus model cache paths |
| TTS specifics | TTS_ENGINE, TTS_MODEL, TTS_DEVICE, TTS_VOICES_DIR, TTS_SAMPLE_RATE, TTS_MAX_CHARS, TTS_MIN_CHARS | Voice cloning, chunking, and audio format controls |
| Language controls | ASR_LANGUAGE, TTS_DEFAULT_LANG, TTS_AUTO_LANG, TTS_LANG_HINT, TTS_FORCE_LANG | Force or hint language detection |
| Diagnostics | DEBUGPY_ENABLE, DEBUGPY_HOST, DEBUGPY_PORT, DEBUGPY_WAIT_FOR_CLIENT | Switch off for production services |
Install scripts will prompt for overrides when .env is missing, but you can also copy the production templates directly and keep per-service overrides in the same file.
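For example, a hand-rolled host can seed its configuration from the ASR template and keep overrides in the same file; a small sketch (the repository-root .env is the assumed target, as used elsewhere in this README):

```bash
# Seed a local .env from the authoritative production template, then edit overrides in place
cp scripts/.env.production.asr .env
${EDITOR:-nano} .env   # e.g. adjust ASR_DEVICE, ports, or cache paths
```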
The container image reads the same .env values, but Compose provides sensible defaults in docker-compose.yml:
| Container | Key Variables | Description |
|---|---|---|
| Shared | SERVICE_MODE (asr or tts), ENV, LOG_LEVEL, HOST, CORS_ORIGINS | Determines which FastAPI app boots |
| ASR | ASR_PORT, ASR_ENGINE, ASR_DEVICE, ASR_MODEL, ASR_COMPUTE_TYPE, ASR_BEAM_SIZE, ASR_MAX_WORKERS | Swap ASR_DEVICE to cuda when running with --gpus all |
| TTS | TTS_PORT, TTS_ENGINE, TTS_DEVICE, TTS_MODEL, TTS_SAMPLE_RATE, TTS_MAX_CHARS, TTS_MIN_CHARS, TTS_RETRY_STEPS, TTS_DEFAULT_LANG, TTS_AUTO_LANG, TTS_VOICES_DIR | Mount ./voices into /app/voices for custom clones |
You can mount a single .env into the container (- ./.env:/app/.env:ro) to avoid duplicating values. Anything not explicitly set in Compose falls back to the defaults described in the production templates.
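If you run containers without Compose, the same file can also be injected through Docker's standard --env-file flag rather than a bind mount; a small sketch:

```bash
# Reuse the shared .env as container environment variables (no bind mount needed)
docker run --rm -d \
  --env-file .env \
  -p 5002:5002 \
  -e SERVICE_MODE=tts \
  voice-stack:latest
```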
Both services maintain OpenAI-compatible routes for drop-in SDK use, while the ASR side exposes native Bazarr endpoints that I rely on in my media server stack. Health instrumentation also includes /health/detailed for resource and model telemetry.
| Service | Endpoint | Notes |
|---|---|---|
| ASR (OpenAI) | POST /v1/audio/transcriptions | Multipart uploads for Whisper transcription |
| ASR (OpenAI) | POST /v1/audio/transcriptions/verbose | Segment-level metadata output |
| ASR (OpenAI) | POST /v1/audio/translations | Translate to English |
| ASR (Bazarr) | POST /bazarr/asr | Subtitle-friendly output formats (JSON, SRT, VTT, TXT, TSV, JSONL) |
| ASR (Bazarr) | POST /bazarr/detect-language | Language detection with tunable offsets |
| TTS (OpenAI) | POST /v1/audio/speech | Multi-format (MP3, OPUS, AAC, FLAC, WAV, PCM) generation |
| TTS (OpenAI) | GET /v1/audio/voices | Enumerate cloned voice samples |
| Both | GET /v1/models | Discover loaded models |
| Both | /health, /healthz, /healthcheck | Liveness probes |
| Both | /health/detailed | Exposes memory/swap usage, active concurrency slots, and model load state (see src/asr/app.py and src/tts/app.py) |
See tests/ for cURL, SDK, and integration test coverage whenever you need concrete payloads.
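For quick smoke checks, the discovery and health routes above can be exercised directly with the default ports used throughout this README:

```bash
# Model discovery, detailed health telemetry, and voice enumeration
curl http://localhost:5001/v1/models
curl http://localhost:5001/health/detailed
curl http://localhost:5002/v1/audio/voices
```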
- `scripts/SERVICE_INSTALLATION.md` — full systemd guide (installation, updates, troubleshooting, advanced networking)
- `DOCKER_DEPLOYMENT.md` — Docker/Compose builds, CI images, release automation, and operational tips
- `docs/ARCHITECTURE.md` — component breakdowns, request flows, and diagrams; use it when explaining the platform
- `docs/CONCURRENCY_FEATURES.md` — thread-safety and queueing notes for high-throughput workloads
- `docs/LOAD_TESTING.md` — methodology and reference numbers for stress tests
- `docs/WHISPER_CHEAT_SHEET.md` — practical parameters for each Whisper model size
- `docs/README.md` — generated repository map linking symbols to files (useful when navigating large updates)
- Shout out to SubGen, the Bazarr FastAPI project whose subtitle-friendly routing and schema design heavily influenced the Bazarr endpoints in this stack.
- Shout out to whisper-asr-webservice for the early inspiration around OpenAI-compatible ASR workflows and deployment ergonomics.
```text
voice-stack/
├── src/
│   ├── asr/             # ASR microservice
│   ├── tts/             # TTS microservice
│   ├── core/            # Shared settings/logging
│   └── utils/           # Audio, language, queues, text helpers
├── scripts/             # Installers, system helpers, tooling automation
├── docs/                # Architecture, concurrency, load testing, cheat sheets
├── voices/              # Sample voice seeds for TTS cloning
├── tests/               # Pytest suites (unit + integration)
├── Dockerfile           # Unified production image
├── Dockerfile.ci        # CI automation image
├── docker-compose.yml   # Local multi-service orchestration
├── pyproject.toml       # Hatch project file
└── .gitlab-ci.yml       # Pipeline configuration
```
Voice Stack is released under the MIT License © 2025 Vitor Bordini Garbim. Building or distributing the TTS components also implies acceptance of the Coqui Public Model License; refer to scripts/accept_coqui_license.sh for the automated acceptance flow bundled with the project.
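If you script your own setup rather than using the installers, the acceptance step can be run directly; a one-line sketch:

```bash
# Record Coqui Public Model License acceptance (the install scripts run this for you)
./scripts/accept_coqui_license.sh
```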