Echolancer is a multi-speaker, transformer decoder-only English TTS model. We use NeuCodec as the audio tokenizer.
We (my cat and I) release pretrained checkpoints, notebooks, and a technical report.
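To make the architecture concrete, here is a minimal sketch of what decoder-only TTS generation looks like: text tokens go in as the prompt, and the model autoregressively emits NeuCodec audio tokens that are then decoded back to a waveform. All names here (`model`, `codec.decode`, `eos_token_id`) are hypothetical placeholders, not the actual Echolancer API; see the Colab demos for real inference code.

```python
import torch

@torch.no_grad()
def synthesize(model, text_ids: torch.Tensor, codec, max_steps: int = 2048):
    """Greedy sketch: the decoder consumes text tokens as the prompt,
    then emits NeuCodec audio tokens one at a time until EOS."""
    seq = text_ids  # [1, T_text]
    for _ in range(max_steps):
        logits = model(seq)                               # [1, T, vocab]
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_tok.item() == model.eos_token_id:         # placeholder attr
            break
        seq = torch.cat([seq, next_tok], dim=1)
    audio_codes = seq[:, text_ids.shape[1]:]              # strip text prompt
    return codec.decode(audio_codes)                      # hypothetical call
```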
| Name | Params | Training Data | Speaker Control | Download | Demo |
|---|---|---|---|---|---|
| Echolancer Stage 3 ZS | ~1.3B | Base+7k hours multi-speaker | ✔️ Zero-shot (ECAPA-TDNN) | HuggingFace | |
| Echolancer Stage 3 Base | ~1.3B | 30K+ hours multi-speaker | ❌ None (random) | HuggingFace | N/A |
| Echolancer Stage 2 ZS | ~550M | Base+7k hours multi-speaker | ✔️ Zero-shot (ECAPA-TDNN) | HuggingFace | |
| Echolancer Stage 2 Base | ~550M | 30K+ hours multi-speaker | ❌ None (random) | HuggingFace | N/A |
| Echolancer Stage 1 ZS | ~177M | Base+7k hours multi-speaker | ✔️ Zero-shot (ECAPA-TDNN) | HuggingFace | |
| Echolancer Stage 1 Base | ~177M | 30K+ hours multi-speaker | ❌ None (random) | HuggingFace | N/A |
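The ZS checkpoints in the table above are conditioned on a speaker embedding extracted from a short reference clip. As a rough illustration, this is how an ECAPA-TDNN embedding can be extracted with the public SpeechBrain checkpoint; whether this is the exact encoder and checkpoint Echolancer was trained against is an assumption.

```python
import torchaudio
from speechbrain.inference import EncoderClassifier

# Public ECAPA-TDNN speaker encoder trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

# Load a short mono reference clip (16 kHz expected by this checkpoint).
signal, sr = torchaudio.load("reference_speaker.wav")

# Fixed-size speaker embedding used to condition generation: [1, 1, 192].
embedding = encoder.encode_batch(signal)
```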
For inference code, see the Colab demos.
Items marked with ❌ are not yet available but are high priority.
- ✔️ Base model without speaker conditioning
- ✔️ Inference notebook
- ✔️ Zero-shot
- ✔️ Multi-GPU training
- 🟡 LoRA finetuning (already supported; guide still to be written; see the sketch after this list)
- ❌ Inference with KV cache
- ❌ ONNX export
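Until the LoRA guide lands, here is a minimal sketch of the idea: freeze a pretrained linear layer and learn a low-rank update `W + (alpha/r) * B @ A` on top of it. The rank, scaling, and which layers to wrap are illustrative choices, not the repo's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        # B starts at zero so the wrapped layer initially matches the base.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In practice only the attention projection layers are usually wrapped, and only the `A`/`B` matrices are passed to the optimizer.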
The base model can be finetuned to adapt it to a new voice (or several). You can do either full finetuning or LoRA. For LoRA we recommend at least 10 minutes of audio; for full finetuning, substantially more.
To finetune on a single GPU:

```bash
python train.py --train_config config/train_config.yaml --model_config config/model_config.yaml --shards_dir /path/to/shards --out_dir output
```

For multi-GPU training:

```bash
torchrun --nproc_per_node=NUM_GPUS train.py --train_config config/train_config.yaml --model_config config/model_config.yaml --shards_dir /path/to/shards --out_dir output
```

TODO: expand this
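`torchrun --nproc_per_node=NUM_GPUS` launches one training process per GPU on the node; set `NUM_GPUS` to the number of devices you want to use.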
This codebase and model weights are released under the MIT license; basically, do what you want.
For business or other formal inquiries, please e-mail nika109021@gmail.com.