Speech AI @ NVIDIA. Working on training and deployment for speech models.

WeChat: ykzhang2020 | Email: zhangyuekai@foxmail.com

| Recipe | Description | Link |
|---|---|---|
| Whisper Fine-tuning | Fine-tune Whisper large-v2 on multiple Chinese datasets using icefall | icefall/whisper |
| Speech LLM | Whisper encoder + Qwen2 LLM for Chinese ASR (Qwen-Audio style) | icefall/ASR_LLM |
| Speech-to-Speech | Qwen2.5-Omni-Like: Whisper → Adapter → Thinker LLM → Talker LLM → CosyVoice2 | mair-hub/qwen_omni_like |

| Direction | Method | Framework | Link |
|---|---|---|---|
| Speech Understanding | GRPO on Qwen2-Audio / Qwen2.5-Omni | WeST | west/examples/grpo |
| Speech Generation | GRPO / DAPO on CosyVoice2 LLM (veRL + SenseVoice reward) | veRL | mair-hub/cosyvoice_llm |

All solutions use NVIDIA Triton Inference Server and come with a Docker Compose quick-start.

| Model | Backend | Streaming | Link |
|---|---|---|---|
| Whisper | TensorRT-LLM | — | sherpa/triton/whisper |
| Fun-ASR-Nano | vLLM | — | Fun-ASR-vllm |
| FireRedASR-AED | TensorRT-LLM | — | FireRedASR/triton_tensorrt |
| FireRedASR2-AED | TensorRT-LLM | — | FireRedASR2S/triton_tensorrt |
| SenseVoice / Paraformer | ONNX | — | FunASR/triton_gpu |
| Conformer (WeNet) | ONNX | Yes | wenet/runtime/gpu |
| Zipformer (Transducer) | TensorRT / ONNX | Yes | sherpa/triton |

| Model | Type | Streaming | Link |
|---|---|---|---|
| F5-TTS | Diffusion | — | F5-TTS/triton_trtllm |
| Spark-TTS | LLM | Yes | Spark-TTS/triton_trtllm |
| CosyVoice 2 | LLM + Diffusion | Yes | CosyVoice/triton_trtllm |
| ZipVoice | Diffusion (distilled) | — | ZipVoice/nvidia_triton |

| Tool | Description | Link |
|---|---|---|
| Triton-ASR-Client | Benchmarking client with CER/WER eval, streaming & concurrency support | Triton-ASR-Client |
| Triton-OpenAI-Speech | OpenAI-compatible /v1/audio/speech API for Triton TTS backends | Triton-OpenAI-Speech |
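
For the OpenAI-compatible `/v1/audio/speech` route listed above, a client call might look like the sketch below. It only assumes the request shape of the public OpenAI speech API (`model`, `input`, `voice`, `response_format`). The base URL, port, model name, and voice name are placeholders, not values taken from Triton-OpenAI-Speech, so check that repository for what your deployment actually accepts.

```python
import json
import urllib.request

# Assumption: the server is reachable locally on port 8000; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, model="tts-model", voice="default", response_format="wav"):
    """Build (but do not send) a POST request for an OpenAI-style /v1/audio/speech endpoint.

    The model and voice defaults are hypothetical placeholders.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="out.wav", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With a server running, `synthesize("Hello from Triton.")` would save the returned audio to `out.wav`. Building the request is kept separate from sending it so the payload can be inspected without a live backend.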





