Add High-Level Python API with Automatic Voice Loading #159
iamthehimansh wants to merge 3 commits into microsoft:main
Conversation
Add easy-to-use Python inference API with one-line synthesis, automatic default voice loading, and comprehensive documentation. Key features:
- `synthesize_speech()` one-line function
- Automatic default voice loading (7 voices included)
- Iterator support for LLM integration
- Complete documentation and examples
@microsoft-github-policy-service agree

Hey team,

The two Markdown sections are quite verbose. Could you provide a more concise guide that focuses on the essentials? It would also be helpful to include a minimal, clear example. Is there any further room to simplify the code?

Sure, let me look into this.

@YaoyaoChang can you review? I made changes to the docs.

@YaoyaoChang can you check my implementation?
I’ve been busy these days and will take care of it as soon as I’m available. |
Pull request overview
Adds a high-level Python inference API intended to make VibeVoice easier to use via one-call synthesis, automatic default voice loading, streaming playback utilities, and accompanying documentation/examples.
Changes:
- Added `vibevoice/inference.py` with a high-level streaming TTS wrapper, playback helper, and convenience functions (`synthesize_speech`, `list_default_voices`).
- Updated `vibevoice/__init__.py` to export the new high-level API and define `__version__`.
- Added `examples/simple_inference.py` and `docs/python_inference.md` to document and demonstrate the API.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| vibevoice/inference.py | Introduces high-level inference + playback API with default voice auto-loading and streaming generation/playback. |
| vibevoice/__init__.py | Exposes new high-level API at package top-level and sets version. |
| examples/simple_inference.py | Example script showing how to run streaming inference and playback. |
| docs/python_inference.md | API guide and usage examples for the new high-level interface. |
```python
# High-level API
from .inference import (
    VibeVoiceStreamingTTS,
    AudioPlayer,
    synthesize_speech,
    list_default_voices
)
```
vibevoice.__init__ eagerly imports the high-level inference module, which pulls in heavy deps (torch/numpy) and can trigger import-time side effects (e.g., sounddevice warning prints). Consider lazy-exporting these symbols (e.g., via __getattr__) or moving optional imports inside the functions/classes that need them, so importing vibevoice stays lightweight.
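A minimal sketch of the lazy-export idea via PEP 562 module `__getattr__`; the symbol names come from this PR, but the structure is an assumption, not the implemented fix:

```python
# vibevoice/__init__.py -- heavy deps (torch, numpy, sounddevice) are only
# imported when one of these symbols is first accessed, not on `import vibevoice`.
_LAZY_EXPORTS = {
    "VibeVoiceStreamingTTS",
    "AudioPlayer",
    "synthesize_speech",
    "list_default_voices",
}

def __getattr__(name):
    if name in _LAZY_EXPORTS:
        from . import inference  # deferred import happens here
        return getattr(inference, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

def __dir__():
    return sorted(list(globals()) + list(_LAZY_EXPORTS))
```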
```python
default_voice_dir = Path(__file__).parent.parent / "demo" / "voices" / "streaming_model"
if default_voice_dir.exists():
    # Look for a default voice (prefer en-Mike_man.pt or first available)
    default_voices = list(default_voice_dir.glob("*.pt"))
    if default_voices:
        # Prefer en-Mike_man.pt if available
        preferred = default_voice_dir / "en-Mike_man.pt"
        voice_path = preferred if preferred.exists() else default_voices[0]
        print(f"Loading default voice prompt from {voice_path.name}")
        self.voice_prompt = torch.load(
            voice_path,
            map_location=device,
            weights_only=False
        )

if self.voice_prompt is None:
    raise RuntimeError(
        "No voice prompt provided and no default voices found. "
        "Please provide a voice_prompt_path or ensure demo/voices/streaming_model/*.pt exists."
    )
```
Default voice loading relies on demo/voices/streaming_model relative to the repo root (Path(__file__).parent.parent / "demo" / ...). When the library is installed from a wheel/sdist, the demo/ directory and .pt assets typically won’t be included unless explicitly packaged, so this will raise at runtime for most users. Consider shipping these voice prompts as package data (e.g., under vibevoice/assets/... + include_package_data), or making the default-voice path configurable / downloadable.
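A sketch of the package-data route, assuming the prompts were moved under a hypothetical `vibevoice/assets/voices/` directory and shipped via `include_package_data`:

```python
from importlib import resources

def default_voice_path(name="en-Mike_man.pt"):
    # files() resolves package data for wheels, sdists, and editable installs.
    # vibevoice/assets/voices/ is an assumed packaging location.
    candidate = resources.files("vibevoice") / "assets" / "voices" / name
    return candidate if candidate.is_file() else None
```

If the package can ever be imported from a zip, wrap the result in `resources.as_file(...)` to get a real filesystem path before handing it to `torch.load`.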
```python
if voice_prompt_path and Path(voice_prompt_path).exists():
    print(f"Loading voice prompt from {voice_prompt_path}")
    self.voice_prompt = torch.load(
        voice_prompt_path,
        map_location=device,
        weights_only=False
    )
else:
```
If voice_prompt_path is provided but the file does not exist, the code silently falls back to a default voice. That makes typos hard to detect and can produce surprising voices. Consider raising FileNotFoundError when a non-None voice_prompt_path is invalid, and only falling back when the parameter is actually None.
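A minimal sketch of the fail-fast behavior requested above; the `else` branch stands in for the existing default-voice code:

```python
if voice_prompt_path is not None:
    voice_path = Path(voice_prompt_path)
    if not voice_path.exists():
        # An explicit but missing path is a caller error; do not fall back.
        raise FileNotFoundError(f"Voice prompt not found: {voice_prompt_path}")
    print(f"Loading voice prompt from {voice_prompt_path}")
    self.voice_prompt = torch.load(voice_path, map_location=device, weights_only=False)
else:
    # Only fall back to the bundled default when no path was given at all.
    ...  # existing default-voice loading
```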
```python
def run_generation():
    with torch.no_grad():
        self.model.generate(
            **inputs,
            audio_streamer=audio_streamer,
            cfg_scale=cfg_scale,
            tokenizer=self.processor.tokenizer,
            generation_config={'do_sample': False},
            all_prefilled_outputs=copy.deepcopy(self.voice_prompt),
        )

generation_thread = Thread(target=run_generation, daemon=True)
generation_thread.start()
```
Exceptions in the background run_generation thread are not handled. If self.model.generate(...) raises, the streamer may never receive an end() signal, causing for audio_chunk in stream: to block indefinitely. Wrap generation in try/except/finally to call audio_streamer.end() and propagate the exception back to the caller (similar to demo/web/app.py).
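A sketch of the fix described above, mirroring the pattern in demo/web/app.py: always signal end-of-stream and surface the worker's exception to the caller (the `errors` list is an illustrative transport, not existing code):

```python
errors = []  # illustrative: carries an exception out of the worker thread

def run_generation():
    try:
        with torch.no_grad():
            self.model.generate(
                **inputs,
                audio_streamer=audio_streamer,
                cfg_scale=cfg_scale,
                tokenizer=self.processor.tokenizer,
                generation_config={'do_sample': False},
                all_prefilled_outputs=copy.deepcopy(self.voice_prompt),
            )
    except Exception as exc:
        errors.append(exc)
    finally:
        audio_streamer.end()  # unblock the consumer even on failure

generation_thread = Thread(target=run_generation, daemon=True)
generation_thread.start()
# ...after the consumer loop has drained the stream:
generation_thread.join()
if errors:
    raise errors[0]
```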
```python
full_audio = [chunk for chunk in audio_stream]

return full_audio if output_file else None
```
In the streaming-playback path, player.play_stream(audio_stream, ...) fully consumes audio_stream, so the subsequent full_audio = [chunk for chunk in audio_stream] will always be empty and does extra work. Remove this, or if you need to both play and collect, tee the iterator (or have AudioPlayer optionally record while playing).
Suggested change:
```diff
-full_audio = [chunk for chunk in audio_stream]
-return full_audio if output_file else None
+return None
```
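If both playback and collection are really needed, a minimal sketch of the record-while-playing idea the reviewer mentions (the wrapper generator is hypothetical, not part of this PR):

```python
collected = []

def record_while_playing(stream):
    # Hypothetical wrapper: remember each chunk as it is handed to the player.
    for chunk in stream:
        collected.append(chunk)
        yield chunk

# Pass record_while_playing(audio_stream) to player.play_stream(...) in place
# of audio_stream; afterwards `collected` holds every chunk that was played.
```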
```python
# Collect text from iterator
text_chunks = list(text_iterator)
full_text = " ".join(text_chunks)
```
text_to_speech_streaming exhausts the entire text_iterator upfront (list(text_iterator)), which blocks until the iterator finishes and defeats the stated use case of LLM token streaming. Either accept only str here, or redesign to incrementally append text and trigger generation in chunks (and avoid inserting extra spaces via ' '.join(...) for already-spaced chunks).
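A rough sketch of the incremental redesign hinted at above: buffer incoming tokens and synthesize at sentence boundaries, so audio can start before the LLM finishes. `_synthesize_chunk` is a hypothetical helper, not an existing method in this PR:

```python
def text_to_speech_streaming(self, text_iterator):
    # Synthesize sentence-sized chunks as tokens arrive instead of
    # exhausting the iterator up front. Tokens keep their own spacing,
    # so no extra " ".join(...) is needed.
    pending = ""
    for token in text_iterator:
        pending += token
        if pending.rstrip().endswith((".", "!", "?")):
            yield from self._synthesize_chunk(pending)  # hypothetical helper
            pending = ""
    if pending.strip():
        yield from self._synthesize_chunk(pending)
```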
```python
if output_file or not play_audio:
    chunks = []
    for chunk in audio_stream:
        chunks.append(chunk)
```
synthesize_speech concatenates chunks unconditionally. When the input text is empty/whitespace, text_to_speech_streaming yields nothing and np.concatenate([]) raises ValueError. Guard for if not chunks: and return None/empty audio early (or raise a clear error).
Suggested change:
```diff
             chunks.append(chunk)
+        if not chunks:
+            print("No audio chunks were generated from the input text; nothing to save or play.")
+            return None
```
```python
def fill_buffer():
    nonlocal buffer, iterator_finished
    for audio_chunk in audio_iterator:
        with buffer_lock:
            buffer = np.concatenate([buffer, audio_chunk])
    iterator_finished = True
```
Real-time playback buffer growth uses buffer = np.concatenate([buffer, audio_chunk]) for every chunk, which causes repeated reallocations/copies (quadratic behavior) and can become a bottleneck for long streams. Consider a deque/ring-buffer approach (e.g., list of chunks + read index) to avoid repeatedly copying the entire buffer.
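A sketch of the chunk-list alternative the reviewer suggests: O(1) appends into a deque plus a read offset, so each sample is copied at most once instead of recopying the whole buffer on every chunk:

```python
from collections import deque
from threading import Lock

import numpy as np

class ChunkBuffer:
    """Append-only chunk queue with a read cursor; avoids quadratic copies."""

    def __init__(self):
        self._chunks = deque()
        self._offset = 0  # read position inside the leftmost chunk
        self._lock = Lock()

    def append(self, chunk):
        with self._lock:
            self._chunks.append(chunk)

    def read(self, n):
        """Return up to n samples, consuming them from the front."""
        out = []
        with self._lock:
            while n > 0 and self._chunks:
                head = self._chunks[0]
                take = head[self._offset:self._offset + n]
                out.append(take)
                n -= len(take)
                self._offset += len(take)
                if self._offset >= len(head):
                    self._chunks.popleft()
                    self._offset = 0
        return np.concatenate(out) if out else np.empty(0, dtype=np.float32)
```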
```python
if play_audio and SOUNDDEVICE_AVAILABLE:
    print("Playing audio...")
    player = AudioPlayer(device_id=speaker_device_id)
```
Variable player is not used.
```python
player = AudioPlayer(device_id=speaker_device_id)
```
```python
import copy
from pathlib import Path
from typing import Iterator, Generator, Optional
```
Import of 'Generator' is not used.
Suggested change:
```diff
-from typing import Iterator, Generator, Optional
+from typing import Iterator, Optional
```
Add easy-to-use Python inference API with one-line synthesis, automatic default voice loading, and comprehensive documentation.
New Features

High-Level API (`vibevoice/inference.py`)
- `synthesize_speech()`: One-line function for text-to-speech synthesis
- `list_default_voices()`: Helper to list available voice presets
- `VibeVoiceStreamingTTS`: High-level TTS class with streaming support
  - Defaults to `demo/voices/streaming_model/en-Mike_man.pt`, falls back to first available
- `AudioPlayer`: Audio playback with speaker selection

Automatic Voice Loading
- en-Davis_man, en-Frank_man, en-Grace_woman, in-Samuel_man

Module Exports (`vibevoice/__init__.py`)

📊 Changes Summary
Lines of Code
Impact
🎯 Key Features Being Added
1. One-Line Synthesis
2. Automatic Voice Loading
3. LLM Integration
4. Complete Documentation
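A hedged usage sketch of the entry points named above, based only on names and parameters visible elsewhere in this PR (`output_file` appears in the review snippets; the positional text argument is an assumption, not a verified signature):

```python
from vibevoice import synthesize_speech, list_default_voices

# 1. One-line synthesis to a file (signature assumed from the PR description).
synthesize_speech("Hello from VibeVoice!", output_file="hello.wav")

# 2. Automatic voice loading: with no voice_prompt_path, the wrapper loads
#    demo/voices/streaming_model/en-Mike_man.pt, or the first .pt it finds.
print(list_default_voices())
```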