Add High-Level Python API with Automatic Voice Loading #159
@@ -0,0 +1,195 @@
# VibeVoice Python Inference Guide

Complete API reference for VibeVoice text-to-speech.

## Table of Contents

- [Quick Start](#quick-start)
- [API Reference](#api-reference)
  - [synthesize_speech()](#synthesize_speech)
  - [list_default_voices()](#list_default_voices)
  - [VibeVoiceStreamingTTS](#vibevoicestreamingtts)
  - [AudioPlayer](#audioplayer)

---

## Quick Start

```python
from vibevoice import synthesize_speech

# Simplest usage
synthesize_speech("Hello world!")

# With an explicit device
synthesize_speech(text="Hello world!", device="cuda")
```

---

## API Reference

### synthesize_speech()

One-line function for text-to-speech.

```python
synthesize_speech(
    text: str | Iterator[str],
    device: str = "cuda",
    output_file: str = None,
    voice_prompt_path: str = None,
    inference_steps: int = 5,
    cfg_scale: float = 1.5,
    **kwargs
)
```

**Key Parameters:**

- `text` - A string, or an iterator of strings
- `device` - `"cuda"`, `"mps"`, or `"cpu"`
- `output_file` - Path to save a WAV file (optional)
- `inference_steps` - 5 (fast) to 50 (higher quality)
- `cfg_scale` - 1.0-2.0 (higher trades speed for quality)

**Examples:**

```python
# Basic
synthesize_speech(text="Hello", device="cuda")

# Iterator (e.g. LLM streaming)
synthesize_speech(text=["Hello", "world"], device="cuda")

# Save to file
synthesize_speech(text="Hello", device="cuda", output_file="out.wav")

# Custom voice
synthesize_speech(
    text="Hello",
    device="cuda",
    voice_prompt_path="voices/custom.pt"
)

# High quality
synthesize_speech(text="Hello", device="cuda", inference_steps=50, cfg_scale=2.0)
```
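Because `text` accepts any iterator of strings, output from an LLM can be fed in as it is generated. A minimal sketch — `fake_llm_stream` is a stand-in for a real model's token stream, and the `try`/`except` keeps the sketch runnable even without VibeVoice or a GPU available:

```python
def fake_llm_stream():
    # Stand-in for an LLM's streaming output: yields text chunks as they arrive
    for chunk in ["Hello, ", "this is ", "streamed text."]:
        yield chunk

try:
    from vibevoice import synthesize_speech
    # Each chunk is synthesized as soon as it is yielded
    synthesize_speech(text=fake_llm_stream(), device="cuda")
except Exception:
    pass  # vibevoice (or a CUDA device) unavailable; the generator shape is the point
```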

---

### list_default_voices()

List the available voice presets.

```python
voices = list_default_voices()
# Returns: ['en-Carter_man', 'en-Davis_man', 'en-Emma_woman', ...]
```
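The preset names correspond to `.pt` voice-prompt files; in this PR's example script they live under `demo/voices/streaming_model/`. A hedged sketch of turning a preset name into a `voice_prompt_path` (the directory layout is an assumption taken from `examples/simple_inference.py`):

```python
import os

VOICE_DIR = "demo/voices/streaming_model"  # assumed layout, per examples/simple_inference.py

def voice_path(preset_name):
    # Map a preset name like 'en-Emma_woman' to its .pt voice-prompt file
    return os.path.join(VOICE_DIR, preset_name + ".pt")

path = voice_path("en-Emma_woman")
# e.g. demo/voices/streaming_model/en-Emma_woman.pt
```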

---

### VibeVoiceStreamingTTS

High-level TTS class for advanced usage.

**Constructor:**

```python
tts = VibeVoiceStreamingTTS(
    model_path="microsoft/VibeVoice-Realtime-0.5B",
    device="cuda",
    voice_prompt_path=None,  # Auto-loads default
    inference_steps=5
)
```

**Parameters:**

- `model_path` - HuggingFace model ID
- `device` - `"cuda"`, `"mps"`, or `"cpu"`
- `voice_prompt_path` - Voice file (optional; auto-loads a default if None)
- `inference_steps` - 5-50 (speed vs. quality)

**Methods:**

#### `text_to_speech_streaming(text_iterator, cfg_scale=1.5)`

Generate speech from an iterator of text.

```python
def text_gen():
    yield "Hello world"

audio = tts.text_to_speech_streaming(text_gen())
# Returns: Iterator[np.ndarray]
```

#### `save_audio(audio, output_path)`

Save audio to a WAV file.

```python
import numpy as np

chunks = list(tts.text_to_speech_streaming(text_gen()))
audio = np.concatenate(chunks)
tts.save_audio(audio, "output.wav")
```

---

### AudioPlayer

Audio playback with speaker selection.

**Constructor:**

```python
player = AudioPlayer(device_id=None, sample_rate=24000)
```

**Methods:**

#### `list_devices()` [static]

```python
AudioPlayer.list_devices()
# Shows available speakers
```

#### `play_stream(audio_iterator, realtime=True)`

```python
player.play_stream(audio, realtime=True)   # Streaming playback
player.play_stream(audio, realtime=False)  # Buffered playback
```

---

## Quick Reference

| Function | Purpose |
|----------|---------|
| `synthesize_speech()` | One-line TTS |
| `list_default_voices()` | See available voices |
| `VibeVoiceStreamingTTS` | Advanced TTS class |
| `AudioPlayer` | Audio playback |

**Devices:**
- `"cuda"` - NVIDIA GPU (fastest)
- `"mps"` - Apple Silicon
- `"cpu"` - CPU (slower)
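For scripts that should run anywhere, the device string can be chosen at runtime. A minimal sketch using the device names above (the PyTorch probe is an assumption — any detection scheme that yields one of these strings works, and the fallback keeps it runnable without torch installed):

```python
def pick_device():
    # Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed; CPU still works
    return "cpu"

device = pick_device()
```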

**Quality Settings:**
- Fast: `inference_steps=5`, `cfg_scale=1.5`
- Quality: `inference_steps=50`, `cfg_scale=2.0`

**Default Voices:**
- en-Mike_man, en-Emma_woman, en-Carter_man, en-Davis_man, en-Frank_man, en-Grace_woman, in-Samuel_man

---

## License

See [LICENSE](../LICENSE) for details.
@@ -0,0 +1,66 @@
| """ | ||
| Simple VibeVoice Inference Example | ||
|
|
||
| This script demonstrates basic usage of the VibeVoice Python API. | ||
|
|
||
| Run from VibeVoice root: | ||
| python examples/simple_inference.py | ||
| """ | ||
|
|
||
| from vibevoice import VibeVoiceStreamingTTS, AudioPlayer | ||
|
|
||
|
|
||
| def main(): | ||
| print("="*60) | ||
| print("VibeVoice Simple Inference Example") | ||
| print("="*60) | ||
| print() | ||
|
|
||
| # Configuration | ||
| MODEL_PATH = "microsoft/VibeVoice-Realtime-0.5B" | ||
| VOICE_PROMPT_PATH = "demo/voices/streaming_model/en-Emma_woman.pt" # Optional | ||
| DEVICE = "cuda" # or "cpu" or "mps" | ||
|
|
||
| # Initialize TTS | ||
| print("Initializing VibeVoice...") | ||
| tts = VibeVoiceStreamingTTS( | ||
| model_path=MODEL_PATH, | ||
| voice_prompt_path=VOICE_PROMPT_PATH, | ||
| device=DEVICE, | ||
| inference_steps=5 # Fast inference | ||
| ) | ||
| print() | ||
|
|
||
| # Initialize audio player | ||
| print("Initializing audio player...") | ||
| player = AudioPlayer() | ||
| print() | ||
|
|
||
| # List available devices | ||
| print("Available audio devices:") | ||
| AudioPlayer.list_devices() | ||
| print() | ||
|
|
||
| # Generate text | ||
| def text_generator(): | ||
| """Simple text generator""" | ||
| text = "Hello! This is VibeVoice speaking. I can generate speech in real time." | ||
| for word in text.split(): | ||
| yield word | ||
|
|
||
| # Generate and play | ||
| print("Generating and playing speech...") | ||
| print("Text: 'Hello! This is VibeVoice speaking. I can generate speech in real time.'") | ||
| print() | ||
|
|
||
| audio_stream = tts.text_to_speech_streaming(text_generator()) | ||
| player.play_stream(audio_stream, realtime=True) | ||
|
|
||
| print() | ||
| print("="*60) | ||
| print("Done!") | ||
| print("="*60) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
@@ -1,4 +1,14 @@
# vibevoice/__init__.py

# High-level API
from .inference import (
    VibeVoiceStreamingTTS,
    AudioPlayer,
    synthesize_speech,
    list_default_voices
)

# Low-level API
from vibevoice.modular import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingConfig,
@@ -7,10 +17,24 @@ | |
    VibeVoiceStreamingProcessor,
    VibeVoiceTokenizerProcessor,
)
from .modular.streamer import (
    AudioStreamer,
    AsyncAudioStreamer
)

__all__ = [
    # High-level API
    'VibeVoiceStreamingTTS',
    'AudioPlayer',
    'synthesize_speech',
    'list_default_voices',
    # Low-level API
    'VibeVoiceStreamingForConditionalGenerationInference',
    'VibeVoiceStreamingConfig',
    'VibeVoiceStreamingProcessor',
    'VibeVoiceTokenizerProcessor',
    'AudioStreamer',
    'AsyncAudioStreamer',
]

__version__ = '0.0.1'
Reviewer comment: The docs show `synthesize_speech(..., **kwargs)`, but the actual function signature does not accept `**kwargs`. Either add `**kwargs` (and document what is supported) or update the documented signature to match the implementation.
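One way to reconcile the signature with the docs, sketched with a stub in place of the real class (hypothetical — the actual fix would live in the PR's `inference` module): accept `**kwargs` and forward anything unrecognized to the `VibeVoiceStreamingTTS` constructor.

```python
class _StubTTS:
    """Stand-in for VibeVoiceStreamingTTS, just to show the forwarding."""
    def __init__(self, device="cuda", inference_steps=5, **kwargs):
        self.extra = kwargs  # the constructor decides what to do with extras

def synthesize_speech(text, device="cuda", output_file=None,
                      voice_prompt_path=None, inference_steps=5,
                      cfg_scale=1.5, **kwargs):
    # Forward unrecognized keyword arguments to the TTS constructor
    tts = _StubTTS(device=device, inference_steps=inference_steps, **kwargs)
    return tts  # the real function would synthesize and optionally save

tts = synthesize_speech("Hello", model_path="microsoft/VibeVoice-Realtime-0.5B")
# tts.extra == {'model_path': 'microsoft/VibeVoice-Realtime-0.5B'}
```

With this shape, options like `model_path` flow through without the wrapper having to enumerate every constructor parameter, at the cost of later typo detection for misspelled keywords.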