Add High-Level Python API with Automatic Voice Loading #159
@@ -0,0 +1,195 @@
# VibeVoice Python Inference Guide

Complete API reference for VibeVoice text-to-speech.

## Table of Contents

- [Quick Start](#quick-start)
- [API Reference](#api-reference)
  - [synthesize_speech()](#synthesize_speech)
  - [list_default_voices()](#list_default_voices)
  - [VibeVoiceStreamingTTS](#vibevoicestreamingtts)
  - [AudioPlayer](#audioplayer)

---

## Quick Start

```python
from vibevoice import synthesize_speech

# Simplest usage
synthesize_speech("Hello world!")

# With an explicit device
synthesize_speech(text="Hello world!", device="cuda")
```

---

## API Reference

### synthesize_speech()

One-line function for text-to-speech.

```python
synthesize_speech(
    text: str | Iterator[str],
    device: str = "cuda",
    output_file: str = None,
    voice_prompt_path: str = None,
    inference_steps: int = 5,
    cfg_scale: float = 1.5,
    **kwargs
)
```

**Key Parameters:**

- `text` - A string, or an iterator of strings
- `device` - `"cuda"`, `"mps"`, or `"cpu"`
- `output_file` - Path to save a WAV file (optional)
- `inference_steps` - 5 (fast) to 50 (higher quality)
- `cfg_scale` - 1.0-2.0 (higher trades speed for quality)

**Examples:**

```python
# Basic
synthesize_speech(text="Hello", device="cuda")

# Iterator (e.g. LLM streaming)
synthesize_speech(text=["Hello", "world"], device="cuda")

# Save to file
synthesize_speech(text="Hello", device="cuda", output_file="out.wav")

# Custom voice
synthesize_speech(
    text="Hello",
    device="cuda",
    voice_prompt_path="voices/custom.pt"
)

# High quality
synthesize_speech(text="Hello", device="cuda", inference_steps=50, cfg_scale=2.0)
```
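Because `text` accepts any iterator of strings, output from an LLM can be fed in as it is generated. A minimal sketch — `fake_llm_stream` is a stand-in for a real model's token stream, and the `try`/`except` keeps the sketch runnable even without VibeVoice or a GPU available:

```python
def fake_llm_stream():
    # Stand-in for an LLM's streaming output: yields text chunks as they arrive
    for chunk in ["Hello, ", "this is ", "streamed text."]:
        yield chunk

try:
    from vibevoice import synthesize_speech
    # Each chunk is synthesized as soon as it is yielded
    synthesize_speech(text=fake_llm_stream(), device="cuda")
except Exception:
    pass  # vibevoice (or a CUDA device) unavailable; the generator shape is the point
```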

---

### list_default_voices()

List the available voice presets.

```python
voices = list_default_voices()
# Returns: ['en-Carter_man', 'en-Davis_man', 'en-Emma_woman', ...]
```
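The preset names correspond to `.pt` voice-prompt files; in this PR's example script they live under `demo/voices/streaming_model/`. A hedged sketch of turning a preset name into a `voice_prompt_path` (the directory layout is an assumption taken from `examples/simple_inference.py`):

```python
import os

VOICE_DIR = "demo/voices/streaming_model"  # assumed layout, per examples/simple_inference.py

def voice_path(preset_name):
    # Map a preset name like 'en-Emma_woman' to its .pt voice-prompt file
    return os.path.join(VOICE_DIR, preset_name + ".pt")

path = voice_path("en-Emma_woman")
# e.g. demo/voices/streaming_model/en-Emma_woman.pt
```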

---

### VibeVoiceStreamingTTS

High-level TTS class for advanced usage.

**Constructor:**

```python
tts = VibeVoiceStreamingTTS(
    model_path="microsoft/VibeVoice-Realtime-0.5B",
    device="cuda",
    voice_prompt_path=None,  # Auto-loads default
    inference_steps=5
)
```

**Parameters:**

- `model_path` - HuggingFace model ID
- `device` - `"cuda"`, `"mps"`, or `"cpu"`
- `voice_prompt_path` - Voice file (optional; auto-loads a default if None)
- `inference_steps` - 5-50 (speed vs. quality)

**Methods:**

#### `text_to_speech_streaming(text_iterator, cfg_scale=1.5)`

Generate speech from an iterator of text.

```python
def text_gen():
    yield "Hello world"

audio = tts.text_to_speech_streaming(text_gen())
# Returns: Iterator[np.ndarray]
```

#### `save_audio(audio, output_path)`

Save audio to a WAV file.

```python
import numpy as np

chunks = list(tts.text_to_speech_streaming(text_gen()))
audio = np.concatenate(chunks)
tts.save_audio(audio, "output.wav")
```

---

### AudioPlayer

Audio playback with speaker selection.

**Constructor:**

```python
player = AudioPlayer(device_id=None, sample_rate=24000)
```

**Methods:**

#### `list_devices()` [static]

```python
AudioPlayer.list_devices()
# Shows available speakers
```

#### `play_stream(audio_iterator, realtime=True)`

```python
player.play_stream(audio, realtime=True)   # Streaming playback
player.play_stream(audio, realtime=False)  # Buffered playback
```

---

## Quick Reference

| Function | Purpose |
|----------|---------|
| `synthesize_speech()` | One-line TTS |
| `list_default_voices()` | See available voices |
| `VibeVoiceStreamingTTS` | Advanced TTS class |
| `AudioPlayer` | Audio playback |

**Devices:**
- `"cuda"` - NVIDIA GPU (fastest)
- `"mps"` - Apple Silicon
- `"cpu"` - CPU (slower)
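For scripts that should run anywhere, the device string can be chosen at runtime. A minimal sketch using the device names above (the PyTorch probe is an assumption — any detection scheme that yields one of these strings works, and the fallback keeps it runnable without torch installed):

```python
def pick_device():
    # Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed; CPU still works
    return "cpu"

device = pick_device()
```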

**Quality Settings:**
- Fast: `inference_steps=5`, `cfg_scale=1.5`
- Quality: `inference_steps=50`, `cfg_scale=2.0`

**Default Voices:**
- en-Mike_man, en-Emma_woman, en-Carter_man, en-Davis_man, en-Frank_man, en-Grace_woman, in-Samuel_man

---

## License

See [LICENSE](../LICENSE) for details.
@@ -0,0 +1,66 @@
| """ | ||
| Simple VibeVoice Inference Example | ||
|
|
||
| This script demonstrates basic usage of the VibeVoice Python API. | ||
|
|
||
| Run from VibeVoice root: | ||
| python examples/simple_inference.py | ||
| """ | ||
|
|
||
| from vibevoice import VibeVoiceStreamingTTS, AudioPlayer | ||
|
|
||
|
|
||
| def main(): | ||
| print("="*60) | ||
| print("VibeVoice Simple Inference Example") | ||
| print("="*60) | ||
| print() | ||
|
|
||
| # Configuration | ||
| MODEL_PATH = "microsoft/VibeVoice-Realtime-0.5B" | ||
| VOICE_PROMPT_PATH = "demo/voices/streaming_model/en-Emma_woman.pt" # Optional | ||
| DEVICE = "cuda" # or "cpu" or "mps" | ||
|
|
||
| # Initialize TTS | ||
| print("Initializing VibeVoice...") | ||
| tts = VibeVoiceStreamingTTS( | ||
| model_path=MODEL_PATH, | ||
| voice_prompt_path=VOICE_PROMPT_PATH, | ||
| device=DEVICE, | ||
| inference_steps=5 # Fast inference | ||
| ) | ||
| print() | ||
|
|
||
| # Initialize audio player | ||
| print("Initializing audio player...") | ||
| player = AudioPlayer() | ||
| print() | ||
|
|
||
| # List available devices | ||
| print("Available audio devices:") | ||
| AudioPlayer.list_devices() | ||
| print() | ||
|
|
||
| # Generate text | ||
| def text_generator(): | ||
| """Simple text generator""" | ||
| text = "Hello! This is VibeVoice speaking. I can generate speech in real time." | ||
| for word in text.split(): | ||
| yield word | ||
|
|
||
| # Generate and play | ||
| print("Generating and playing speech...") | ||
| print("Text: 'Hello! This is VibeVoice speaking. I can generate speech in real time.'") | ||
| print() | ||
|
|
||
| audio_stream = tts.text_to_speech_streaming(text_generator()) | ||
| player.play_stream(audio_stream, realtime=True) | ||
|
|
||
| print() | ||
| print("="*60) | ||
| print("Done!") | ||
| print("="*60) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
@@ -1,4 +1,14 @@
# vibevoice/__init__.py

# High-level API
from .inference import (
    VibeVoiceStreamingTTS,
    AudioPlayer,
    synthesize_speech,
    list_default_voices
)

# Low-level API
from vibevoice.modular import (
    VibeVoiceStreamingForConditionalGenerationInference,
    VibeVoiceStreamingConfig,
@@ -7,10 +17,24 @@ | |
    VibeVoiceStreamingProcessor,
    VibeVoiceTokenizerProcessor,
)
from .modular.streamer import (
    AudioStreamer,
    AsyncAudioStreamer
)

__all__ = [
    # High-level API
    'VibeVoiceStreamingTTS',
    'AudioPlayer',
    'synthesize_speech',
    'list_default_voices',
    # Low-level API
    'VibeVoiceStreamingForConditionalGenerationInference',
    'VibeVoiceStreamingConfig',
    'VibeVoiceStreamingProcessor',
    'VibeVoiceTokenizerProcessor',
    'AudioStreamer',
    'AsyncAudioStreamer',
]

__version__ = '0.0.1'
Reviewer comment: The docs show `synthesize_speech(..., **kwargs)`, but the actual function signature does not accept `**kwargs`. Either add `**kwargs` (and document what is supported) or update the documented signature to match the implementation.
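One way to reconcile the signature with the docs, sketched with a stub in place of the real class (hypothetical — the actual fix would live in the PR's `inference` module): accept `**kwargs` and forward anything unrecognized to the `VibeVoiceStreamingTTS` constructor.

```python
class _StubTTS:
    """Stand-in for VibeVoiceStreamingTTS, just to show the forwarding."""
    def __init__(self, device="cuda", inference_steps=5, **kwargs):
        self.extra = kwargs  # the constructor decides what to do with extras

def synthesize_speech(text, device="cuda", output_file=None,
                      voice_prompt_path=None, inference_steps=5,
                      cfg_scale=1.5, **kwargs):
    # Forward unrecognized keyword arguments to the TTS constructor
    tts = _StubTTS(device=device, inference_steps=inference_steps, **kwargs)
    return tts  # the real function would synthesize and optionally save

tts = synthesize_speech("Hello", model_path="microsoft/VibeVoice-Realtime-0.5B")
# tts.extra == {'model_path': 'microsoft/VibeVoice-Realtime-0.5B'}
```

With this shape, options like `model_path` flow through without the wrapper having to enumerate every constructor parameter, at the cost of later typo detection for misspelled keywords.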