Proposal: Native /v1/audio/transcriptions Support for VibeVoice (vLLM)
Hi,
I’d like to ask about the possibility of using the OpenAI-compatible /v1/audio/transcriptions endpoint with VibeVoice ASR, instead of the current chat-completions multimodal path. My goal is to understand what currently prevents this, and whether changes in vLLM and/or VibeVoice would be acceptable to enable native support.
Current Behavior
- The VibeVoice ASR vLLM plugin currently relies on `/v1/chat/completions`, sending `audio_url` plus prompt text and returning structured JSON output (Who/When/What).
- The OpenAI `/v1/audio/transcriptions` endpoint, by contrast, expects a file upload and a simple `prompt` string, and returns a standardized response schema (a minimal client-side sketch of both paths follows).
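To make the difference concrete, here is a minimal client-side sketch of both paths, assuming a local OpenAI-compatible vLLM server; the base URL, served model name, audio file, and prompt text are all illustrative, not taken from the plugin's documentation:

```python
import base64
from openai import OpenAI

# Illustrative server URL and served model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("meeting.wav", "rb") as f:
    audio_bytes = f.read()

# Current path: /v1/chat/completions with an audio_url content part plus prompt text.
chat_resp = client.chat.completions.create(
    model="vibevoice-asr",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url",
             "audio_url": {"url": "data:audio/wav;base64,"
                                  + base64.b64encode(audio_bytes).decode()}},
            {"type": "text", "text": "Transcribe and return Who/When/What JSON."},
        ],
    }],
)
print(chat_resp.choices[0].message.content)

# Proposed path: /v1/audio/transcriptions with a file upload and a plain prompt.
with open("meeting.wav", "rb") as f:
    tr_resp = client.audio.transcriptions.create(
        model="vibevoice-asr",
        file=f,
        prompt="vLLM, VibeVoice",  # hotwords folded into the generic prompt
    )
print(tr_resp.text)
```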
What currently blocks /v1/audio/transcriptions?
Based on my investigation, the main limitations appear to be:
- Pipeline mismatch: VibeVoice ASR is integrated as a multimodal chat model, rather than as a native ASR transcription model (i.e., it does not implement the `SupportsTranscription` pipeline).
- Output schema mismatch: the chat output uses a custom JSON format (“Who/When/What”), while the OpenAI `diarized_json` contract expects segments with `speaker`, `start`, `end`, `text`, and `type = "transcript.text.segment"`, wrapped in a `TranscriptionDiarized` response (illustrated below).
- Prompt / hotwords handling: VibeVoice ASR encodes hotwords in the prompt text. While the transcriptions endpoint supports a `prompt`, there is no dedicated “hotwords” parameter.
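To illustrate the schema mismatch, both payloads below are invented for the example: the first follows the “Who/When/What” style described above; the second is one OpenAI-style diarized segment.

```python
# Hypothetical VibeVoice-style chat output (custom "Who/When/What" JSON);
# the exact field names and timestamp format are assumptions for illustration.
vibevoice_output = [
    {"Who": "Speaker 1", "When": "0.00-3.20", "What": "Hello everyone."},
    {"Who": "Speaker 2", "When": "3.20-5.75", "What": "Hi, thanks for joining."},
]

# Target shape: one OpenAI-style diarized segment per utterance.
openai_segment = {
    "type": "transcript.text.segment",
    "speaker": "Speaker 1",
    "start": 0.0,
    "end": 3.2,
    "text": "Hello everyone.",
}
```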
Potential Changes (Conceptual)
vLLM
- Expose `response_format="diarized_json"` in the transcriptions endpoint, aligned with the OpenAI contract (see the request sketch after this list).
- Restrict this option to models that explicitly declare diarization support.
- Support diarized stream events where applicable.
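A possible client-side shape for such a request, sketched as a raw multipart POST since `diarized_json` is not accepted by vLLM today; the URL and model name are placeholders:

```python
import requests

# Hypothetical request once diarized_json is supported; current vLLM would reject it.
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={
            "model": "vibevoice-asr",
            "response_format": "diarized_json",
            "prompt": "vLLM, VibeVoice",
        },
    )
print(resp.json())
```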
VibeVoice
- Implement a native ASR interface (`SupportsTranscription`) so it can run through the transcriptions pipeline.
- Add a post-processing step to map VibeVoice output into the OpenAI diarized schema (`speaker`/`start`/`end`/`text`/`type`); a rough sketch of that mapping follows this list.
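As a very rough sketch of that post-processing step (the “Who/When/What” field names and the “start-end” timestamp format are assumptions based on the description above, not the actual plugin output):

```python
def to_diarized_segments(vibevoice_items: list[dict]) -> list[dict]:
    """Map assumed Who/When/What items to OpenAI-style diarized segments."""
    segments = []
    for item in vibevoice_items:
        # Assumed "When" format: "start-end" in seconds, e.g. "3.20-5.75".
        start_s, end_s = item["When"].split("-")
        segments.append({
            "type": "transcript.text.segment",
            "speaker": item["Who"],
            "start": float(start_s),
            "end": float(end_s),
            "text": item["What"],
        })
    return segments
```

In practice this mapping would live in the serving layer rather than in client code, and the timestamp parsing would need to follow VibeVoice's real output format.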
Potential Benefits
- Standardized endpoint: tools and clients using `/v1/audio/transcriptions` would work out of the box.
- Better compatibility: closer alignment with OpenAI-style SDKs and response schemas.
- Simpler integration: users would not need to rely on chat prompts or audio data URLs.
- More consistent diarization via `response_format="diarized_json"`.
I’d like to get your feedback on the following points:
- Is there any technical or design reason to prefer staying exclusively on `/v1/chat/completions` for VibeVoice ASR?
- Would a PR adding native transcription support (including `diarized_json`) for VibeVoice ASR be something you would consider?
- Are there any constraints, expectations, or preferred implementation details I should be aware of before exploring this further?
Thanks in advance!