[Vibevoice-ASR]: OpenAI transcriptions endpoint in vLLM #234

@lucaschr21

Description

Proposal: Native /v1/audio/transcriptions Support for VibeVoice (vLLM)

Hi,

I’d like to ask about the possibility of using the OpenAI-compatible /v1/audio/transcriptions endpoint with VibeVoice ASR, instead of the current chat-completions multimodal path. My goal is to understand what currently prevents this, and whether changes in vLLM and/or VibeVoice would be acceptable to enable native support.

Current Behavior

  • The VibeVoice ASR vLLM plugin currently relies on /v1/chat/completions, sending audio_url plus prompt text and returning structured JSON output (Who/When/What).
  • The OpenAI /v1/audio/transcriptions endpoint, by contrast, expects a file upload and a simple prompt string, and returns a standardized response schema (both request styles are sketched below).
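
For context, the difference between the two paths looks roughly like this from the client side. The model name, server URL, and prompt wording below are placeholders rather than the plugin's actual defaults:

```python
# Rough client-side comparison of the two paths (model name, URL, and
# prompt text are placeholders; the plugin's real prompt may differ).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Current path: multimodal chat completion with an audio_url content part.
chat_response = client.chat.completions.create(
    model="vibevoice-asr",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}},
            {"type": "text", "text": "Transcribe with speaker labels. Hotwords: vLLM, VibeVoice"},
        ],
    }],
)
print(chat_response.choices[0].message.content)  # custom Who/When/What JSON in the message text

# Desired path: standard transcriptions endpoint with a file upload.
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="vibevoice-asr",  # placeholder model name
        file=f,
        prompt="Hotwords: vLLM, VibeVoice",  # plain prompt string instead of chat messages
    )
print(transcription.text)
```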

What currently blocks /v1/audio/transcriptions?

Based on my investigation, the main limitations appear to be:

  1. Pipeline mismatch
    VibeVoice ASR is integrated as a multimodal chat model, rather than as a native ASR transcription model (i.e., it does not implement the SupportsTranscription interface, so it cannot go through the dedicated transcription pipeline).

  2. Output schema mismatch
    The chat output uses a custom JSON format (“Who/When/What”), while the OpenAI diarized_json contract expects the following (illustrated after this list):

    • speaker, start, end, text, type = transcript.text.segment
    • response type TranscriptionDiarized

  3. Prompt / hotwords handling
    VibeVoice ASR encodes hotwords in the prompt text. The transcriptions endpoint accepts a prompt string, but there is no dedicated “hotwords” parameter.
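
To make the schema gap concrete, the two shapes differ roughly as follows. The Who/When/What keys in the first dict are only an illustration (I have not pinned down the plugin's exact field names); the second dict uses the segment fields from the contract listed above:

```python
# Illustrative only: the exact keys of the VibeVoice "Who/When/What" output
# are assumptions here, not the plugin's documented schema.
vibevoice_style_item = {
    "Who": "Speaker 1",
    "When": "00:00:01.200 - 00:00:04.800",
    "What": "Hello and welcome.",
}

# Target shape per the OpenAI diarized_json contract described above.
openai_diarized_segment = {
    "type": "transcript.text.segment",
    "speaker": "Speaker 1",
    "start": 1.2,
    "end": 4.8,
    "text": "Hello and welcome.",
}
# The full response would then be a TranscriptionDiarized object containing
# a list of such segments plus the overall transcript text.
```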

Potential Changes (Conceptual)

vLLM

  • Expose response_format="diarized_json" in the transcriptions endpoint, aligned with the OpenAI contract (an example request is sketched after this list).
  • Restrict this option to models that explicitly declare diarization support.
  • Support diarized stream events where applicable.
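
For illustration, a client call against this proposed end state might look like the sketch below. This is not current vLLM behavior; the model name is a placeholder and the attribute names on the returned object are assumptions based on the contract fields listed above:

```python
# Hedged sketch of the proposed end state, not current vLLM behavior.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="vibevoice-asr",               # placeholder model name
        file=f,
        response_format="diarized_json",     # the option this proposal would expose
        prompt="Hotwords: vLLM, VibeVoice",  # hotwords folded into the prompt string
    )

# Each segment would carry speaker/start/end/text per the diarized contract
# (attribute names here are assumed from the fields listed above).
for segment in result.segments:
    print(f"[{segment.speaker}] {segment.start:.1f}-{segment.end:.1f}: {segment.text}")
```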

VibeVoice

  • Implement a native ASR interface (SupportsTranscription) so it can run through the transcriptions pipeline.
  • Add a post-processing step to map VibeVoice output into the OpenAI diarized schema (speaker/start/end/text/type); a rough sketch follows below.
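
As a minimal sketch of that mapping step, assuming the Who/When/What output parses into the illustrative shape shown earlier (with "HH:MM:SS.mmm" timestamps), the conversion could look like this; the helper names are hypothetical:

```python
# Hedged sketch of mapping VibeVoice-style output to OpenAI diarized segments.
# Input key names and the timestamp format are assumptions, as noted above.

def _to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' string into seconds."""
    hours, minutes, seconds = timestamp.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def to_diarized_segments(items: list[dict]) -> list[dict]:
    """Map Who/When/What entries to speaker/start/end/text/type segments."""
    segments = []
    for item in items:
        start_str, end_str = item["When"].split("-")
        segments.append({
            "type": "transcript.text.segment",
            "speaker": item["Who"],
            "start": _to_seconds(start_str),
            "end": _to_seconds(end_str),
            "text": item["What"],
        })
    return segments
```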

Potential Benefits

  • Standardized endpoint: tools and clients using /v1/audio/transcriptions would work out of the box.
  • Better compatibility: closer alignment with OpenAI-style SDKs and response schemas.
  • Simpler integration: users would not need to rely on chat prompts or audio data URLs.
  • More consistent diarization via response_format="diarized_json".

I’d like to get your feedback on the following points:

  • Is there any technical or design reason to prefer staying exclusively on /v1/chat/completions for VibeVoice ASR?
  • Would a PR adding native transcription support (including diarized_json) for VibeVoice ASR be something you would consider?
  • Are there any constraints, expectations, or preferred implementation details I should be aware of before exploring this further?

Thanks in advance!
