[Vibevoice-ASR]: OpenAI transcriptions endpoint in vLLM #234

@lucaschr21

Description

Proposal: Native /v1/audio/transcriptions Support for VibeVoice (vLLM)

Hi,

I’d like to ask about the possibility of using the OpenAI-compatible /v1/audio/transcriptions endpoint with VibeVoice ASR, instead of the current chat-completions multimodal path. My goal is to understand what currently prevents this, and whether changes in vLLM and/or VibeVoice would be acceptable to enable native support.

Current Behavior

  • The VibeVoice ASR vLLM plugin currently relies on /v1/chat/completions, sending audio_url plus prompt text and returning structured JSON output (Who/When/What).
  • The OpenAI /v1/audio/transcriptions endpoint, by contrast, expects a file upload and a simple prompt string, and returns a standardized response schema (both request styles are sketched below).
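
For context, the difference between the two paths looks roughly like this from the client side. The model name, server URL, and prompt wording below are placeholders rather than the plugin's actual defaults:

```python
# Rough client-side comparison of the two paths (model name, URL, and
# prompt text are placeholders; the plugin's real prompt may differ).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Current path: multimodal chat completion with an audio_url content part.
chat_response = client.chat.completions.create(
    model="vibevoice-asr",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.wav"}},
            {"type": "text", "text": "Transcribe with speaker labels. Hotwords: vLLM, VibeVoice"},
        ],
    }],
)
print(chat_response.choices[0].message.content)  # custom Who/When/What JSON in the message text

# Desired path: standard transcriptions endpoint with a file upload.
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="vibevoice-asr",  # placeholder model name
        file=f,
        prompt="Hotwords: vLLM, VibeVoice",  # plain prompt string instead of chat messages
    )
print(transcription.text)
```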

What currently blocks /v1/audio/transcriptions?

Based on my investigation, the main limitations appear to be:

  1. Pipeline mismatch
    VibeVoice ASR is integrated as a multimodal chat model, rather than as a native ASR transcription model (i.e., it does not implement the SupportsTranscription interface, so it cannot go through the dedicated transcription pipeline).

  2. Output schema mismatch
    The chat output uses a custom JSON format (“Who/When/What”), while the OpenAI diarized_json contract expects the following (illustrated after this list):

    • speaker, start, end, text, type = transcript.text.segment
    • response type TranscriptionDiarized

  3. Prompt / hotwords handling
    VibeVoice ASR encodes hotwords in the prompt text. The transcriptions endpoint accepts a prompt string, but there is no dedicated “hotwords” parameter.
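
To make the schema gap concrete, the two shapes differ roughly as follows. The Who/When/What keys in the first dict are only an illustration (I have not pinned down the plugin's exact field names); the second dict uses the segment fields from the contract listed above:

```python
# Illustrative only: the exact keys of the VibeVoice "Who/When/What" output
# are assumptions here, not the plugin's documented schema.
vibevoice_style_item = {
    "Who": "Speaker 1",
    "When": "00:00:01.200 - 00:00:04.800",
    "What": "Hello and welcome.",
}

# Target shape per the OpenAI diarized_json contract described above.
openai_diarized_segment = {
    "type": "transcript.text.segment",
    "speaker": "Speaker 1",
    "start": 1.2,
    "end": 4.8,
    "text": "Hello and welcome.",
}
# The full response would then be a TranscriptionDiarized object containing
# a list of such segments plus the overall transcript text.
```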

Potential Changes (Conceptual)

vLLM

  • Expose response_format="diarized_json" in the transcriptions endpoint, aligned with the OpenAI contract (an example request is sketched after this list).
  • Restrict this option to models that explicitly declare diarization support.
  • Support diarized stream events where applicable.
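
For illustration, a client call against this proposed end state might look like the sketch below. This is not current vLLM behavior; the model name is a placeholder and the attribute names on the returned object are assumptions based on the contract fields listed above:

```python
# Hedged sketch of the proposed end state, not current vLLM behavior.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="vibevoice-asr",               # placeholder model name
        file=f,
        response_format="diarized_json",     # the option this proposal would expose
        prompt="Hotwords: vLLM, VibeVoice",  # hotwords folded into the prompt string
    )

# Each segment would carry speaker/start/end/text per the diarized contract
# (attribute names here are assumed from the fields listed above).
for segment in result.segments:
    print(f"[{segment.speaker}] {segment.start:.1f}-{segment.end:.1f}: {segment.text}")
```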

VibeVoice

  • Implement a native ASR interface (SupportsTranscription) so it can run through the transcriptions pipeline.
  • Add a post-processing step to map VibeVoice output into the OpenAI diarized schema (speaker/start/end/text/type); a rough sketch follows below.
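
As a minimal sketch of that mapping step, assuming the Who/When/What output parses into the illustrative shape shown earlier (with "HH:MM:SS.mmm" timestamps), the conversion could look like this; the helper names are hypothetical:

```python
# Hedged sketch of mapping VibeVoice-style output to OpenAI diarized segments.
# Input key names and the timestamp format are assumptions, as noted above.

def _to_seconds(timestamp: str) -> float:
    """Convert an 'HH:MM:SS.mmm' string into seconds."""
    hours, minutes, seconds = timestamp.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def to_diarized_segments(items: list[dict]) -> list[dict]:
    """Map Who/When/What entries to speaker/start/end/text/type segments."""
    segments = []
    for item in items:
        start_str, end_str = item["When"].split("-")
        segments.append({
            "type": "transcript.text.segment",
            "speaker": item["Who"],
            "start": _to_seconds(start_str),
            "end": _to_seconds(end_str),
            "text": item["What"],
        })
    return segments
```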

Potential Benefits

  • Standardized endpoint: tools and clients using /v1/audio/transcriptions would work out of the box.
  • Better compatibility: closer alignment with OpenAI-style SDKs and response schemas.
  • Simpler integration: users would not need to rely on chat prompts or audio data URLs.
  • More consistent diarization via response_format="diarized_json".

I’d like to get your feedback on the following points:

  • Is there any technical or design reason to prefer staying exclusively on /v1/chat/completions for VibeVoice ASR?
  • Would a PR adding native transcription support (including diarized_json) for VibeVoice ASR be something you would consider?
  • Are there any constraints, expectations, or preferred implementation details I should be aware of before exploring this further?

Thanks in advance!
