[Blog] Add blog about streaming #158
patrickvonplaten wants to merge 24 commits into vllm-project:main from
Conversation
Add a diagram for explaining the key streaming semantics Signed-off-by: Yu Luo <ErickLuo90@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Nick Hill <nickhill123@gmail.com>
| - **Live transcription services** need to display text as speech is recognized | ||
| - **Robotics and embodied AI** need to process continuous sensor streams (cameras, microphones, LIDAR) and generate control actions with minimal delay to interact safely with the physical world | ||
|
|
||
| For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived. |
There could also be use cases where we are not generating output before all of the input has arrived, but where it's still beneficial to stream it as it becomes available so that we can process it early and reduce the eventual TTFT.
This case could be applicable to any long-context model and does not require any special architecture (since it's essentially chunked prefill where the chunking is external).
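A rough sketch of that pattern (purely illustrative; `session.append_chunk` and `session.finish` are assumed names, not an actual vLLM client API) would push chunks for prefill as they arrive and only request output once the input is complete:

```python
# Illustrative only: `session.append_chunk` and `session.finish` are
# assumed names, not an actual vLLM client API.

def prefill_then_generate(session, incoming_chunks):
    # Stream the long context in as it becomes available so the server
    # can prefill it incrementally (externally driven chunked prefill).
    for chunk in incoming_chunks:
        session.append_chunk(chunk, generate=False)  # prefill only

    # Only now request generation; most of the prefill work has already
    # been done, so the eventual TTFT is much lower than if the full
    # prompt had been sent in one piece.
    return session.finish(max_tokens=256)
```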
| ### The Anchor Request Pattern | ||
|
|
||
| ``` | ||
| ┌─────────────────────────────────────────────────────────────────────────────┐ |
Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
|
|
||
| **Why is the last token (Y) discarded?** | ||
|
|
||
| In streaming models, the final sampled token for each chunk can typically be considered a special token that signals "I'm done processing this input chunk and waiting for more." When the next input chunk arrives, this speculative token becomes meaningless: it was a placeholder indicating "waiting for input" that is now superseded by actual new input. The model will generate a fresh continuation based on the new context, so the speculative token is discarded rather than kept in the prompt. Note that this behavior is model-specific and can be changed if needed. |
Is this accurate? So far we are just generating based on max-tokens, and if this was special stop-token related, stop tokens already aren't included in output anyhow (at least when detokenizing, may need to check the behaviour w.r.t. token ids).
I thought the reason was more related to the kvcache? We want to generate one more token than needed so that we have the kvcache for all tokens. The final token has just been sampled and so doesn't have its own kvcache.
Good point! Yeah, I think there are two angles to that. Just updated a bit for more clarity and more emphasis on the KV cache part.
Recognize contributors from Mistral AI, Meta, and vLLM Core teams, as well as prior implementations that inspired this work. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reflect that this covers multiple capabilities (streaming input support and Realtime API), and reframe other implementations as alternatives rather than inspirations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Yu Luo <ErickLuo90@gmail.com>
|
|
||
| For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived. | ||
|
|
||
| ## Architectural Requirements for Streaming |
Should this just be "Requirements for Streaming"? The second sub-section is on training, rather than archi?
| Internally, vLLM handles streaming input by treating each chunk as a separate request with a cumulative prompt. As new chunks arrive, the engine: | ||
|
|
||
| 1. Extends the prompt with the new content | ||
| 2. Reuses cached KV values for the prefix (via prefix caching) |
The KV cache isn't necessarily reused via prefix caching for streaming; rather, it is preserved (kept in memory) for the duration of the streaming session across its multiple requests.
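As a conceptual sketch of that flow (simplified Python, not the actual engine code; `engine.generate` and `session_id` are illustrative names), each incoming chunk extends a cumulative prompt and is submitted as a new request while the session's KV cache stays resident:

```python
# Conceptual sketch only -- not the actual vLLM engine code.
# `engine.generate` and `session_id` are illustrative names.

class StreamingSession:
    """Tracks one streaming session across multiple engine requests."""

    def __init__(self, engine, session_id):
        self.engine = engine
        self.session_id = session_id
        self.prompt_ids = []  # cumulative prompt token ids

    def on_new_chunk(self, chunk_token_ids, max_new_tokens):
        # 1. Extend the cumulative prompt with the new content.
        self.prompt_ids.extend(chunk_token_ids)

        # 2. Submit a new request for this session. The KV cache built
        #    for the existing prefix is kept in memory for the duration
        #    of the session, so only the new suffix needs to be prefilled.
        output_ids = self.engine.generate(
            prompt_token_ids=self.prompt_ids,
            max_tokens=max_new_tokens,
            session_id=self.session_id,  # illustrative session handle
        )

        # 3. Append the generated tokens (minus the final speculative one,
        #    which is discussed below) to the cumulative prompt and return
        #    them for streaming back to the client.
        self.prompt_ids.extend(output_ids[:-1])
        return output_ids[:-1]
```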
|
|
||
| The last sampled token hasn't been processed as input to the model yet—it was just output from the most recent forward pass. Since the KV cache only contains entries for tokens that have been processed, this token has no KV cache entry. Discarding it is essentially "free": we're not invalidating any cached state, and it would need to be recomputed anyway if we kept it. | ||
|
|
||
| *Caveat:* Some models emit special stop tokens that they require in order to continue generation properly. In such cases, the scheduling logic needs to accommodate one extra token to recompute the stop token before processing the new input chunk. |
Good caveat to add. We may want to mention that we can accommodate this use case in the future; to do this we would join the last output token with the new prompt of the streaming update.
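To make the two options concrete, here is a minimal sketch (a hypothetical helper, not the actual scheduling code) of how the next cumulative prompt could be assembled with and without retaining the last sampled token:

```python
# Hypothetical sketch of assembling the prompt for the next streaming
# chunk -- not actual vLLM code.

def next_prompt(prev_prompt_ids, last_sampled_id, new_chunk_ids,
                keep_last_token=False):
    """Build the prompt for the next streaming request.

    By default the last sampled token is discarded: it has no KV-cache
    entry yet, so dropping it is free. Models that rely on a special
    stop token to continue correctly can keep it, at the cost of one
    extra token to recompute before the new chunk is processed.
    """
    if keep_last_token:
        return prev_prompt_ids + [last_sampled_id] + new_chunk_ids
    return prev_prompt_ids + new_chunk_ids
```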
| Output stream: D1, C2, D2, E2, C3, D3 | ||
| ``` | ||
|
|
||
| The key insight is that early output tokens provide immediate feedback to the user, even though they may be revised as more context arrives. This dramatically reduces perceived latency. |
The key insight is
This sounds super like an LLM wrote it :) could we reword it?
|
I just pushed an update with a few corrections/additions:
|
|
|
||
| The attention mechanism determines whether a model can process input incrementally or must wait for the entire sequence. | ||
|
|
||
| - Causal attention (uni-directional mask) restricts each position t to attend only to tokens at positions ≤ t. Because future tokens are excluded, the model’s output token at time t is final once token t arrives. This makes true streaming possible: each new token can be processed immediately, and earlier outputs never need to be revised. |
We should also add formatting to the symbols e.g.
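As a standalone illustration of the ≤ t constraint (not code from the blog), a causal mask can be constructed and applied like this:

```python
import numpy as np

# Standalone illustration of a causal (uni-directional) attention mask.
# Position t may only attend to positions <= t, so scores for future
# positions are masked out before the softmax.
T = 4
scores = np.random.randn(T, T)                       # raw attention scores
causal_mask = np.tril(np.ones((T, T), dtype=bool))   # lower-triangular mask
masked = np.where(causal_mask, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row t has non-zero weights only for columns 0..t, so the output at
# position t never depends on future tokens and is final as soon as
# token t has been processed.
print(np.round(weights, 2))
```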
| 3. **AsyncLLM**: Processes streaming input, generates output | ||
| 4. **Response Stream**: Sends generated tokens back through WebSocket | ||
|
|
||
| ## Server Setup |
I think we should also have a section called "quick start" and link it at the start/intro if the quick start is added at the end of the blog. The blog is very detailed and long, and I think many casual readers, like me, would like to know how to use the feature in vLLM first before diving into the technical details.
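A quick start could look roughly like the following sketch; the endpoint path and message schema are assumptions for illustration only, not the actual API added by this PR:

```python
# Hypothetical quick-start sketch. The endpoint path and the message
# schema are assumptions for illustration, not the actual vLLM API.
import asyncio
import json

import websockets  # pip install websockets


async def stream_chunks():
    # Assumed WebSocket endpoint exposed by the streaming-enabled server.
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        # Send input chunks as they become available.
        for chunk in ["The quick", " brown fox", " jumps over"]:
            await ws.send(json.dumps({"type": "input_chunk", "text": chunk}))
        await ws.send(json.dumps({"type": "input_done"}))

        # Print output tokens as they stream back.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "output_token":
                print(event["text"], end="", flush=True)
            elif event.get("type") == "done":
                break


asyncio.run(stream_chunks())
```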