
[Blog] Add blog about streaming #158

Draft
patrickvonplaten wants to merge 24 commits into vllm-project:main from patrickvonplaten:add_streaming_input_blog

Conversation

@patrickvonplaten

No description provided.


Add a diagram for explaining the key streaming semantics

Signed-off-by: Yu Luo <ErickLuo90@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
- **Live transcription services** need to display text as speech is recognized
- **Robotics and embodied AI** need to process continuous sensor streams (cameras, microphones, LIDAR) and generate control actions with minimal delay to interact safely with the physical world

For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived.
Member

There could also be use cases where we are not generating output before all of the input has arrived, but where it's still beneficial to stream it as it becomes available so that we can process it early and reduce the eventual TTFT.

This case could be applicable to any long-context model and does not require any special architecture (since it's essentially chunked prefill where the chunking is external).
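To illustrate the point in this comment with a toy calculation (all numbers below are made up for illustration): if a long input takes several seconds to arrive and prefill is chunked externally, only the final chunk's prefill remains on the critical path once the input completes, so TTFT shrinks accordingly.

```python
# Toy timing model for the externally-chunked prefill idea above.
# All numbers are illustrative; real TTFT depends on model, hardware, and load.

input_arrival_s = 6.0     # time for the full long-context input to arrive
full_prefill_s = 2.0      # time to prefill the entire prompt in one shot
num_chunks = 6
chunk_prefill_s = full_prefill_s / num_chunks   # assume prefill cost splits evenly

# Batch paradigm: wait for all input, then prefill everything at once.
ttft_batch = input_arrival_s + full_prefill_s

# Streaming input: prefill each chunk as it arrives, overlapping with arrival;
# only the last chunk's prefill happens after the input is complete.
ttft_streaming = input_arrival_s + chunk_prefill_s

print(f"batch TTFT     ~ {ttft_batch:.2f}s")      # 8.00s
print(f"streaming TTFT ~ {ttft_streaming:.2f}s")  # 6.33s
```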

### The Anchor Request Pattern

```
┌─────────────────────────────────────────────────────────────────────────────┐
(remainder of the anchor-request diagram truncated in this excerpt)
```
Author

that's super cool!

Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>

**Why is the last token (Y) discarded?**

In streaming models, the final sampled token for each chunk can typically be considered a special token that signals "I'm done processing this input chunk and am waiting for more." When the next input chunk arrives, this speculative token becomes meaningless: it was a placeholder indicating "waiting for input" that is now superseded by actual new input. The model will generate a fresh continuation based on the new context, so the speculative token is discarded rather than kept in the prompt. Note that this behavior is model-specific and can be changed if needed.
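A rough sketch of that bookkeeping in plain Python (the `generate_for_chunk` callback below is a stand-in for one generation round, not a vLLM API):

```python
# Illustrative sketch of the chunk-boundary handling described above.
# `generate_for_chunk` stands in for one generation round over the current
# context; it is not a real vLLM API, and scheduling/batching is ignored.

def stream_session(chunks, generate_for_chunk):
    context = []   # tokens the model has already consumed
    output = []    # tokens actually streamed back to the client

    for i, chunk in enumerate(chunks):
        context.extend(chunk)
        tokens = generate_for_chunk(context)
        if i < len(chunks) - 1:
            # For every chunk except the last, the final sampled token is the
            # speculative "waiting for more input" placeholder. In the real
            # flow it is discarded once the next chunk arrives; here we drop
            # it up front and never add it back to the prompt.
            tokens = tokens[:-1]
        output.extend(tokens)
        context.extend(tokens)
    return output
```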
Member

Is this accurate? So far we are just generating based on max-tokens, and if this was special stop-token related, stop tokens already aren't included in output anyhow (at least when detokenizing, may need to check the behaviour w.r.t. token ids).

I thought the reason was more related to the KV cache? We want to generate one more token than needed so that we have the KV cache for all tokens. The final token has just been sampled and so doesn't have its own KV cache.


Good point! Yeah, I think there are two angles to that. Just updated a bit for more clarity and more emphasis on the KV cache part.

Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
ywang96 and others added 3 commits February 3, 2026 13:02
Recognize contributors from Mistral AI, Meta, and vLLM Core teams,
as well as prior implementations that inspired this work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reflect that this covers multiple capabilities (streaming input support
and Realtime API), and reframe other implementations as alternatives
rather than inspirations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Yu Luo <ErickLuo90@gmail.com>

For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived.

## Architectural Requirements for Streaming


Should this just be "Requirements for Streaming"? The second sub-section is on training rather than architecture?

Internally, vLLM handles streaming input by treating each chunk as a separate request with a cumulative prompt. As new chunks arrive, the engine:

1. Extends the prompt with the new content
2. Reuses cached KV values for the prefix (via prefix caching)
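In plain Python terms (a conceptual sketch only, with a hypothetical `prefill` callback rather than vLLM internals), the per-chunk work is limited to the newly appended suffix, because KV entries for the existing prefix are already held in memory for the session:

```python
# Conceptual sketch: each new chunk extends the cumulative prompt, and only
# the suffix beyond the cached prefix requires a new forward pass.
# `prefill` is a hypothetical callback, not a vLLM API.

class StreamingSession:
    def __init__(self):
        self.prompt = []   # cumulative prompt tokens for this session
        self.kv_len = 0    # number of tokens whose KV entries are kept in memory

    def add_chunk(self, new_tokens, prefill):
        self.prompt.extend(new_tokens)
        suffix = self.prompt[self.kv_len:]   # only the uncached tail
        prefill(suffix)                      # compute KV entries for the new tokens
        self.kv_len = len(self.prompt)
```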


The KV cache isn't necessarily reused via prefix caching for streaming; rather, it is preserved (kept in memory) for the duration of the streaming session, across its multiple requests.


The last sampled token hasn't been processed as input to the model yet—it was just output from the most recent forward pass. Since the KV cache only contains entries for tokens that have been processed, this token has no KV cache entry. Discarding it is essentially "free": we're not invalidating any cached state, and it would need to be recomputed anyway if we kept it.

*Caveat:* Some models emit special stop tokens that they require in order to continue generation properly. In such cases, the scheduling logic needs to accommodate +1 token to recompute the stop token before processing the new input chunk.


Good caveat to add. We may want to mention that we can accommodate this use case in the future; to do this we would join the last output token with the new prompt of the streaming update.

```
Output stream: D1, C2, D2, E2, C3, D3
```

The key insight is that early output tokens provide immediate feedback to the user, even though they may be revised as more context arrives. This dramatically reduces perceived latency.
Member

> The key insight is

This sounds a lot like an LLM wrote it :) Could we reword it?

@njhill
Member

njhill commented Feb 4, 2026

I just pushed an update with a few corrections/additions:

  • Added example of AsyncLLM.generate() usage with AsyncGenerator
  • Fixed incorrect statement "Each StreamingInput contains the cumulative prompt up to that point." - the StreamingInputs are in fact incremental
  • Added some explanation of the sequencing/termination semantics of using the input async generator
  • Made some clarifications to the example flow
  • Wrote the "Performance Considerations" section with more information about the KV cache implications
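To make the first two bullets concrete, here is a hedged usage sketch. The `StreamingInput` placeholder class, its `prompt` field, and the exact shape of the `generate()` call are assumptions based on this thread rather than a confirmed public API; only the general pattern (an async generator of incremental inputs passed to `AsyncLLM.generate()`) is what the comment describes.

```python
# Hypothetical usage sketch. StreamingInput and the generate() call shape are
# assumptions based on this PR discussion, not a confirmed vLLM API; each
# yielded input is incremental, not cumulative.

import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class StreamingInput:   # placeholder; in vLLM this type would come from the library
    prompt: str


async def input_chunks() -> AsyncGenerator[StreamingInput, None]:
    for text in ["Hello, ", "how are ", "you today?"]:
        yield StreamingInput(prompt=text)   # incremental chunk
        await asyncio.sleep(0.1)            # input arriving over time (e.g. live audio)
    # Exhausting the generator signals that no further input will arrive.


async def run(llm, sampling_params):
    async for output in llm.generate(
        input_chunks(),                     # async generator instead of a fixed prompt
        sampling_params,
        request_id="streaming-session-1",
    ):
        print(output.outputs[0].text, end="", flush=True)
```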


The attention mechanism determines whether a model can process input incrementally or must wait for the entire sequence.

- Causal attention (uni-directional mask) restricts each position t to attend only to tokens at positions ≤ t. Because future tokens are excluded, the model’s output token at time t is final once token t arrives. This makes true streaming possible: each new token can be processed immediately, and earlier outputs never need to be revised.
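As a small, self-contained illustration of this point (not taken from the blog), the PyTorch snippet below builds a causal mask and shows that row t of the attention matrix only has weight on columns ≤ t:

```python
# Minimal illustration (not from the blog): with a causal mask, position t
# attends only to positions <= t, so the output at position t is final as
# soon as token t has arrived and never needs revision.

import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                    # raw attention scores

causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions
attn = torch.softmax(scores, dim=-1)

# Row t has nonzero weight only in columns 0..t; appending a future token
# leaves the attention (and hence the output) for earlier positions unchanged.
print(attn)
```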
Contributor

@tjtanaa tjtanaa Feb 5, 2026

We should also add formatting to the symbols e.g. $t$

3. **AsyncLLM**: Processes streaming input, generates output
4. **Response Stream**: Sends generated tokens back through WebSocket
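As a rough sketch of what the relay around items 3 and 4 might look like (FastAPI is used purely for illustration; the `/stream` route and the `generate_stream` stub are hypothetical and not the blog's actual server code):

```python
# Illustrative only: a minimal WebSocket relay in the spirit of the
# architecture above. The /stream route and generate_stream() stub are
# hypothetical, not the blog's actual server code.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


async def generate_stream(chunk: str):
    # Stub standing in for forwarding the chunk to AsyncLLM and yielding
    # newly generated tokens for this session.
    yield f"[generated for: {chunk}]"


@app.websocket("/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_text()        # incoming input chunk
            async for token in generate_stream(chunk):
                await ws.send_text(token)          # stream tokens back to the client
    except WebSocketDisconnect:
        pass                                       # client ended the session
```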

## Server Setup
Contributor

@tjtanaa tjtanaa Feb 5, 2026

I think we should also have a section called "Quick Start" and link it at the start/intro if the quick start is added at the end of the blog. The blog is very detailed and long, and I think many casual readers like me would like to know how to use the feature in vLLM first before diving into the technical details.
