[Blog] Add blog about streaming #158
patrickvonplaten wants to merge 24 commits into vllm-project:main from
Conversation
Add a diagram for explaining the key streaming semantics Signed-off-by: Yu Luo <ErickLuo90@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Nick Hill <nickhill123@gmail.com>
| - **Live transcription services** need to display text as speech is recognized | ||
| - **Robotics and embodied AI** need to process continuous sensor streams (cameras, microphones, LIDAR) and generate control actions with minimal delay to interact safely with the physical world | ||
|
|
||
| For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived. |
There could also be use cases where we are not generating output before all of the input has arrived, but where it's still beneficial to stream it as it becomes available so that we can process it early and reduce the eventual TTFT.
This case could be applicable to any long-context model and does not require any special architecture (since it's essentially chunked prefill where the chunking is external).
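A rough sketch of that pattern (purely illustrative; `session.append_chunk` and `session.finish` are assumed names, not an actual vLLM client API) would push chunks for prefill as they arrive and only request output once the input is complete:

```python
# Illustrative only: `session.append_chunk` and `session.finish` are
# assumed names, not an actual vLLM client API.

def prefill_then_generate(session, incoming_chunks):
    # Stream the long context in as it becomes available so the server
    # can prefill it incrementally (externally driven chunked prefill).
    for chunk in incoming_chunks:
        session.append_chunk(chunk, generate=False)  # prefill only

    # Only now request generation; most of the prefill work has already
    # been done, so the eventual TTFT is much lower than if the full
    # prompt had been sent in one piece.
    return session.finish(max_tokens=256)
```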
| ### The Anchor Request Pattern | ||
|
|
||
| ``` | ||
| ┌─────────────────────────────────────────────────────────────────────────────┐ |
Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
|
|
||
| **Why is the last token (Y) discarded?** | ||
|
|
||
| In streaming models, the final sampled token for each chunk can typically be considered a special token that signals "I'm done processing this input chunk and waiting for more." When the next input chunk arrives, this speculative token becomes meaningless: it was a placeholder indicating "waiting for input" that is now superseded by actual new input. The model will generate a fresh continuation based on the new context, so the speculative token is discarded rather than kept in the prompt. Note that this behavior is model-specific and can be changed if needed. |
Is this accurate? So far we are just generating based on max-tokens, and if this was special stop-token related, stop tokens already aren't included in output anyhow (at least when detokenizing, may need to check the behaviour w.r.t. token ids).
I thought the reason was more related to the kvcache? We want to generate one more token than needed so that we have the kvcache for all tokens. The final token has just been sampled and so doesn't have its own kvcache.
Good point! Yeah, I think there are two angles to that. Just updated a bit for more clarity and more emphasis on the KV cache part.
Recognize contributors from Mistral AI, Meta, and vLLM Core teams, as well as prior implementations that inspired this work. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reflect that this covers multiple capabilities (streaming input support and Realtime API), and reframe other implementations as alternatives rather than inspirations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Yu Luo <ErickLuo90@gmail.com>
|
|
||
| For these applications, the traditional batch paradigm introduces unacceptable delays. We need infrastructure that can process input incrementally and begin generating output before all input has arrived. | ||
|
|
||
| ## Architectural Requirements for Streaming |
Should this just be "Requirements for Streaming"? The second sub-section is on training, rather than archi?
| Internally, vLLM handles streaming input by treating each chunk as a separate request with a cumulative prompt. As new chunks arrive, the engine: | ||
|
|
||
| 1. Extends the prompt with the new content | ||
| 2. Reuses cached KV values for the prefix (via prefix caching) |
The KV cache isn't necessarily reused via prefix caching for streaming; rather, it is preserved (kept in memory) for the duration of the streaming session across its multiple requests.
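As a conceptual sketch of that flow (simplified Python, not the actual engine code; `engine.generate` and `session_id` are illustrative names), each incoming chunk extends a cumulative prompt and is submitted as a new request while the session's KV cache stays resident:

```python
# Conceptual sketch only -- not the actual vLLM engine code.
# `engine.generate` and `session_id` are illustrative names.

class StreamingSession:
    """Tracks one streaming session across multiple engine requests."""

    def __init__(self, engine, session_id):
        self.engine = engine
        self.session_id = session_id
        self.prompt_ids = []  # cumulative prompt token ids

    def on_new_chunk(self, chunk_token_ids, max_new_tokens):
        # 1. Extend the cumulative prompt with the new content.
        self.prompt_ids.extend(chunk_token_ids)

        # 2. Submit a new request for this session. The KV cache built
        #    for the existing prefix is kept in memory for the duration
        #    of the session, so only the new suffix needs to be prefilled.
        output_ids = self.engine.generate(
            prompt_token_ids=self.prompt_ids,
            max_tokens=max_new_tokens,
            session_id=self.session_id,  # illustrative session handle
        )

        # 3. Append the generated tokens (minus the final speculative one,
        #    which is discussed below) to the cumulative prompt and return
        #    them for streaming back to the client.
        self.prompt_ids.extend(output_ids[:-1])
        return output_ids[:-1]
```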
|
|
||
| The last sampled token hasn't been processed as input to the model yet—it was just output from the most recent forward pass. Since the KV cache only contains entries for tokens that have been processed, this token has no KV cache entry. Discarding it is essentially "free": we're not invalidating any cached state, and it would need to be recomputed anyway if we kept it. | ||
|
|
||
| *Caveat:* Some models emit special stop tokens that they require in order to continue generation properly. In such cases, the scheduling logic needs to accommodate one extra token to recompute the stop token before processing the new input chunk. |
Good caveat to add. We may want to mention that we can accommodate this use case in the future; to do this we would join the last output token with the new prompt of the streaming update.
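To make the two options concrete, here is a minimal sketch (a hypothetical helper, not the actual scheduling code) of how the next cumulative prompt could be assembled with and without retaining the last sampled token:

```python
# Hypothetical sketch of assembling the prompt for the next streaming
# chunk -- not actual vLLM code.

def next_prompt(prev_prompt_ids, last_sampled_id, new_chunk_ids,
                keep_last_token=False):
    """Build the prompt for the next streaming request.

    By default the last sampled token is discarded: it has no KV-cache
    entry yet, so dropping it is free. Models that rely on a special
    stop token to continue correctly can keep it, at the cost of one
    extra token to recompute before the new chunk is processed.
    """
    if keep_last_token:
        return prev_prompt_ids + [last_sampled_id] + new_chunk_ids
    return prev_prompt_ids + new_chunk_ids
```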
| Output stream: D1, C2, D2, E2, C3, D3 | ||
| ``` | ||
|
|
||
| The key insight is that early output tokens provide immediate feedback to the user, even though they may be revised as more context arrives. This dramatically reduces perceived latency. |
The key insight is
This sounds super like an LLM wrote it :) could we reword it?
|
I just pushed an update with a few corrections/additions:
|
|
|
||
| The attention mechanism determines whether a model can process input incrementally or must wait for the entire sequence. | ||
|
|
||
| - Causal attention (uni-directional mask) restricts each position t to attend only to tokens at positions ≤ t. Because future tokens are excluded, the model’s output token at time t is final once token t arrives. This makes true streaming possible: each new token can be processed immediately, and earlier outputs never need to be revised. |
We should also add formatting to the symbols e.g.
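As a standalone illustration of the ≤ t constraint (not code from the blog), a causal mask can be constructed and applied like this:

```python
import numpy as np

# Standalone illustration of a causal (uni-directional) attention mask.
# Position t may only attend to positions <= t, so scores for future
# positions are masked out before the softmax.
T = 4
scores = np.random.randn(T, T)                       # raw attention scores
causal_mask = np.tril(np.ones((T, T), dtype=bool))   # lower-triangular mask
masked = np.where(causal_mask, scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row t has non-zero weights only for columns 0..t, so the output at
# position t never depends on future tokens and is final as soon as
# token t has been processed.
print(np.round(weights, 2))
```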
| 3. **AsyncLLM**: Processes streaming input, generates output | ||
| 4. **Response Stream**: Sends generated tokens back through WebSocket | ||
|
|
||
| ## Server Setup |
I think we should also have a section called "quick start" and link it at the start/intro if the quick start is added at the end of the blog. The blog is very detailed and long, and I think many casual readers, like me, would like to know how to use the feature in vLLM first before diving into the technical details.
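A quick start could look roughly like the following sketch; the endpoint path and message schema are assumptions for illustration only, not the actual API added by this PR:

```python
# Hypothetical quick-start sketch. The endpoint path and the message
# schema are assumptions for illustration, not the actual vLLM API.
import asyncio
import json

import websockets  # pip install websockets


async def stream_chunks():
    # Assumed WebSocket endpoint exposed by the streaming-enabled server.
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        # Send input chunks as they become available.
        for chunk in ["The quick", " brown fox", " jumps over"]:
            await ws.send(json.dumps({"type": "input_chunk", "text": chunk}))
        await ws.send(json.dumps({"type": "input_done"}))

        # Print output tokens as they stream back.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "output_token":
                print(event["text"], end="", flush=True)
            elif event.get("type") == "done":
                break


asyncio.run(stream_chunks())
```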