fix(llm): support Responses streaming via on_token #1761
Conversation
Enable streaming for LLM.responses() when requested (stream or LLM.stream). When LiteLLM returns a Responses streaming iterator, drain it, forward best-effort text deltas to on_token (as ModelResponseStream chunks), and return the completed ResponsesAPIResponse.
Avoid getattr() by relying on LiteLLM's ResponsesAPIStreamingResponse event types (e.g., OutputTextDeltaEvent) and ResponseCompletedEvent.
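As a rough illustration, here is a minimal sketch of that drain loop. The import paths and the event attributes (delta, response) are assumptions based on LiteLLM's typed Responses events and may differ between LiteLLM versions:

from litellm.types.llms.openai import (
    OutputTextDeltaEvent,
    ResponseCompletedEvent,
    ResponsesAPIResponse,
)
from litellm.types.utils import Delta, ModelResponseStream, StreamingChoices


def drain_responses_stream(stream_iter, on_token) -> ResponsesAPIResponse:
    # Drain the Responses streaming iterator, forwarding each text delta to
    # on_token as a ModelResponseStream chunk, and return the completed response.
    completed = None
    for event in stream_iter:
        if isinstance(event, OutputTextDeltaEvent) and event.delta:
            on_token(
                ModelResponseStream(
                    choices=[StreamingChoices(delta=Delta(content=event.delta))]
                )
            )
        elif isinstance(event, ResponseCompletedEvent):
            completed = event.response
    if completed is None:
        raise RuntimeError("Responses stream ended without a completed response")
    return completed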
all-hands-bot left a comment
The streaming implementation looks solid overall! I found three areas worth discussing: a redundant condition check, handling of non-streaming responses when streaming is requested, and error handling for the callback. Details in inline comments below.
on_token(
    ModelResponseStream(
        choices=[
            StreamingChoices(
                delta=Delta(content=delta)
            )
        ]
    )
)
🟡 Suggestion: The on_token callback is invoked without error handling. If the callback raises an exception, the entire streaming response would fail and the completed response would be lost.
Consider wrapping the on_token call in a try-except block to make the system more robust:
try:
    on_token(
        ModelResponseStream(
            choices=[
                StreamingChoices(
                    delta=Delta(content=delta)
                )
            ]
        )
    )
except Exception as e:
    # Log the error but don't fail the entire request
    logger.warning(f"on_token callback failed: {e}")

This way, a faulty callback won't prevent the agent from receiving the completed response.
Good point. For now I’m intentionally not swallowing on_token callback exceptions so behavior matches the existing Chat Completions streaming path (where callback exceptions currently propagate).
If we want “best-effort streaming” (callback failures don’t abort the request), I’d prefer to implement that consistently for both LLM.completion() and LLM.responses() (and log a warning / rate-limit), rather than making Responses special-cased.
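For reference, one hypothetical shape for such a shared helper (not part of this PR, just a sketch of the consistent approach) that both streaming paths could reuse:

import logging

logger = logging.getLogger(__name__)


def safe_on_token(on_token):
    # Wrap a user-supplied token callback so that a failure inside it is
    # logged instead of aborting the whole request.
    def wrapped(chunk):
        try:
            on_token(chunk)
        except Exception as e:
            logger.warning("on_token callback failed: %s", e)

    return wrapped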
Avoid per-call imports in the Responses streaming path; rely on LiteLLM's typed event classes.
In subscription mode, Responses stream=True can be forced by options even when the caller didn’t request streaming, so keep gating on user_enable_streaming but avoid redundant checks inside the loop.
If streaming was requested but LiteLLM returns a non-streaming ResponsesAPIResponse, emit a warning so behavior is explicit.
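A rough sketch of that gating and fallback check is below; user_enable_streaming and the import path are placeholders and assumptions for whatever the surrounding LLM.responses() code actually uses:

import logging

from litellm.types.llms.openai import ResponsesAPIResponse

logger = logging.getLogger(__name__)


def is_completed_response(result, user_enable_streaming: bool) -> bool:
    # Return True when LiteLLM already handed back a completed (non-streaming)
    # response; warn if the caller asked for streaming, since that downgrade
    # would otherwise be silent.
    if not isinstance(result, ResponsesAPIResponse):
        return False  # streaming iterator: handled by the drain loop
    if user_enable_streaming:
        logger.warning(
            "stream=True was requested but LiteLLM returned a non-streaming "
            "ResponsesAPIResponse; on_token will not be called."
        )
    return True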
Enable LLM stream mode and wire a simple token callback so users can see incremental output when using ChatGPT subscription login.
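For illustration, the token callback in that example can be as small as printing each delta. This sketch assumes on_token receives ModelResponseStream chunks as described above; see examples/01_standalone_sdk/34_subscription_login.py for the actual example:

from litellm.types.utils import ModelResponseStream


def print_token(chunk: ModelResponseStream) -> None:
    # Print each text delta as it arrives so subscription-mode output
    # appears incrementally.
    for choice in chunk.choices:
        content = choice.delta.content
        if content:
            print(content, end="", flush=True)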
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
Merged fef7d02 into feat/openai-subscription-auth
Follow-up to #1682.
- Allow LLM.responses() to run with stream=true when requested.
- Forward text deltas to on_token and return the completed ResponsesAPIResponse so agent execution can continue.

Tested locally by running examples/01_standalone_sdk/34_subscription_login.py in subscription mode.