-
Notifications
You must be signed in to change notification settings - Fork 10
Shrey/modelquality #353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Shrey/modelquality #353
Conversation
| if delta.content: | ||
| stream_content_parts.append(delta.content) | ||
| if delta.tool_calls: | ||
| stream_tool_calls = delta.tool_calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Streaming tool calls overwritten instead of accumulated
In test_streaming_output_consistency, the streaming tool calls handling overwrites stream_tool_calls on each chunk with stream_tool_calls = delta.tool_calls instead of accumulating deltas. OpenAI's streaming API returns tool calls as incremental deltas that need to be merged by index across multiple chunks. The current code only preserves the last chunk's tool call data, causing the tool call comparison at line 2162 to compare incomplete data against the non-streaming response, potentially producing false positives or false negatives.
name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''
Description
Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
Implements # (issue)
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Test Configuration:
Checklist:
black .,isort .,flake8 .)Screenshots (if applicable)
If applicable, add screenshots to help showcase your changes.
Additional context
Add any other context about the PR here.
Note
Add comprehensive streaming compliance benchmark and workflow, and enhance rollout/models to record finish_reason/tool_call_count and reasoning/tool-call data.
eval_protocol/benchmarks/test_glm_streaming_compliance.pywith streaming and non-streaming tests for:/.github/workflows/streaming_compliance.ymlto run the benchmark (configurable inputs) and upload JSON artifacts.eval_protocol/pytest/default_single_turn_rollout_process.pyto:reasoning_effortviaextra_body, disable cache per request.reasoning_contentand normalizedtool_callsto assistantMessage.row.execution_metadata.finish_reasonandtool_call_count.ExecutionMetadatawithfinish_reasonandtool_call_countfields.Written by Cursor Bugbot for commit 688e87f. This will update automatically on new commits. Configure here.