Shrey/modelquality #353

shreymodi1 · 2025-12-02T00:51:15Z

name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''

Description

Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)
Implements # (issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update
Refactoring/Code cleanup
Build/CI/CD related changes
Other (please describe):

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Test A
Test B

Test Configuration:

Firmware version:
Hardware:
Toolchain:
SDK:

Checklist:

My code follows the style guidelines of this project (ran black ., isort ., flake8 .)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Screenshots (if applicable)

If applicable, add screenshots to help showcase your changes.

Additional context

Add any other context about the PR here.

Note

Add comprehensive streaming compliance benchmark and workflow, and enhance rollout/models to record finish_reason/tool_call_count and reasoning/tool-call data.

Benchmarks:
- Add eval_protocol/benchmarks/test_glm_streaming_compliance.py with streaming and non-streaming tests for:
  - Structured JSON output, tool-call correctness, multi-tool calls, argument/type validation, and JSON preservation.
  - Reasoning effort on/off behavior and tools+reasoning combinations.
  - Streaming vs non-streaming output consistency.
- Include helpers for tool-call normalization, JSON parsing, XML/forbidden tag and reasoning-leakage checks.
CI:
- Add GitHub Actions workflow /.github/workflows/streaming_compliance.yml to run the benchmark (configurable inputs) and upload JSON artifacts.
Runtime/Eval Engine:
- Update eval_protocol/pytest/default_single_turn_rollout_process.py to:
  - Forward reasoning_effort via extra_body, disable cache per request.
  - Extract and attach reasoning_content and normalized tool_calls to assistant Message.
  - Populate row.execution_metadata.finish_reason and tool_call_count.
Models:
- Extend ExecutionMetadata with finish_reason and tool_call_count fields.

^{Written by Cursor Bugbot for commit 688e87f. This will update automatically on new commits. Configure here.}

cursor · 2025-12-02T01:16:41Z

eval_protocol/benchmarks/test_glm_streaming_compliance.py

+                    if delta.content:
+                        stream_content_parts.append(delta.content)
+                    if delta.tool_calls:
+                        stream_tool_calls = delta.tool_calls


Bug: Streaming tool calls overwritten instead of accumulated

In test_streaming_output_consistency, the streaming tool calls handling overwrites stream_tool_calls on each chunk with stream_tool_calls = delta.tool_calls instead of accumulating deltas. OpenAI's streaming API returns tool calls as incremental deltas that need to be merged by index across multiple chunks. The current code only preserves the last chunk's tool call data, causing the tool call comparison at line 2162 to compare incomplete data against the non-streaming response, potentially producing false positives or false negatives.

shreymodi1 added 13 commits November 6, 2025 13:28

added model quality gha

9f6aa7b

fixes

7d6d905

streaming

c699532

fixes

bb743b7

fix

275f992

fix

4315f1e

fix

fcc7e10

fix

a1a2046

fix

cf0ab9d

df

67c2619

streaming ouput

2900b87

changes

406ed5b

yo

688e87f

cursor bot reviewed Dec 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Shrey/modelquality #353

Shrey/modelquality #353

Uh oh!

shreymodi1 commented Dec 2, 2025 •

edited by cursor bot

Loading

Uh oh!

cursor bot Dec 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shrey/modelquality #353

Are you sure you want to change the base?

Shrey/modelquality #353

Uh oh!

Conversation

shreymodi1 commented Dec 2, 2025 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

How Has This Been Tested?

Checklist:

Screenshots (if applicable)

Additional context

Uh oh!

cursor bot Dec 2, 2025

Choose a reason for hiding this comment

Bug: Streaming tool calls overwritten instead of accumulated

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shreymodi1 commented Dec 2, 2025 •

edited by cursor bot

Loading