updated tests #382

shreymodi1 · 2025-12-18T00:45:11Z

Note

Adds feature-flagged reasoning and multi-tool support, introduces tool_choice (none/required) tests, and uses raw_output with conditional reasoning handling across streaming and non-streaming benchmarks.

Feature Flags & Infra:
- Add env-driven flags SUPPORTS_MULTIPLE_TOOL_CALLS and SUPPORTS_REASONING with pytest.mark.skipif gating.
- Replace ChatCompletionContentPartTextParam with ChatCompletionContentPartParam.
- Introduce _maybe_add_reasoning_effort and pass reasoning_effort conditionally; extend passthrough to tool_choice.
- Enable raw_output=True broadly to inspect prompts; add checks against raw_output.prompt_fragments.
Tests Added/Expanded:
- Tool choice: tool_choice=none and tool_choice=required (streaming and non-streaming), asserting that no tool calls are emitted for none, that a tool call is emitted for required, and that the prompt stays clean in both cases.
- Reasoning + structured JSON and multiple tools (both streaming and non-streaming) with conditional reasoning assertions.
- Multi-tool-call tests gated by capability flag for streaming and non-streaming.
Validation/Metrics Adjustments:
- Normalize tool calls via helpers; improve argument validation.
- Conditionally include reasoning metrics/checks only when supported; tighten forbidden/XML tag and leakage checks.
- Minor parameter tweaks (e.g., adding raw_output, refining finish_reason expectations).
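
The env-driven gating described under "Feature Flags & Infra" could look roughly like the sketch below. This is not the code from the PR: the helper name `env_flag`, the default of "enabled", the set of falsy strings, and the variable name `EP_SUPPORTS_MULTIPLE_TOOL_CALLS` are assumptions made for illustration. Only `EP_SUPPORTS_REASONING` is named in the review comment.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean feature flag from the environment.

    Hypothetical helper: the PR may parse its flags differently.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Treat common "off" spellings as False; anything else as True.
    return raw.strip().lower() not in {"", "0", "false", "no", "off"}


# EP_SUPPORTS_REASONING appears in the review comment; the second
# variable name is assumed by analogy.
SUPPORTS_REASONING = env_flag("EP_SUPPORTS_REASONING")
SUPPORTS_MULTIPLE_TOOL_CALLS = env_flag("EP_SUPPORTS_MULTIPLE_TOOL_CALLS")

# In a test module (requires pytest), a capability-gated test would be
# decorated along these lines; the test name is illustrative:
#
#   @pytest.mark.skipif(
#       not SUPPORTS_MULTIPLE_TOOL_CALLS,
#       reason="model does not support multiple tool calls",
#   )
#   def test_streaming_multiple_tool_calls(): ...
```

Evaluating the flags once at import time keeps the `skipif` conditions simple booleans, at the cost of requiring the environment to be set before the test module is collected.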

Written by Cursor Bugbot for commit 982eba2. This will update automatically on new commits.

cursor · 2025-12-18T00:49:35Z

eval_protocol/benchmarks/test_glm_streaming_compliance.py

+            "stream": True,
+            "temperature": 0.0,
+            "max_tokens": DEFAULT_MAX_TOKENS,
+            "reasoning_effort": "none",


Bug: New tool_choice tests hardcode reasoning_effort unconditionally

The new tool_choice tests (test_streaming_tool_choice_none, test_non_streaming_tool_choice_none, test_streaming_tool_choice_required, test_non_streaming_tool_choice_required) hardcode "reasoning_effort": "none" in their completion_params. This contradicts the SUPPORTS_REASONING feature flag design, which states that when EP_SUPPORTS_REASONING=0, the reasoning_effort parameter should NOT be passed at all. These tests should use _maybe_add_reasoning_effort to conditionally include the parameter, matching the pattern used by other tests in this file.

Additional Locations (2)

eval_protocol/benchmarks/test_glm_streaming_compliance.py#L3845-L3846

eval_protocol/benchmarks/test_glm_streaming_compliance.py#L3994-L3995
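
A minimal sketch of the fix the comment asks for, under stated assumptions. The real `_maybe_add_reasoning_effort` in the test file is not shown on this page, so its signature and behavior are guessed here; the sketch uses a different name to make that clear. The `DEFAULT_MAX_TOKENS` value is a placeholder.

```python
from typing import Any, Dict

DEFAULT_MAX_TOKENS = 1024  # placeholder value, not taken from the PR


def maybe_add_reasoning_effort_sketch(
    params: Dict[str, Any],
    effort: str,
    supports_reasoning: bool,
) -> Dict[str, Any]:
    """Return a copy of params, adding reasoning_effort only when supported.

    Assumed stand-in for the file's _maybe_add_reasoning_effort helper.
    """
    updated = dict(params)  # leave the caller's dict untouched
    if supports_reasoning:
        updated["reasoning_effort"] = effort
    return updated


# Base parameters from the flagged hunk, minus the hardcoded key.
base_params = {
    "stream": True,
    "temperature": 0.0,
    "max_tokens": DEFAULT_MAX_TOKENS,
}

# With reasoning unsupported (EP_SUPPORTS_REASONING=0), the key is omitted.
without_reasoning = maybe_add_reasoning_effort_sketch(base_params, "none", False)

# With reasoning supported, the key is passed through as before.
with_reasoning = maybe_add_reasoning_effort_sketch(base_params, "none", True)
```

Applied to the four tool_choice tests, this would replace the literal `"reasoning_effort": "none"` entry in `completion_params` with a call to the helper, so the parameter is never sent when the flag is off.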

updated tests

982eba2

cursor bot reviewed Dec 18, 2025
