[BUG] evals aren't working + are poorly implemented #162
Conversation
```python
    "Check conversation history above for details.",
    len(history),
)
return AgentResult(answer=str(answer), tool_calls=history)
```
Bug: Agent output `None` converts to literal string `"None"`
When retrieving the agent output with `getattr(agent_run, "output", "")`, if the `output` attribute exists but is `None`, the code receives `None` rather than the default empty string. Then at line 202, `str(None)` converts this to the literal string `"None"`. If the agent fails to produce output, the `answer` field would contain `"None"` instead of an empty string, which could cause unexpected behavior in assertions like `answer_contains` that check the response content.
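The pitfall can be reproduced in a few lines. This sketch uses a stand-in object (`FakeRun` is not part of the codebase) to show why `getattr` with a default does not guard against an attribute that exists but is `None`, and one common fix:

```python
class FakeRun:
    """Stand-in for agent_run: the attribute exists but is None."""
    output = None

# getattr's default only applies when the attribute is *missing*,
# so an existing None slips through unchanged...
answer = getattr(FakeRun(), "output", "")
print(str(answer))  # prints "None", not ""

# ...whereas normalizing None explicitly yields the intended empty string:
answer = getattr(FakeRun(), "output", "") or ""
print(str(answer))  # prints ""
```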
Description
Example:
The final result looks like this:
Type of Change
Testing
- `uv run pytest`
- `uv run ruff check .`
- `uv run black --check .`
- `uv run mypy .`

Security Considerations
Breaking Changes
If this is a breaking change, describe what users need to do to migrate:
Additional Notes
Any additional context or screenshots.
Note
Reworks evals to use a PydanticAI agent with tool calling and LLM-based grading, adds model options/config updates, fixes path/language handling, improves CLI output, and expands tests/docs.
- `LLMExecutor` using PydanticAI `Agent`, structured tool calls, retries, and tool arg validation; introduce `ProviderConfig.evaluate_expected_answer` for LLM-based grading of `expected_answer` with optional separate model.
- `ParameterDefinition` with JSON Schema.
- `options` (e.g., `api: responses`, `body:*`, `header:*`), support `expected_answer_model`, suite-level `system_prompt`, and detailed failure formatting.
- `EndpointWithPath`; use `prepare_source_for_execution` and repo-aware path resolution; fix Python file execution (append function name), robust SQL/file handling.
- `type` literal to `"anthropic" | "openai"`; add per-model `options` field; propagate timeouts; docs updated accordingly.
- `options`, `expected_answer`, `expected_answer_model`, `system_prompt`, and Responses API examples; minor copy fixes.
- `pydantic-ai-slim[anthropic,openai]`.

Written by Cursor Bugbot for commit 3aec2be. This will update automatically on new commits.