@alexzerntev alexzerntev commented Nov 28, 2025

Description

  • Migrated evals to a PydanticAI agent. For each eval, an agent is spawned with all the tools at its disposal, and it produces an answer based on the tools and the prompt. Since the agent can take some time (depending on the model), progress visualization has been added.
  • Added expected_answer and expected_answer_model to the evals definition. When an expected answer is provided, a "judge" agent is called after the first agent returns its answer, to evaluate whether that answer is semantically the same as expected_answer. The expected_answer_model setting allows using different models, e.g. a slow model for finding the answer and a fast model for "judging" the result; a sketch of this two-agent flow follows this list.
    Example (using a fast grader for expected_answer_model):
mxcp: 1
suite: faq_checks
model: gpt-4o
expected_answer_model: gpt-4o-mini
tests:
  - name: expected_answer_grading
    prompt: "What are your support hours?"
    assertions:
      expected_answer: "Our support team is available Monday to Friday, 9am-5pm local time."
  • Added options to model definitions (e.g. reasoning: "low").
    Example:
models:
  gpt-5:
    type: openai
    options:
      api: responses
      body:
        reasoning:
          effort: medium
      header:custom-feature: foo
  claude-opus:
    type: anthropic
    options:
      body:
        output_config:
          effort: medium
      header:anthropic-beta: effort-2025-11-24
  • Fixed a path bug when calling tools in evals
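
A minimal sketch of the two-agent flow mentioned above (answer agent plus LLM judge), assuming pydantic-ai's Agent / run_sync API; the model IDs, prompts, and the run_eval helper are illustrative, not the actual mxcp executor (which also registers the endpoint tools on the agent, and whose result attribute may be .output or .data depending on the pydantic-ai version):

# Illustrative sketch only, not the mxcp implementation: one agent answers the
# prompt, a second "judge" agent grades that answer against expected_answer.
from pydantic_ai import Agent

answer_agent = Agent(
    "openai:gpt-4o",  # the suite-level `model`
    system_prompt="Answer the question using the available tools.",
)
judge_agent = Agent(
    "openai:gpt-4o-mini",  # the (typically faster) `expected_answer_model`
    system_prompt=(
        "Reply with CORRECT if the candidate answer is semantically equivalent "
        "to the expected answer; otherwise reply with INCORRECT and a short reason."
    ),
)

def run_eval(prompt: str, expected_answer: str) -> bool:
    # First agent produces the answer (tools would be attached in the real executor).
    answer = answer_agent.run_sync(prompt).output
    # Judge agent compares it with the expected answer.
    verdict = judge_agent.run_sync(
        f"Candidate answer: {answer}\nExpected answer: {expected_answer}"
    ).output
    return str(verdict).strip().upper().startswith("CORRECT")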

The final result looks like this:

🧪 Running eval suite: faq_checks
🧪 Running suite 'faq_checks' with 2 tests using model 'gpt-5'...
  ✓ [faq_checks] 1/2 • testing_sequence_of_tools (9.18s)
  ✗ [faq_checks] 2/2 • testing_sequence_of_tools2 (7.27s)


🧪 Eval Execution Summary
   Evaluated 1 suite
   • 1 failed

❌ Failed tests:

  ✗ faq_checks (evals/test.eval.yml)
    ✓ testing_sequence_of_tools (9.18s)
    ✗ testing_sequence_of_tools2 (7.27s)
      💡 LLM Answer: Alice Johnson’s department (Engineering) has $847,000 remaining in its budget.
         Expected: The remaining budget for Alice Johnson's department (Engineering) is $900,000.
         Grade: partially correct
         Comment: The candidate answer states a different remaining budget amount than the expected answer.
         Reasoning: While both answers refer to the budget remaining for Alice Johnson's department in Engineering, the amounts mentioned differ: $847,000 vs $900,000. Thus, the meaning isn't fully aligned.

💡 Tip: Run 'mxcp evals <suite_name>' to see detailed results for a specific suite

⏱️  Total time: 16.48s

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes, no api changes)
  • ⚡ Performance improvement
  • 🧪 Test improvement
  • 🔒 Security fix

Testing

  • Tests pass locally with uv run pytest
  • Linting passes with uv run ruff check .
  • Code formatting passes with uv run black --check .
  • Type checking passes with uv run mypy .
  • Added tests for new functionality (if applicable)
  • Updated documentation (if applicable)

Security Considerations

  • This change does not introduce security vulnerabilities
  • Sensitive data handling reviewed (if applicable)
  • Policy enforcement implications considered (if applicable)

Breaking Changes

If this is a breaking change, describe what users need to do to migrate:

Additional Notes

Any additional context or screenshots.


Note

Reworks evals to use a PydanticAI agent with tool calling and LLM-based grading, adds model options/config updates, fixes path/language handling, improves CLI output, and expands tests/docs.

  • Evals runtime (SDK):
    • Agent-based executor: Replace prompt parsing with LLMExecutor using PydanticAI Agent, structured tool calls, retries, and tool arg validation; introduce ProviderConfig.
    • Grading: Add evaluate_expected_answer for LLM-based grading of expected_answer with optional separate model.
    • Types: Simplify eval types; remove old per-provider model configs; enhance ParameterDefinition with JSON Schema.
  • Server evals service/CLI:
    • Service: Build model settings from user config options (e.g., api: responses, body:*, header:*; see the sketch after this list), support expected_answer_model, suite-level system_prompt, and detailed failure formatting.
    • CLI: Add TTY-aware progress renderer and suppress noisy logs; improved human-readable summaries.
  • Endpoint execution:
    • Tool executor: Switch to EndpointWithPath; use prepare_source_for_execution and repo-aware path resolution; fix Python file execution (append function name), robust SQL/file handling.
    • Utils: Improve path resolution and language detection; clearer errors.
  • Configuration models:
    • Change model type literal to "anthropic"|"openai"; add per-model options field; propagate timeouts; docs updated accordingly.
  • Docs:
    • Update configuration and quality guides: new model IDs/types, options, expected_answer, expected_answer_model, system_prompt, and Responses API examples; minor copy fixes.
  • Python executor:
    • Lower some logs to debug and refine error wrapping.
  • Dependencies:
    • Add pydantic-ai-slim[anthropic,openai].
  • Tests:
    • Add/expand tests for executor agent loop, grading, model settings/options, tool executor path handling, and user config options.
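
A minimal sketch of how such per-model options could be turned into request settings, assuming only the api / body / header:* key convention shown in the configuration example above; the split_model_options helper and its return shape are illustrative, not mxcp's actual service code:

# Illustrative only: split per-model options into an API selector, extra request
# body fields, and extra HTTP headers, following the key convention above.
from typing import Any

def split_model_options(options: dict[str, Any]) -> dict[str, Any]:
    settings: dict[str, Any] = {"api": None, "body": {}, "headers": {}}
    for key, value in options.items():
        if key == "api":
            settings["api"] = value  # e.g. "responses" to use the OpenAI Responses API
        elif key == "body":
            settings["body"].update(value)  # nested body fields, e.g. reasoning.effort
        elif key.startswith("header:"):
            settings["headers"][key.removeprefix("header:")] = value
    return settings

# Using the gpt-5 options from the example:
print(split_model_options({
    "api": "responses",
    "body": {"reasoning": {"effort": "medium"}},
    "header:custom-feature": "foo",
}))
# -> {'api': 'responses', 'body': {'reasoning': {'effort': 'medium'}}, 'headers': {'custom-feature': 'foo'}}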

Written by Cursor Bugbot for commit 3aec2be. This will update automatically on new commits.

"Check conversation history above for details.",
len(history),
)
return AgentResult(answer=str(answer), tool_calls=history)

Bug: Agent output None converts to literal string "None"

When retrieving the agent output with getattr(agent_run, "output", ""), if the output attribute exists but is None, the code receives None rather than the default empty string. Then at line 202, str(None) converts this to the literal string "None". If the agent fails to produce output, the answer field would contain "None" instead of an empty string, which could cause unexpected behavior in assertions like answer_contains that check the response content.
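
A minimal sketch of one way to guard against this, coalescing a None output before stringifying (variable names follow the excerpt above; this is a suggestion, not the merged fix):

# Treat a missing or None agent output as an empty answer instead of letting
# str(None) produce the literal string "None".
raw_output = getattr(agent_run, "output", None)
answer = "" if raw_output is None else str(raw_output)
return AgentResult(answer=answer, tool_calls=history)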
