
Conversation


@dharapandya85 dharapandya85 commented Dec 10, 2025

Description

Please include a summary of the changes and the related issue. Include relevant motivation and context.
This PR adds a new basic evaluation example using FakeAdapter to run simple evaluations using AgentUnit.
Fixes #9 (issue)

Type of Change

Please delete options that are not relevant:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test coverage improvement
  • Build/CI improvement

Changes Made

Please provide a detailed list of changes:

  • Added a new example: basic_evaluation.py
  • Applied YAML formatting fixes to .pre-commit-config.yaml
  • Introduced a minimal evaluation flow using FakeAdapter

Testing

Please describe the tests you ran to verify your changes:

  • Existing test suite passes (pytest)
  • Added new tests for new functionality
  • Manual testing performed
  • Tested with multiple Python versions (3.10, 3.11, 3.12)

Test Configuration

  • Python version: 3.12
  • Operating System: Windows (WSL2 / Ubuntu)
  • Relevant adapters tested: FakeAdapter

Test Results

# Paste relevant test output or results

Code Quality

  • My code follows the project's style guidelines (Ruff, Black)
  • I have run pre-commit run --all-files and addressed any issues
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have added type hints where appropriate
  • I have updated docstrings following Google style

Documentation

  • I have updated relevant documentation (README, docs/, inline comments)
  • I have added/updated docstrings for new/modified functions
  • I have updated CHANGELOG.md under [Unreleased]
  • I have added examples if introducing new features

Breaking Changes

If this PR introduces breaking changes, please describe:

  • What breaks:
  • Migration path for users:
  • Deprecation warnings added (if applicable):

Dependencies

  • No new dependencies added
  • New dependencies added (list below with justification)

New dependencies:

  • package-name: Reason for adding

Performance Impact

Describe any performance implications:

  • No performance impact
  • Performance improvement (describe and provide benchmarks)
  • Potential performance regression (describe and justify)

Screenshots/Recordings (if applicable)

Add screenshots or recordings to help explain your changes:

Additional Context

Add any other context about the PR here:

  • Links to related issues or PRs
  • References to external documentation
  • Design decisions and trade-offs
  • Known limitations or future work

Checklist

  • I have read the CONTRIBUTING.md guide
  • My branch name follows the convention (feature/, fix/, docs/, etc.)
  • My commit messages follow the conventional commit format
  • I have tested my changes locally
  • I have updated the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass
  • I have checked for security vulnerabilities in any new dependencies

Reviewer Notes

Please pay special attention to:

Post-Merge Tasks

Tasks to complete after merging (if any):

  • Update external documentation
  • Announce in discussions/community
  • Create follow-up issues for future work

Summary by CodeRabbit

  • Documentation
    • Added a runnable example demonstrating a complete end-to-end evaluation workflow using a minimal dataset and a mock adapter. Shows how to define evaluation cases and scenarios, execute a run, and produce a readable summary that includes the scenario name and success rate.



continue bot commented Dec 10, 2025

Keep this PR in a mergeable state →


All Green is an AI agent that automatically:

✅ Addresses code review comments

✅ Fixes failing CI checks

✅ Resolves merge conflicts


coderabbitai bot commented Dec 10, 2025

Walkthrough

Adds a new example script at src/agentunit/examples/basic_evaluation.py that implements a minimal end-to-end evaluation: defines FakeAdapter (with prepare, execute, cleanup), builds a single-case dataset, constructs a Scenario and Runner, runs the suite, and prints a summary via main().

Changes

Cohort / File(s): Example script (src/agentunit/examples/basic_evaluation.py)
Summary: New file: introduces FakeAdapter(BaseAdapter) with __init__(response: str), prepare(), execute(case, trace_log) -> AdapterOutcome, and cleanup(). Adds main() that creates one DatasetCase and dataset source, constructs a Scenario using the fake adapter, runs evaluation with Runner.run_suite(), extracts the scenario result, and prints the scenario name and success rate.
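
For orientation, here is a minimal sketch of the flow this file appears to implement, pieced together from the API details quoted elsewhere in this review (DatasetCase fields, DatasetSource.from_list, Scenario, Runner, and the result access pattern). Exact signatures, the runner entry point (run vs. run_suite, which varies across revisions in this thread), and the shape of the result object are assumptions, not the merged code:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.runner import Runner
from agentunit.core.scenario import Scenario
from agentunit.datasets.base import DatasetCase, DatasetSource


class FakeAdapter(BaseAdapter):
    """Mock adapter that always returns a fixed response."""

    def __init__(self, response: str):
        self.response = response

    def prepare(self) -> None:
        pass  # no setup needed for the fake adapter

    def execute(self, case, trace_log) -> AdapterOutcome:
        # Return the canned response regardless of the case's query
        return AdapterOutcome(success=True, output=self.response, error=None)

    def cleanup(self) -> None:
        pass  # no resources to release


def main() -> None:
    # One test case whose expected output matches the adapter's fixed response
    cases = [DatasetCase(id="case_1", query="hello", expected_output="hello")]
    scenario = Scenario(
        name="Basic Evaluation Example",
        adapter=FakeAdapter(response="hello"),
        dataset=DatasetSource.from_list(cases),
    )
    result = Runner([scenario]).run()
    scenario_result = result.scenarios[0]
    print(f"Scenario: {scenario.name}")
    print(f"Success rate: {scenario_result.success_rate:.0%}")


if __name__ == "__main__":
    main()
```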

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Script as basic_evaluation.py
  participant Runner
  participant Scenario
  participant Adapter as FakeAdapter
  participant Evaluator

  rect rgb(240,248,255)
    User->>Script: python examples/basic_evaluation.py
    Script->>Scenario: construct(datasetSource, adapter=FakeAdapter)
    Script->>Runner: Runner([Scenario]).run_suite()
  end

  rect rgb(245,255,245)
    loop per dataset case
      Runner->>Scenario: run_case(case)
      Scenario->>Adapter: prepare()
      Adapter-->>Scenario: prepared
      Scenario->>Adapter: execute(case, trace_log)
      Adapter-->>Scenario: AdapterOutcome (response, trace)
      Scenario->>Evaluator: evaluate(expected, response)
      Evaluator-->>Scenario: result (pass/fail, metrics)
      Scenario->>Adapter: cleanup()
    end
  end

  rect rgb(255,248,240)
    Runner-->>Script: aggregated scenario results
    Script->>User: print summary (scenario name, success rate)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • aviralgarg05

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Out of Scope Changes check (⚠️ Warning): PR includes YAML formatting fixes to .pre-commit-config.yaml beyond the scope of issue #9, which requested only the basic evaluation example. Resolution: remove the unrelated .pre-commit-config.yaml changes or justify them as necessary maintenance in a separate issue/PR.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 56.64%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
  • Title check (✅ Passed): Title clearly and concisely describes the main change: adding a basic evaluation example with FakeAdapter.
  • Description check (✅ Passed): PR description covers key sections but is missing testing results, updated docstrings, documentation updates, CHANGELOG, and local testing confirmation.
  • Linked Issues check (✅ Passed): PR successfully addresses issue #9: creates examples/basic_evaluation.py with FakeAdapter, includes comments, confirms no new dependencies, and reports manual testing.
✨ Finishing touches
  • 📝 Generate docstrings



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a2bce5 and 0ecd431.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • src/agentunit/examples/basic_evaluation.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (1)
src/agentunit/datasets/base.py (1)
  • name (38-39)
🔇 Additional comments (4)
src/agentunit/examples/basic_evaluation.py (3)

1-1: LGTM!

The imports are clean and appropriate for the example.


4-9: Verify the adapter interface signature.

The FakeAdapter.run() method accepts a single prompt: str parameter. Please verify that this matches the expected interface for model adapters in the agentunit library, as the actual adapter interface might expect a different signature (e.g., batch processing with multiple prompts, additional parameters, or a different return type).
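
For reference, the adapter contract as it is described later in this thread uses prepare/execute/cleanup rather than a single run() or generate() method. The following is a sketch of that described interface, not the library's verbatim definition; signatures and defaults are assumptions:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.trace import TraceLog
from agentunit.datasets.base import DatasetCase


class SketchAdapter(BaseAdapter):
    """Illustrative adapter following the contract described in later review comments."""

    name = "sketch"

    def prepare(self) -> None:
        # Lazy setup (clients, connections, ...)
        pass

    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
        # Process one dataset case and report success/output
        return AdapterOutcome(success=True, output="stub")

    def cleanup(self) -> None:
        # Optional resource cleanup; default is a no-op
        pass
```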


12-32: Verify the Evaluation API parameter name and type in the Evaluation class definition.

The code at line 20 creates an evaluation with task=[task], passing a list to a parameter named task (singular). Confirm:

  • Whether the parameter should be named task or tasks
  • Whether it expects a single Task object or a list

The logic and comments are otherwise clear and well-structured for a beginner example.

.pre-commit-config.yaml (1)

1-82: Confirm the Python version change from python3.10 to python3 is intentional.

The formatting changes align well with the YAML formatting fixes mentioned in the PR. However, verify that changing default_language_version.python from python3.10 to python3 on line 68 is intentional and compatible with your project's minimum Python version requirement. Using python3 is more flexible but less specific—ensure this doesn't create issues if your CI environment has multiple Python versions installed.

Owner

@aviralgarg05 aviralgarg05 left a comment


Solve the issues pls


codecov-commenter commented Dec 23, 2025

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (13)
README.md (1)

44-50: Improve formatting of "Run CI locally" section.

The section has minor formatting issues:

  • Line 45: Missing space after comma in "Python 3.10+,Poetry"
  • The section title uses ## but content lacks proper structure
🔎 Suggested formatting improvements
-##  Run CI locally
-Prerequisites: Python 3.10+,Poetry
+## Run CI locally
+
+Prerequisites: Python 3.10+, Poetry
+
 - poetry install --with dev
 - poetry check
 - poetry run ruff check .
 - poetry run ruff format --check .
 - poetry run pytest
tests/integration/conftest.py (2)

8-16: Add type hints to pytest hook functions.

The hook functions are missing type annotations for their parameters and return types. While pytest hooks have flexible signatures, adding type hints improves code clarity and enables better IDE support.

🔎 Proposed type hints
-def pytest_configure(config):
+def pytest_configure(config: pytest.Config) -> None:
     """Configure pytest markers for integration tests."""
     config.addinivalue_line(

19-23: Add type hints to pytest hook function.

The hook function is missing type annotations. Additionally, consider whether the string matching approach on line 22 is robust enough for your use case.

🔎 Proposed type hints
-def pytest_collection_modifyitems(config, items):
+def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
     """Automatically mark integration tests."""
tests/test_pytest_plugin.py (2)

12-31: Consider adding type hints to test adapter.

The SimpleTestAdapter class and its methods lack type annotations. While not strictly required for test code, type hints improve maintainability and catch potential type-related errors during development.

🔎 Proposed type hints
+from typing import Any, Callable
+from agentunit.datasets.base import DatasetCase
+from agentunit.core.trace import TraceLog
+
 class SimpleTestAdapter(BaseAdapter):
     """Simple adapter for testing."""
 
     name = "test"
 
-    def __init__(self, agent_func):
+    def __init__(self, agent_func: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
         self.agent_func = agent_func
 
-    def prepare(self):
+    def prepare(self) -> None:
         pass
 
-    def execute(self, case, trace):
+    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:

33-57: Consider adding type hints to mock classes.

The mock classes (MockConfig, MockSession, MockParent) lack type annotations. Adding type hints would improve test readability and help catch mock-related issues during development.

src/agentunit/pytest/cli.py (2)

46-157: Consider externalizing template strings.

The 106-line Python code template (lines 50-155) is embedded directly in the code, which impacts maintainability:

  • Hard to read and edit the template
  • Syntax highlighting doesn't work inside the string
  • Difficult to test the template independently
  • Changes to the template clutter git diffs

Consider moving templates to external files (e.g., a templates/ directory) or at minimum extracting them to module-level constants.

💡 Example approach
# At module level or in a separate templates.py file
EXAMPLE_SCENARIO_TEMPLATE = '''"""Example AgentUnit scenarios for pytest plugin."""
...
'''

def _create_example_files(eval_dir: Path) -> None:
    """Create example scenario files."""
    example_file = eval_dir / "example_scenarios.py"
    if not example_file.exists():
        example_file.write_text(EXAMPLE_SCENARIO_TEMPLATE)
        click.echo(f"Created {example_file}")

Alternatively, use a templates directory with importlib.resources or pathlib to load template files at runtime.
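
A rough sketch of the importlib.resources approach, assuming templates are packaged under a hypothetical agentunit.pytest.templates package (the package name and .tmpl file layout are illustrative, not existing project structure):

```python
from importlib import resources
from pathlib import Path

import click


def _load_template(name: str) -> str:
    # Read a template file shipped inside the (hypothetical) templates package
    return resources.files("agentunit.pytest.templates").joinpath(name).read_text()


def _create_example_files(eval_dir: Path) -> None:
    """Create example scenario files from packaged templates."""
    example_file = eval_dir / "example_scenarios.py"
    if not example_file.exists():
        example_file.write_text(_load_template("example_scenarios.py.tmpl"))
        click.echo(f"Created {example_file}")
```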


159-191: Consider externalizing README template.

Similar to the Python template above, the README template (29 lines) should be externalized for better maintainability.

src/agentunit/pytest/plugin.py (3)

47-63: Consider using a public attribute or pytest's stash for cache storage.

Storing on config._agentunit_cache uses a private attribute pattern (_ prefix) which works but is unconventional. Pytest's config.stash (introduced in pytest 7.0) is the recommended approach for plugins to store data.

🔎 Optional: Use pytest stash for cleaner integration
+from _pytest.stash import StashKey
+
+_cache_key: StashKey[ScenarioCache] = StashKey()
+
 def pytest_configure(config: Config) -> None:
     """Configure pytest with AgentUnit markers."""
     config.addinivalue_line("markers", "agentunit: mark test as an AgentUnit scenario evaluation")
     config.addinivalue_line("markers", "scenario(name): mark test with specific scenario name")

     # Initialize cache
     root_path = config.rootpath
     no_cache = config.getoption("agentunit_no_cache", default=False)
     clear_cache = config.getoption("agentunit_clear_cache", default=False)

     cache = ScenarioCache(root_path, enabled=not no_cache)
-    config._agentunit_cache = cache  # type: ignore[attr-defined]
+    config.stash[_cache_key] = cache

79-82: Verify directory matching logic handles edge cases.

The check "tests" in parts and "eval" in parts will match files like some/eval/other/tests/file.py where "tests" and "eval" aren't adjacent. Consider checking for the specific pattern tests/eval.

🔎 Proposed fix for stricter directory matching
 def _is_eval_directory(file_path: Path) -> bool:
     """Check if file is in tests/eval/ directory."""
     parts = file_path.parts
-    return "tests" in parts and "eval" in parts
+    # Check for "tests" followed directly by "eval" in the path
+    for i, part in enumerate(parts[:-1]):
+        if part == "tests" and parts[i + 1] == "eval":
+            return True
+    return False

88-101: Broad exception handling may hide important errors.

Catching all Exception types when discovering scenarios and silently converting to a failing test item could mask configuration issues. Consider logging the exception or being more selective about which exceptions to catch.

🔎 Proposed improvement for better error visibility
         try:
             scenarios = self._discover_scenarios()
             for scenario in scenarios:
                 yield AgentUnitItem.from_parent(self, name=scenario.name, scenario=scenario)
         except Exception as e:
             # If we can't load scenarios, create a single failing test
+            logger.warning(f"Failed to load scenarios from {self.path}: {e}")
             yield AgentUnitItem.from_parent(
                 self,
                 name=f"load_error_{self.path.stem}",
                 scenario=None,
                 load_error=str(e),
             )
src/agentunit/pytest/cache.py (1)

107-143: Cache invalidation logic may have a gap.

On lines 127-132, the source hash comparison only invalidates if both hashes exist. If the cached entry has a source_hash but the current file doesn't exist (current_source_hash is None), the cache will still be used. This could be intentional, but it might be surprising if the source file is deleted.

🔎 Consider invalidating when source file is removed
             if current_source_hash and cached_source_hash:
                 if current_source_hash != cached_source_hash:
                     logger.info(
                         f"Cache invalidated for scenario '{scenario.name}' (source file changed)"
                     )
                     return None
+            elif cached_source_hash and not current_source_hash:
+                # Source file was removed or became unreadable
+                logger.info(
+                    f"Cache invalidated for scenario '{scenario.name}' (source file removed)"
+                )
+                return None
tests/integration/test_langgraph_integration.py (2)

93-106: Hardcoded path may cause issues in different environments.

Line 96 uses a hardcoded path "tests/integration/simple_langgraph_agent.py". This assumes the test is run from the repository root. Consider using Path(__file__).parent for a more robust path resolution.

🔎 Proposed fix for portable path resolution
     def test_langgraph_scenario_from_python_file(self):
         """Test creating a scenario from a Python file."""
-        # Create a temporary Python file with the agent
-        agent_file = Path("tests/integration/simple_langgraph_agent.py")
+        # Use path relative to this test file
+        agent_file = Path(__file__).parent / "simple_langgraph_agent.py"

         scenario = Scenario.load_langgraph(
             path=agent_file,

170-186: Test doesn't verify retry behavior.

The test sets scenario.retries = 2 but uses a non-failing agent, so retries are never exercised. This verifies the retries attribute can be set but doesn't test the actual retry mechanism.

Consider adding a test with an agent that fails intermittently to verify the retry logic is actually invoked.
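
One way such a test could look, sketched here: an adapter that fails on its first execute call and succeeds afterwards, so a retry count of 2 should still yield a pass. The constructor and runner calls follow the signatures quoted elsewhere in this review and are assumptions; the assertion only checks that execute was invoked more than once when retries are configured:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.runner import Runner
from agentunit.core.scenario import Scenario
from agentunit.datasets.base import DatasetCase, DatasetSource


class FlakyAdapter(BaseAdapter):
    """Fails on the first call, then succeeds, to exercise retry handling."""

    name = "flaky"

    def __init__(self) -> None:
        self.calls = 0

    def prepare(self) -> None:
        pass

    def execute(self, case, trace) -> AdapterOutcome:
        self.calls += 1
        if self.calls == 1:
            return AdapterOutcome(success=False, output=None, error="transient failure")
        return AdapterOutcome(success=True, output=case.expected_output)

    def cleanup(self) -> None:
        pass


def test_retry_logic_is_invoked():
    adapter = FlakyAdapter()
    dataset = DatasetSource.from_list(
        [DatasetCase(id="retry-case", query="hi", expected_output="hi")]
    )
    scenario = Scenario(name="flaky-retry", adapter=adapter, dataset=dataset)
    scenario.retries = 2
    Runner([scenario]).run()
    # If retries are honoured, the adapter must have been called more than once
    assert adapter.calls >= 2
```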

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ecd431 and fcd60d9.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (30)
  • README.md
  • docs/pytest-plugin.md
  • examples/basic_evaluation.py
  • pyproject.toml
  • src/agentunit/adapters/base.py
  • src/agentunit/core/__init__.py
  • src/agentunit/core/exceptions.py
  • src/agentunit/core/replay.py
  • src/agentunit/core/runner.py
  • src/agentunit/core/scenario.py
  • src/agentunit/core/trace.py
  • src/agentunit/examples/basic_evaluation.py
  • src/agentunit/py.typed
  • src/agentunit/pytest/__init__.py
  • src/agentunit/pytest/cache.py
  • src/agentunit/pytest/cli.py
  • src/agentunit/pytest/plugin.py
  • tests/__init__.py
  • tests/eval/__init__.py
  • tests/eval/example_scenarios.py
  • tests/eval/failing_scenario.py
  • tests/integration/IMPLEMENTATION_SUMMARY.md
  • tests/integration/README.md
  • tests/integration/__init__.py
  • tests/integration/ci-example.yml
  • tests/integration/conftest.py
  • tests/integration/simple_langgraph_agent.py
  • tests/integration/test_integration_basic.py
  • tests/integration/test_langgraph_integration.py
  • tests/test_pytest_plugin.py
💤 Files with no reviewable changes (1)
  • src/agentunit/py.typed
✅ Files skipped from review due to trivial changes (8)
  • tests/eval/__init__.py
  • tests/integration/__init__.py
  • tests/__init__.py
  • src/agentunit/core/exceptions.py
  • src/agentunit/core/__init__.py
  • src/agentunit/core/replay.py
  • src/agentunit/core/runner.py
  • tests/integration/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (8)
src/agentunit/pytest/__init__.py (1)
src/agentunit/pytest/plugin.py (3)
  • pytest_addoption (28-44)
  • pytest_collect_file (66-76)
  • pytest_configure (47-63)
src/agentunit/adapters/base.py (2)
src/agentunit/datasets/base.py (1)
  • DatasetCase (19-27)
src/agentunit/core/trace.py (1)
  • TraceLog (26-78)
examples/basic_evaluation.py (1)
src/agentunit/adapters/base.py (1)
  • BaseAdapter (28-69)
tests/integration/test_integration_basic.py (1)
tests/integration/simple_langgraph_agent.py (1)
  • invoke_agent (95-129)
src/agentunit/pytest/cache.py (2)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (1)
  • name (38-39)
tests/eval/failing_scenario.py (3)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/adapters/base.py (2)
  • AdapterOutcome (18-25)
  • BaseAdapter (28-69)
src/agentunit/datasets/base.py (3)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • name (38-39)
tests/integration/simple_langgraph_agent.py (1)
tests/test_framework_adapters.py (1)
  • agent (90-92)
src/agentunit/pytest/plugin.py (4)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/core/runner.py (1)
  • run_suite (101-111)
src/agentunit/core/exceptions.py (1)
  • AgentUnitError (8-11)
src/agentunit/pytest/cache.py (3)
  • ScenarioCache (33-192)
  • clear (178-192)
  • get (107-143)
🪛 LanguageTool
docs/pytest-plugin.md

[grammar] ~29-~29: Ensure spelling is correct
Context: ...l, .json`): Loads scenarios using the nocode module ### Python Scenario Files Crea...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (38)
src/agentunit/core/trace.py (1)

1-3: LGTM! Docstring formatting improvements.

The conversion from single-line to multi-line docstrings improves readability and aligns with PEP 257 conventions for class and module documentation.

Also applies to: 16-18, 27-29

src/agentunit/adapters/base.py (1)

35-42: LGTM! Enhanced docstrings improve adapter implementation guidance.

The expanded docstrings clearly document method contracts, parameter types, and return values, making it easier for developers to implement custom adapters.

Also applies to: 46-55, 58-66

README.md (1)

203-220: LGTM! Expanded test documentation improves developer experience.

The expanded test command examples and integration test documentation provide clear guidance for contributors on running different test suites.

tests/integration/simple_langgraph_agent.py (1)

8-34: LGTM! Well-structured integration test helper with proper fallback handling.

The implementation provides excellent defensive programming:

  • Graceful degradation when LangGraph is unavailable
  • Clear mock responses for testing without dependencies
  • Proper type stubs for import fallback scenarios

This pattern enables running tests without requiring all optional dependencies installed, which aligns well with the integration test design mentioned in the README.

Also applies to: 46-93, 95-129

src/agentunit/core/scenario.py (1)

1-3: LGTM! Consistent docstring formatting improvements.

The multi-line docstring formatting aligns with the changes across other core modules and improves documentation readability.

Also applies to: 24-26, 82-84, 100-102

src/agentunit/pytest/__init__.py (1)

1-6: LGTM! Clean pytest plugin entry point.

The module correctly exposes pytest hooks for plugin registration. The structure follows pytest plugin conventions with explicit __all__ exports.

tests/integration/ci-example.yml (1)

30-30: Pytest markers are properly registered. Both integration and langgraph markers are defined in pyproject.toml under [tool.pytest.ini_options].markers (lines 65-66), and the configuration includes --strict-markers which enforces marker registration to prevent warnings.

examples/basic_evaluation.py (1)

10-11: The imports in basic_evaluation.py are already correct and no changes are needed.

The current file imports from agentunit directly (from agentunit import DatasetCase, Runner, Scenario), which correctly uses the exported classes from the top-level package. The problematic imports mentioned in the review (agentunit.core.adapters and agentunit.core.evaluator) do not exist in the current version of this file. The script runs without errors as-is.

Likely an incorrect or invalid review comment.

tests/test_pytest_plugin.py (1)

236-446: Excellent cache test coverage!

The cache tests comprehensively cover creation, hits, disabled mode, invalidation on source changes, clearing, failure storage, and key uniqueness. This thorough testing ensures the caching functionality is reliable.

tests/integration/test_integration_basic.py (2)

10-20: LGTM - Clear directory structure validation.

The test verifies that all required integration test files are present, which is useful for catching incomplete test setups.


23-47: LGTM - Comprehensive fallback testing.

The test properly validates both the success and fallback paths for the LangGraph integration, ensuring the agent works with and without the optional LangGraph dependency.

src/agentunit/pytest/cli.py (1)

10-43: LGTM - Clean CLI command structure.

The init_eval command is well-structured with clear options and helpful user guidance. The implementation correctly creates the directory structure and provides next steps.

pyproject.toml (4)

37-41: LGTM - Well-structured optional dependencies.

The addition of langgraph as an optional dependency with a dedicated integration-tests extra follows best practices. This allows users to install the package without LangGraph unless they specifically need integration testing capabilities.


50-53: LGTM - Proper pytest plugin registration.

The script entry point and pytest plugin registration are correctly configured:

  • agentunit-init-eval script points to the CLI command
  • pytest11 plugin entry point properly registers the AgentUnit plugin

This follows the standard pytest plugin registration pattern.


63-78: Excellent pytest configuration!

The pytest configuration is comprehensive and follows best practices:

  • Clear marker definitions with descriptions
  • Appropriate testpaths configuration
  • Standard test discovery patterns
  • Strict markers and config enforcement
  • Useful reporting options with -ra

This configuration provides a solid foundation for the test infrastructure.


27-27: The codebase does not directly use the langchain package. It imports langsmith (a separate LangChain-maintained package) for monitoring and tracing. The langchain dependency appears to be a transitive dependency through crewai. Since there are no direct langchain APIs in the codebase, the widened version constraint poses no compatibility risk.

Likely an incorrect or invalid review comment.

tests/integration/IMPLEMENTATION_SUMMARY.md (1)

1-131: Excellent implementation documentation!

This summary provides a clear and comprehensive overview of the integration test implementation, including completed tasks, acceptance criteria, file structure, test coverage, and usage examples. This will be valuable for maintainers and contributors.

docs/pytest-plugin.md (1)

69-82: Remove duplicate MyDataset definition.

The MyDataset class definition is duplicated with inconsistent formatting, making the example confusing.

🔎 Proposed fix to remove duplication
 class MyDataset(DatasetSource):
     def __init__(self):
         super().__init__(name="my-dataset", loader=self._generate_cases)
-
     
     def _generate_cases(self):
         return [

Likely an incorrect or invalid review comment.

src/agentunit/pytest/plugin.py (4)

1-26: LGTM! Clean module setup with proper type annotations.

The imports are well-organized with TYPE_CHECKING guard for type-only imports, and the logging setup follows best practices.


28-44: LGTM! Command-line options are well-defined.

The cache options (--no-cache, --clear-cache) are appropriately grouped and documented.


199-245: Good implementation of scenario execution with caching.

The runtest method correctly handles cached results, runs fresh scenarios when needed, collects failures, and caches results. The flow is clear and well-structured.


247-255: LGTM! Standard pytest item methods.

repr_failure and reportinfo are correctly implemented for proper test reporting.

tests/eval/example_scenarios.py (5)

1-6: LGTM! Clean imports for example scenarios.

The imports correctly reference the agentunit package components needed for the examples.


8-26: Well-structured adapter implementation.

The SimpleAdapter correctly wraps function-based agents, handles execution results, and properly catches exceptions to return failure outcomes.


29-49: Good example dataset with clear test cases.

The ExampleDataset provides meaningful test cases with appropriate metadata. The structure is easy to understand for newcomers.


52-69: Example agent and scenario are well-documented.

The example_agent demonstrates a simple pattern, and the example_scenario is properly configured for auto-discovery by the pytest plugin.


72-105: Good factory function pattern demonstration.

The scenario_math_focused factory shows how to create scenarios dynamically with inner classes and functions. This is a useful pattern for the pytest plugin's discovery mechanism.

src/agentunit/pytest/cache.py (6)

1-21: LGTM! Well-organized module setup.

Good use of constants for cache directory name and version, enabling future cache format migrations.


23-30: LGTM! Clean dataclass definition.

CachedResult is properly defined with appropriate fields and optional source_hash.


41-48: Good practice adding .gitignore to cache directory.

This prevents accidental commits of cache files. The implementation is clean.


50-90: Potential side effect: iter_cases() may consume generator or trigger I/O.

Line 72 calls list(scenario.dataset.iter_cases()) which could trigger side effects (e.g., file reads, API calls) just for computing a cache key. If the dataset is expensive to iterate or is a one-time-use generator, this could cause issues.

Consider whether caching the case count or a dataset fingerprint separately would be safer, or document this behavior.
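
A possible direction, sketched under the assumption that the dataset exposes a stable name (as suggested by the datasets/base.py name property referenced in this review): fold the dataset name and the source hash this cache module already computes into the key, instead of iterating cases. The helper name and exact inputs are illustrative:

```python
import hashlib


def _dataset_fingerprint(scenario, source_hash: str | None) -> str:
    # Avoid calling iter_cases(): combine the scenario name, the dataset's
    # declared name, and the already-computed source-file hash.
    parts = [scenario.name, getattr(scenario.dataset, "name", ""), source_hash or ""]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```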


145-176: LGTM! Robust cache writing with proper error handling.

The set method correctly handles disabled cache, ensures directory exists, and catches write errors without crashing.


178-192: LGTM! Clean cache clearing implementation.

The clear method handles missing directory gracefully and reports the count of removed files.

tests/integration/test_langgraph_integration.py (5)

1-15: LGTM! Proper conditional test skipping.

Using pytest.importorskip for LangGraph ensures tests only run when the dependency is available, which is the correct pattern for optional integration tests.


17-51: LGTM! Well-structured test dataset.

The SimpleDataset provides diverse test cases with appropriate metadata and difficulty levels for comprehensive testing.


59-91: Good coverage of scenario creation and execution.

The test verifies scenario creation from a callable, adapter preparation, and execution with assertion on output content.


188-204: LGTM! Good test for immutability of clone.

The test correctly verifies that cloning creates a new scenario with modified attributes while preserving the original.


206-239: LGTM! Good coverage of adapter registry and multiple scenarios.

The module-level tests verify adapter registration and the ability to run multiple scenarios together.

Comment on lines 41 to 67
class SimpleAdapter(BaseAdapter):
    name = "simple"

    def __init__(self, agent_func):
        self.agent_func = agent_func

    def prepare(self):
        pass

    """Simple adapter for function-based agents."""

    name = "simple"

    def __init__(self, agent_func):
        self.agent_func = agent_func

    def prepare(self):
        pass

    def execute(self, case, trace):
        try:
            result = self.agent_func({"query": case.query})
            output = result.get("result", "")
            success = output == case.expected_output
            return AdapterOutcome(success=success, output=output)
        except Exception as e:
            return AdapterOutcome(success=False, output=None, error=str(e))

⚠️ Potential issue | 🟠 Major

Remove duplicate SimpleAdapter definition.

The SimpleAdapter class is defined twice in the same code example:

  • Lines 41-50: First definition (incomplete)
  • Lines 50-67: Second definition (complete)

This duplication makes the documentation confusing and harder to maintain.

🔎 Proposed fix to remove duplication
 class SimpleAdapter(BaseAdapter):
+    """Simple adapter for function-based agents."""
+    
     name = "simple"
-
+    
     def __init__(self, agent_func):
         self.agent_func = agent_func
-
+    
     def prepare(self):
         pass
-
-    """Simple adapter for function-based agents."""
-    
-    name = "simple"
-    
-    def __init__(self, agent_func):
-        self.agent_func = agent_func
-    
-    def prepare(self):
-        pass
     
     def execute(self, case, trace):
🤖 Prompt for AI Agents
In docs/pytest-plugin.md around lines 41 to 67 there is a duplicated
SimpleAdapter class (first incomplete definition at ~41–50 and a full definition
at ~50–67); remove the earlier incomplete duplicate so only the complete class
definition remains (keep the docstring, name, __init__, prepare, execute and
exception handling from the full definition), and ensure there are no leftover
duplicate imports/comments after deletion.

Comment on lines 236 to 266
## Configuration

Add pytest configuration in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
markers = [
"agentunit: marks test as an AgentUnit scenario evaluation",
"scenario(name): marks test with specific scenario name",
]
testpaths = ["tests", "tests/eval"]
```

## Example Directory Structure

```
project/
├── tests/
│ ├── eval/ # AgentUnit scenarios
│ │ ├── __init__.py
│ │ ├── basic_scenarios.py
│ │ └── advanced_scenarios.py
│ └── test_regular.py
├── src/
│ └── myproject/
└── pyproject.toml
```

# Run with coverage
pytest tests/eval/ --cov=myproject
```

⚠️ Potential issue | 🟠 Major

Remove duplicate sections.

The "Configuration" and "Example Directory Structure" sections (lines 236-262) are duplicated from earlier in the document (lines 148-175). Additionally, there's an incomplete code block at the end (lines 264-266) that appears to be a fragment from another section.

🔎 Proposed fix

Remove lines 236-266 entirely, as these sections are already covered earlier in the documentation.

🤖 Prompt for AI Agents
In docs/pytest-plugin.md around lines 236 to 266, the "Configuration" and
"Example Directory Structure" sections are duplicates of earlier content and the
trailing code block fragment is incomplete; delete lines 236–266 entirely
(removing both duplicated sections and the stray code block), ensure surrounding
spacing/markdown remains valid after removal, and confirm the original
occurrences earlier in the file still contain the intended content.

Comment on lines 14 to 22
class FakeAdapter(BaseAdapter):
    """
    A simple mock adapter used only for demonstration.
    It returns a predictable output so evaluation is easy to understand.
    """

    def generate(self, prompt: str) -> str:
        # Always returns the same answer for simplicity
        return "Hello, this is a fake response!"

⚠️ Potential issue | 🔴 Critical

FakeAdapter does not implement the BaseAdapter interface correctly.

FakeAdapter must implement the abstract methods required by BaseAdapter:

  1. prepare(self) -> None - for lazy setup
  2. execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome - to process dataset cases
  3. cleanup(self) -> None - for resource cleanup (optional, has default no-op)

The current implementation only defines a generate(prompt: str) method, which is not part of the BaseAdapter interface. This will cause a TypeError at instantiation since abstract methods are not implemented.

🔎 Proposed fix: Implement required BaseAdapter methods
+from agentunit.adapters.base import BaseAdapter, AdapterOutcome
+from agentunit.core.trace import TraceLog
+from agentunit.datasets.base import DatasetCase
+

 class FakeAdapter(BaseAdapter):
     """
     A simple mock adapter used only for demonstration.
     It returns a predictable output so evaluation is easy to understand.
     """
+
+    name = "fake"
+
+    def prepare(self) -> None:
+        """No setup required for the fake adapter."""
+        pass
+
+    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
+        """Execute with a fake response."""
+        response = "Hello, this is a fake response!"
+        trace.record("agent_response", content=response)
+        return AdapterOutcome(
+            success=True,
+            output=response,
+            tool_calls=[],
+        )
+
+    def cleanup(self) -> None:
+        """No cleanup required for the fake adapter."""
+        pass
-
-    def generate(self, prompt: str) -> str:
-        # Always returns the same answer for simplicity
-        return "Hello, this is a fake response!"
🤖 Prompt for AI Agents
In examples/basic_evaluation.py around lines 14 to 22, FakeAdapter currently
defines only generate(prompt: str) which does not satisfy the BaseAdapter
abstract interface; implement the required methods by adding prepare(self) ->
None for lazy setup, execute(self, case: DatasetCase, trace: TraceLog) ->
AdapterOutcome that uses the existing generate logic to produce an
AdapterOutcome for the given DatasetCase (populate outputs, scores, and trace as
appropriate), and optionally cleanup(self) -> None (can be a no-op); ensure
method signatures and return types exactly match BaseAdapter so the class can be
instantiated without TypeError.

Comment on lines 155 to 170
def _import_module(self) -> Any:
    """Import Python module from file path."""
    try:
        import importlib.util
        import sys

        spec = importlib.util.spec_from_file_location(self.path.stem, self.path)
        if spec is None or spec.loader is None:
            return None

        module = importlib.util.module_from_spec(spec)
        sys.modules[self.path.stem] = module
        spec.loader.exec_module(module)
        return module
    except Exception:
        return None

⚠️ Potential issue | 🟡 Minor

Module name collision risk when importing multiple files with the same stem.

Using self.path.stem as the module name in sys.modules could cause collisions if two files have the same filename in different directories (e.g., tests/eval/foo/scenarios.py and tests/eval/bar/scenarios.py).

🔎 Proposed fix using unique module names
     def _import_module(self) -> Any:
         """Import Python module from file path."""
         try:
             import importlib.util
             import sys

-            spec = importlib.util.spec_from_file_location(self.path.stem, self.path)
+            # Use full path as module name to avoid collisions
+            module_name = f"agentunit_eval.{self.path.stem}_{hash(str(self.path))}"
+            spec = importlib.util.spec_from_file_location(module_name, self.path)
             if spec is None or spec.loader is None:
                 return None

             module = importlib.util.module_from_spec(spec)
-            sys.modules[self.path.stem] = module
+            sys.modules[module_name] = module
             spec.loader.exec_module(module)
             return module
         except Exception:
             return None
🤖 Prompt for AI Agents
In src/agentunit/pytest/plugin.py around lines 155 to 170, the code uses
self.path.stem as the module name which can collide when different files share
the same filename; change the import to use a unique module name (for example a
hash of the absolute path, the full path string, or append a UUID) when calling
spec_from_file_location and when inserting into sys.modules so each file gets a
distinct module identity; also ensure you use that same unique name for
module_from_spec and consider removing the entry from sys.modules on exceptions
or after use to avoid leaking stale entries.

Comment on lines 50 to 59
def always_wrong_agent(payload):
    """Agent that can answer the meaning of life question."""
    query = payload.get("query", "").lower()

    # Handle the meaning of life question
    if "meaning of life" in query:
        return {"result": "42"}

    # Default response for other queries
    return {"result": "I don't know"}

⚠️ Potential issue | 🔴 Critical

Critical logic error: scenario will pass, not fail.

The function name, file purpose, dataset name, and scenario name all indicate this should be a failing scenario for testing. However, the agent returns "42" when the query contains "meaning of life" (line 56), which exactly matches the expected output "42" (line 44). This will cause the scenario to PASS, not fail.

To create a proper failing scenario, the agent should return an incorrect answer for this test case.

🔎 Proposed fix to make the scenario actually fail
 def always_wrong_agent(payload):
-    """Agent that can answer the meaning of life question."""
+    """Agent that always gives wrong answers for testing failures."""
     query = payload.get("query", "").lower()
 
-    # Handle the meaning of life question
+    # Intentionally return wrong answer for the meaning of life question
     if "meaning of life" in query:
-        return {"result": "42"}
+        return {"result": "I don't know"}
 
     # Default response for other queries
-    return {"result": "I don't know"}
+    return {"result": "42"}
🤖 Prompt for AI Agents
In tests/eval/failing_scenario.py around lines 50 to 59, the agent currently
returns the correct answer "42" for queries containing "meaning of life", which
makes the scenario pass; change the handler so that when the query contains
"meaning of life" it returns an incorrect result (e.g., any value other than
"42" or a clearly wrong string) to ensure the test actually fails; keep the rest
of the default behavior unchanged.

Comment on lines 50 to 55
@pytest.mark.integration
def test_pytest_markers_configured():
    """Test that pytest markers are properly configured."""

    # This test itself should have the integration marker
    # We can't easily test this programmatically, but the test runner will validate it

⚠️ Potential issue | 🟡 Minor

Remove or enhance placeholder test.

This test contains no assertions and the comment admits it can't validate anything programmatically. A test that doesn't assert anything provides no value and can give false confidence in test coverage. Either remove it or add actual validation logic.

💡 Suggestion

Consider removing this test entirely, or enhance it to programmatically check for marker presence:

@pytest.mark.integration
def test_pytest_markers_configured():
    """Test that pytest markers are properly configured."""
    import inspect
    
    # Get the current test function
    current_test = inspect.currentframe()
    
    # This test should have the integration marker
    # If it runs, the marker is working (though we can't easily assert it here)
    # Instead, rely on the marker tests in test_pytest_plugin.py
    assert True  # Placeholder - marker validation happens via pytest collection

Or simply remove it and rely on tests/test_pytest_plugin.py::TestPytestPlugin::test_pytest_markers for marker validation.

🤖 Prompt for AI Agents
In tests/integration/test_integration_basic.py around lines 50-55 the test is a
placeholder with no assertions; either delete the test or replace it with a
programmatic check that the current test has the integration marker by accepting
the pytest "request" fixture and asserting that any(marker.name == "integration"
for marker in request.node.iter_markers()); update the docstring accordingly if
you keep it.

Owner

@aviralgarg05 aviralgarg05 left a comment


Fix the issues pls

@dharapandya85 dharapandya85 force-pushed the add-basic-evaluation-example branch from fcd60d9 to aa46eb6 Compare December 24, 2025 19:09
Author

dharapandya85 commented Dec 24, 2025

Fix the issues pls

I had many extra commits redirected to this branch during git fetch upstream, so I hard-reset the branch; now only my commit is visible.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/agentunit/examples/basic_evaluation.py (1)

12-36: Enhance comments for newcomer understanding.

The PR objectives specify that the example should "include explanatory comments for each step to aid newcomers." While the current comments mark each step, they could be more explanatory to help users who are new to AgentUnit understand what each component does and why.

Consider expanding the comments to explain:

  • What FakeAdapter is and why it's useful for testing (a simple mock that returns a fixed response)
  • The purpose and structure of DatasetCase fields (id, query, expected_output)
  • What the Runner does (orchestrates evaluation of scenarios)
  • What information the evaluation result provides
🔎 Suggested enhancement for explanatory comments
 def main() -> None:
-    # define simple dataset
+    # Define a simple dataset with test cases
+    # Each DatasetCase represents a single test with an input query and expected output
     cases = [
         DatasetCase(
             id="echo-task",
             query="Say hello",
             expected_output="hello",
         )
     ]
-    # create a scenario using the fake adapter
+    
+    # Create a scenario using FakeAdapter (a simple mock adapter for testing)
+    # FakeAdapter always returns the same fixed response, useful for demonstrations
     scenario = Scenario(
         name="Basic Evaluation Example",
         adapter=FakeAdapter(response="hello"),
         dataset=cases,
     )

-    # run evaluation
+    # Run the evaluation
+    # Runner orchestrates the execution of all test cases in the scenario
     runner = Runner(scenarios=[scenario])
     result = runner.run()

-    # print summary
+    # Print evaluation summary
+    # The result contains metrics like success rate and individual test outcomes
     print("=== Evaluation Summary ===")
     print(f"Scenario: {scenario.name}")
     print(f"Success rate: {result.success_rate:.0%}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa46eb6 and 119ac8c.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (3)
src/agentunit/datasets/base.py (2)
  • DatasetCase (19-27)
  • name (38-39)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
🔇 Additional comments (2)
src/agentunit/examples/basic_evaluation.py (2)

1-1: LGTM!

The imports are correctly structured for the example.


38-39: LGTM!

The main guard is correctly positioned at module level, allowing the script to execute when run directly. The indentation issue from the previous review has been properly addressed.

Owner

@aviralgarg05 aviralgarg05 left a comment


Please fix the issues

@dharapandya85 dharapandya85 force-pushed the add-basic-evaluation-example branch from 119ac8c to 3a427d0 Compare January 2, 2026 18:46

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a427d0 and 158dfdf.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (3)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (4)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • from_list (49-50)
  • name (38-39)
🔇 Additional comments (3)
src/agentunit/examples/basic_evaluation.py (3)

26-26: LGTM!

Correctly uses DatasetSource.from_list(cases) to wrap the dataset, matching the API shown in the relevant code snippets.


29-37: LGTM!

The Scenario and Runner setup correctly matches the API contracts from the relevant code snippets:

  • Scenario receives the required name, adapter, and dataset parameters
  • Runner is initialized with a list of scenarios
  • runner.run() is called with no arguments as expected

39-47: LGTM!

The result access pattern and main guard are correctly implemented:

  • Line 39 extracts the first scenario result from the suite
  • Line 43 accesses success_rate on the scenario result (matches the resolved pattern from past reviews)
  • Lines 46-47 place the main guard at module level with proper formatting

Comment on lines 21 to 24
        DatasetCase(
            input="hello",
            expected="hello",
        )

⚠️ Potential issue | 🔴 Critical

Fix DatasetCase field names to match the API.

The DatasetCase constructor arguments don't match the actual field names defined in the class. From the relevant code snippets, DatasetCase has:

  • id: str (required)
  • query: str (required, not input)
  • expected_output: str | None (optional, not expected)

This will cause a TypeError at runtime, violating the acceptance criteria that the script must run without errors.

🔎 Proposed fix for DatasetCase construction
     cases = [
         DatasetCase(
-            input="hello",
-            expected="hello",
+            id="case_1",
+            query="hello",
+            expected_output="hello",
         )
     ]
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around lines 21 to 24, the
DatasetCase constructor is using incorrect field names; replace the kwargs to
match the class API by adding a required id (e.g., "case-1"), rename input to
query, and rename expected to expected_output (or omit expected_output if you
want None). Ensure the resulting call is DatasetCase(id="...", query="hello",
expected_output="hello") or DatasetCase(id="...", query="hello") if no expected
output is needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/agentunit/examples/basic_evaluation.py (1)

22-48: Fix incorrect DatasetCase field name and add explanatory comments.

Two issues prevent this example from meeting the acceptance criteria:

  1. Critical: Line 27 uses input="hello", but DatasetCase expects query (not input). This will cause a TypeError at runtime.

  2. Essential: The acceptance criteria require "explanatory comments for each step aimed at newcomers," but the main function lacks these comments.

🔎 Proposed fix
 def main() -> None:
-    # define simple dataset
+    # Step 1: Define a simple dataset with one test case
+    # Each case has an id, a query (input), and expected output
     cases = [
         DatasetCase(
             id="case_1",
-            input="hello",
+            query="hello",
             expected_output="hello",
         )
     ]
     
-    # create a scenario using the fake adapter
+    # Step 2: Create a scenario that combines the dataset with an adapter
+    # The FakeAdapter will return "hello" for any input
     scenario = Scenario(
         name="Basic Evaluation Example",
         adapter=FakeAdapter(response="hello"),
         dataset=DatasetSource.from_list(cases),
     )
 
-    # run evaluation
+    # Step 3: Run the evaluation using the Runner
+    # The runner executes all scenarios and collects results
     runner = Runner([scenario])
     result = runner.run()
 
+    # Step 4: Extract and print the results
     scenario_result = result.scenarios[0]
-    # print summary
     print("=== Evaluation Summary ===")
     print(f"Scenario: {scenario.name}")
     print(f"Success rate: {scenario_result.success_rate:.0%}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 158dfdf and 9e67660.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (5)
src/agentunit/adapters/base.py (2)
  • BaseAdapter (28-69)
  • AdapterOutcome (18-25)
src/agentunit/core/trace.py (1)
  • TraceLog (26-78)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (4)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • name (38-39)
  • from_list (49-50)

@@ -0,0 +1,51 @@
from agentunit.adapters.base import BaseAdapter, AdapterOutcome
from agentunit.core.tracelog import TraceLog

⚠️ Potential issue | 🔴 Critical

Fix the incorrect import path for TraceLog.

The import references agentunit.core.tracelog, but based on the codebase structure, TraceLog is defined in agentunit.core.trace (not tracelog). This will cause an ImportError at runtime, preventing the script from executing.

🔎 Proposed fix
-from agentunit.core.tracelog import TraceLog
+from agentunit.core.trace import TraceLog
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around line 2, the import path is
incorrect: it imports TraceLog from agentunit.core.tracelog but TraceLog is
defined in agentunit.core.trace; update the import to import TraceLog from
agentunit.core.trace so the module path matches the codebase and prevents
ImportError at runtime.

Comment on lines +8 to +19
class FakeAdapter(BaseAdapter):
    def __init__(self, response: str):
        self.response = response

    def prepare(self):
        pass

    def execute(self, case, trace_log: TraceLog) -> AdapterOutcome:
        return AdapterOutcome(success=True, output=self.response, error=None)

    def cleanup(self):
        pass

🛠️ Refactor suggestion | 🟠 Major

Add explanatory comments and type hints for newcomers.

The acceptance criteria require that the code be "well-commented and understandable for newcomers." The FakeAdapter class lacks explanatory comments describing its purpose and how it implements the adapter pattern. Additionally, the case parameter in execute() is missing its type hint.

🔎 Proposed improvements
 class FakeAdapter(BaseAdapter):
+    """
+    A minimal adapter that returns a fixed response for any input.
+    Used for demonstration purposes in this example.
+    """
+    
     def __init__(self, response: str):
         self.response = response
 
     def prepare(self):
+        """Set up the adapter before execution (no-op for this example)."""
         pass
 
-    def execute(self, case, trace_log: TraceLog) -> AdapterOutcome:
+    def execute(self, case: DatasetCase, trace_log: TraceLog) -> AdapterOutcome:
+        """Execute a test case by returning the fixed response."""
         return AdapterOutcome(success=True, output=self.response, error=None)
 
     def cleanup(self):
+        """Clean up resources after execution (no-op for this example)."""
         pass
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around lines 8 to 19, the
FakeAdapter class lacks explanatory comments and the execute() method is missing
a type hint for the case parameter; add a brief docstring for the class
describing its purpose (a simple testing adapter that returns a fixed response),
inline comments for each lifecycle method (prepare, execute, cleanup) explaining
when they're called, and annotate execute(self, case: <appropriate_case_type>,
trace_log: TraceLog) -> AdapterOutcome using the correct Case type from the
project (import it if needed) so newcomers can understand usage and types.

Owner

@aviralgarg05 aviralgarg05 left a comment


Pls fix the issues

