
Conversation


@dharapandya85 dharapandya85 commented Dec 10, 2025

Description

Please include a summary of the changes and the related issue. Include relevant motivation and context.
This PR adds a new basic evaluation example using FakeAdapter to run simple evaluations using AgentUnit.
Fixes #9 (issue)

Type of Change

Please delete options that are not relevant:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test coverage improvement
  • Build/CI improvement

Changes Made

Please provide a detailed list of changes:

  • Added a new example: basic_evaluation.py
  • Applied YAML formatting fixes to .pre-commit-config.yaml
  • Introduced a minimal evaluation flow using FakeAdapter

Testing

Please describe the tests you ran to verify your changes:

  • Existing test suite passes (pytest)
  • Added new tests for new functionality
  • Manual testing performed
  • Tested with multiple Python versions (3.10, 3.11, 3.12)

Test Configuration

  • Python version: 3.12
  • Operating System: Windows (WSL2 / Ubuntu)
  • Relevant adapters tested: FakeAdapter

Test Results

# Paste relevant test output or results

Code Quality

  • My code follows the project's style guidelines (Ruff, Black)
  • I have run pre-commit run --all-files and addressed any issues
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have added type hints where appropriate
  • I have updated docstrings following Google style

Documentation

  • I have updated relevant documentation (README, docs/, inline comments)
  • I have added/updated docstrings for new/modified functions
  • I have updated CHANGELOG.md under [Unreleased]
  • I have added examples if introducing new features

Breaking Changes

If this PR introduces breaking changes, please describe:

  • What breaks:
  • Migration path for users:
  • Deprecation warnings added (if applicable):

Dependencies

  • No new dependencies added
  • New dependencies added (list below with justification)

New dependencies:

  • package-name: Reason for adding

Performance Impact

Describe any performance implications:

  • No performance impact
  • Performance improvement (describe and provide benchmarks)
  • Potential performance regression (describe and justify)

Screenshots/Recordings (if applicable)

Add screenshots or recordings to help explain your changes:

Additional Context

Add any other context about the PR here:

  • Links to related issues or PRs
  • References to external documentation
  • Design decisions and trade-offs
  • Known limitations or future work

Checklist

  • I have read the CONTRIBUTING.md guide
  • My branch name follows the convention (feature/, fix/, docs/, etc.)
  • My commit messages follow the conventional commit format
  • I have tested my changes locally
  • I have updated the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass
  • I have checked for security vulnerabilities in any new dependencies

Reviewer Notes

Please pay special attention to:

Post-Merge Tasks

Tasks to complete after merging (if any):

  • Update external documentation
  • Announce in discussions/community
  • Create follow-up issues for future work

Summary by CodeRabbit

  • Documentation
    • Added a runnable example demonstrating a complete end-to-end evaluation workflow using a minimal dataset and a mock adapter. Shows how to define evaluation cases and scenarios, execute a run, and produce a readable summary that includes the scenario name and success rate.



continue bot commented Dec 10, 2025

Keep this PR in a mergeable state →


All Green is an AI agent that automatically:

✅ Addresses code review comments

✅ Fixes failing CI checks

✅ Resolves merge conflicts


coderabbitai bot commented Dec 10, 2025

Walkthrough

Adds a new example script at src/agentunit/examples/basic_evaluation.py that implements a minimal end-to-end evaluation: defines FakeAdapter (with prepare, execute, cleanup), builds a single-case dataset, constructs a Scenario and Runner, runs the suite, and prints a summary via main().

Changes

Cohort / File(s): Example script (src/agentunit/examples/basic_evaluation.py)
Summary: New file: introduces FakeAdapter(BaseAdapter) with __init__(response: str), prepare(), execute(case, trace_log) -> AdapterOutcome, and cleanup(). Adds main() that creates one DatasetCase and dataset source, constructs a Scenario using the fake adapter, runs evaluation with Runner.run_suite(), extracts the scenario result, and prints the scenario name and success rate.
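
For orientation, here is a minimal sketch of the flow this file appears to implement, pieced together from the API details quoted elsewhere in this review (DatasetCase fields, DatasetSource.from_list, Scenario, Runner, and the result access pattern). Exact signatures, the runner entry point (run vs. run_suite, which varies across revisions in this thread), and the shape of the result object are assumptions, not the merged code:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.runner import Runner
from agentunit.core.scenario import Scenario
from agentunit.datasets.base import DatasetCase, DatasetSource


class FakeAdapter(BaseAdapter):
    """Mock adapter that always returns a fixed response."""

    def __init__(self, response: str):
        self.response = response

    def prepare(self) -> None:
        pass  # no setup needed for the fake adapter

    def execute(self, case, trace_log) -> AdapterOutcome:
        # Return the canned response regardless of the case's query
        return AdapterOutcome(success=True, output=self.response, error=None)

    def cleanup(self) -> None:
        pass  # no resources to release


def main() -> None:
    # One test case whose expected output matches the adapter's fixed response
    cases = [DatasetCase(id="case_1", query="hello", expected_output="hello")]
    scenario = Scenario(
        name="Basic Evaluation Example",
        adapter=FakeAdapter(response="hello"),
        dataset=DatasetSource.from_list(cases),
    )
    result = Runner([scenario]).run()
    scenario_result = result.scenarios[0]
    print(f"Scenario: {scenario.name}")
    print(f"Success rate: {scenario_result.success_rate:.0%}")


if __name__ == "__main__":
    main()
```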

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Script as basic_evaluation.py
  participant Runner
  participant Scenario
  participant Adapter as FakeAdapter
  participant Evaluator

  rect rgb(240,248,255)
    User->>Script: python examples/basic_evaluation.py
    Script->>Scenario: construct(datasetSource, adapter=FakeAdapter)
    Script->>Runner: Runner([Scenario]).run_suite()
  end

  rect rgb(245,255,245)
    loop per dataset case
      Runner->>Scenario: run_case(case)
      Scenario->>Adapter: prepare()
      Adapter-->>Scenario: prepared
      Scenario->>Adapter: execute(case, trace_log)
      Adapter-->>Scenario: AdapterOutcome (response, trace)
      Scenario->>Evaluator: evaluate(expected, response)
      Evaluator-->>Scenario: result (pass/fail, metrics)
      Scenario->>Adapter: cleanup()
    end
  end

  rect rgb(255,248,240)
    Runner-->>Script: aggregated scenario results
    Script->>User: print summary (scenario name, success rate)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • aviralgarg05

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Out of Scope Changes check (⚠️ Warning): PR includes YAML formatting fixes to .pre-commit-config.yaml beyond the scope of issue #9, which requested only the basic evaluation example. Resolution: remove the unrelated .pre-commit-config.yaml changes or justify them as necessary maintenance in a separate issue/PR.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 56.64%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
  • Title check (✅ Passed): Title clearly and concisely describes the main change: adding a basic evaluation example with FakeAdapter.
  • Description check (✅ Passed): PR description covers key sections but is missing testing results, updated docstrings, documentation updates, CHANGELOG, and local testing confirmation.
  • Linked Issues check (✅ Passed): PR successfully addresses issue #9: creates examples/basic_evaluation.py with FakeAdapter, includes comments, confirms no new dependencies, and reports manual testing.
✨ Finishing touches
  • 📝 Generate docstrings



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a2bce5 and 0ecd431.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • src/agentunit/examples/basic_evaluation.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (1)
src/agentunit/datasets/base.py (1)
  • name (38-39)
🔇 Additional comments (4)
src/agentunit/examples/basic_evaluation.py (3)

1-1: LGTM!

The imports are clean and appropriate for the example.


4-9: Verify the adapter interface signature.

The FakeAdapter.run() method accepts a single prompt: str parameter. Please verify that this matches the expected interface for model adapters in the agentunit library, as the actual adapter interface might expect a different signature (e.g., batch processing with multiple prompts, additional parameters, or a different return type).
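
For reference, the adapter contract as it is described later in this thread uses prepare/execute/cleanup rather than a single run() or generate() method. The following is a sketch of that described interface, not the library's verbatim definition; signatures and defaults are assumptions:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.trace import TraceLog
from agentunit.datasets.base import DatasetCase


class SketchAdapter(BaseAdapter):
    """Illustrative adapter following the contract described in later review comments."""

    name = "sketch"

    def prepare(self) -> None:
        # Lazy setup (clients, connections, ...)
        pass

    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
        # Process one dataset case and report success/output
        return AdapterOutcome(success=True, output="stub")

    def cleanup(self) -> None:
        # Optional resource cleanup; default is a no-op
        pass
```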


12-32: Verify the Evaluation API parameter name and type in the Evaluation class definition.

The code at line 20 creates an evaluation with task=[task], passing a list to a parameter named task (singular). Confirm:

  • Whether the parameter should be named task or tasks
  • Whether it expects a single Task object or a list

The logic and comments are otherwise clear and well-structured for a beginner example.

.pre-commit-config.yaml (1)

1-82: Confirm the Python version change from python3.10 to python3 is intentional.

The formatting changes align well with the YAML formatting fixes mentioned in the PR. However, verify that changing default_language_version.python from python3.10 to python3 on line 68 is intentional and compatible with your project's minimum Python version requirement. Using python3 is more flexible but less specific—ensure this doesn't create issues if your CI environment has multiple Python versions installed.

Owner

@aviralgarg05 aviralgarg05 left a comment


Solve the issues pls


codecov-commenter commented Dec 23, 2025

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (13)
README.md (1)

44-50: Improve formatting of "Run CI locally" section.

The section has minor formatting issues:

  • Line 45: Missing space after comma in "Python 3.10+,Poetry"
  • The section title uses ## but content lacks proper structure
🔎 Suggested formatting improvements
-##  Run CI locally
-Prerequisites: Python 3.10+,Poetry
+## Run CI locally
+
+Prerequisites: Python 3.10+, Poetry
+
 - poetry install --with dev
 - poetry check
 - poetry run ruff check .
 - poetry run ruff format --check .
 - poetry run pytest
tests/integration/conftest.py (2)

8-16: Add type hints to pytest hook functions.

The hook functions are missing type annotations for their parameters and return types. While pytest hooks have flexible signatures, adding type hints improves code clarity and enables better IDE support.

🔎 Proposed type hints
-def pytest_configure(config):
+def pytest_configure(config: pytest.Config) -> None:
     """Configure pytest markers for integration tests."""
     config.addinivalue_line(

19-23: Add type hints to pytest hook function.

The hook function is missing type annotations. Additionally, consider whether the string matching approach on line 22 is robust enough for your use case.

🔎 Proposed type hints
-def pytest_collection_modifyitems(config, items):
+def pytest_collection_modifyitems(config: pytest.Config, items: list[pytest.Item]) -> None:
     """Automatically mark integration tests."""
tests/test_pytest_plugin.py (2)

12-31: Consider adding type hints to test adapter.

The SimpleTestAdapter class and its methods lack type annotations. While not strictly required for test code, type hints improve maintainability and catch potential type-related errors during development.

🔎 Proposed type hints
+from typing import Any, Callable
+from agentunit.datasets.base import DatasetCase
+from agentunit.core.trace import TraceLog
+
 class SimpleTestAdapter(BaseAdapter):
     """Simple adapter for testing."""
 
     name = "test"
 
-    def __init__(self, agent_func):
+    def __init__(self, agent_func: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
         self.agent_func = agent_func
 
-    def prepare(self):
+    def prepare(self) -> None:
         pass
 
-    def execute(self, case, trace):
+    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:

33-57: Consider adding type hints to mock classes.

The mock classes (MockConfig, MockSession, MockParent) lack type annotations. Adding type hints would improve test readability and help catch mock-related issues during development.

src/agentunit/pytest/cli.py (2)

46-157: Consider externalizing template strings.

The 106-line Python code template (lines 50-155) is embedded directly in the code, which impacts maintainability:

  • Hard to read and edit the template
  • Syntax highlighting doesn't work inside the string
  • Difficult to test the template independently
  • Changes to the template clutter git diffs

Consider moving templates to external files (e.g., a templates/ directory) or at minimum extracting them to module-level constants.

💡 Example approach
# At module level or in a separate templates.py file
EXAMPLE_SCENARIO_TEMPLATE = '''"""Example AgentUnit scenarios for pytest plugin."""
...
'''

def _create_example_files(eval_dir: Path) -> None:
    """Create example scenario files."""
    example_file = eval_dir / "example_scenarios.py"
    if not example_file.exists():
        example_file.write_text(EXAMPLE_SCENARIO_TEMPLATE)
        click.echo(f"Created {example_file}")

Alternatively, use a templates directory with importlib.resources or pathlib to load template files at runtime.
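
A rough sketch of the importlib.resources approach, assuming templates are packaged under a hypothetical agentunit.pytest.templates package (the package name and .tmpl file layout are illustrative, not existing project structure):

```python
from importlib import resources
from pathlib import Path

import click


def _load_template(name: str) -> str:
    # Read a template file shipped inside the (hypothetical) templates package
    return resources.files("agentunit.pytest.templates").joinpath(name).read_text()


def _create_example_files(eval_dir: Path) -> None:
    """Create example scenario files from packaged templates."""
    example_file = eval_dir / "example_scenarios.py"
    if not example_file.exists():
        example_file.write_text(_load_template("example_scenarios.py.tmpl"))
        click.echo(f"Created {example_file}")
```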


159-191: Consider externalizing README template.

Similar to the Python template above, the README template (29 lines) should be externalized for better maintainability.

src/agentunit/pytest/plugin.py (3)

47-63: Consider using a public attribute or pytest's stash for cache storage.

Storing on config._agentunit_cache uses a private attribute pattern (_ prefix) which works but is unconventional. Pytest's config.stash (introduced in pytest 7.0) is the recommended approach for plugins to store data.

🔎 Optional: Use pytest stash for cleaner integration
+from _pytest.stash import StashKey
+
+_cache_key: StashKey[ScenarioCache] = StashKey()
+
 def pytest_configure(config: Config) -> None:
     """Configure pytest with AgentUnit markers."""
     config.addinivalue_line("markers", "agentunit: mark test as an AgentUnit scenario evaluation")
     config.addinivalue_line("markers", "scenario(name): mark test with specific scenario name")

     # Initialize cache
     root_path = config.rootpath
     no_cache = config.getoption("agentunit_no_cache", default=False)
     clear_cache = config.getoption("agentunit_clear_cache", default=False)

     cache = ScenarioCache(root_path, enabled=not no_cache)
-    config._agentunit_cache = cache  # type: ignore[attr-defined]
+    config.stash[_cache_key] = cache

79-82: Verify directory matching logic handles edge cases.

The check "tests" in parts and "eval" in parts will match files like some/eval/other/tests/file.py where "tests" and "eval" aren't adjacent. Consider checking for the specific pattern tests/eval.

🔎 Proposed fix for stricter directory matching
 def _is_eval_directory(file_path: Path) -> bool:
     """Check if file is in tests/eval/ directory."""
     parts = file_path.parts
-    return "tests" in parts and "eval" in parts
+    # Check for "tests" followed directly by "eval" in the path
+    for i, part in enumerate(parts[:-1]):
+        if part == "tests" and parts[i + 1] == "eval":
+            return True
+    return False

88-101: Broad exception handling may hide important errors.

Catching all Exception types when discovering scenarios and silently converting to a failing test item could mask configuration issues. Consider logging the exception or being more selective about which exceptions to catch.

🔎 Proposed improvement for better error visibility
         try:
             scenarios = self._discover_scenarios()
             for scenario in scenarios:
                 yield AgentUnitItem.from_parent(self, name=scenario.name, scenario=scenario)
         except Exception as e:
             # If we can't load scenarios, create a single failing test
+            logger.warning(f"Failed to load scenarios from {self.path}: {e}")
             yield AgentUnitItem.from_parent(
                 self,
                 name=f"load_error_{self.path.stem}",
                 scenario=None,
                 load_error=str(e),
             )
src/agentunit/pytest/cache.py (1)

107-143: Cache invalidation logic may have a gap.

On lines 127-132, the source hash comparison only invalidates if both hashes exist. If the cached entry has a source_hash but the current file doesn't exist (current_source_hash is None), the cache will still be used. This could be intentional, but it might be surprising if the source file is deleted.

🔎 Consider invalidating when source file is removed
             if current_source_hash and cached_source_hash:
                 if current_source_hash != cached_source_hash:
                     logger.info(
                         f"Cache invalidated for scenario '{scenario.name}' (source file changed)"
                     )
                     return None
+            elif cached_source_hash and not current_source_hash:
+                # Source file was removed or became unreadable
+                logger.info(
+                    f"Cache invalidated for scenario '{scenario.name}' (source file removed)"
+                )
+                return None
tests/integration/test_langgraph_integration.py (2)

93-106: Hardcoded path may cause issues in different environments.

Line 96 uses a hardcoded path "tests/integration/simple_langgraph_agent.py". This assumes the test is run from the repository root. Consider using Path(__file__).parent for a more robust path resolution.

🔎 Proposed fix for portable path resolution
     def test_langgraph_scenario_from_python_file(self):
         """Test creating a scenario from a Python file."""
-        # Create a temporary Python file with the agent
-        agent_file = Path("tests/integration/simple_langgraph_agent.py")
+        # Use path relative to this test file
+        agent_file = Path(__file__).parent / "simple_langgraph_agent.py"

         scenario = Scenario.load_langgraph(
             path=agent_file,

170-186: Test doesn't verify retry behavior.

The test sets scenario.retries = 2 but uses a non-failing agent, so retries are never exercised. This verifies the retries attribute can be set but doesn't test the actual retry mechanism.

Consider adding a test with an agent that fails intermittently to verify the retry logic is actually invoked.
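
One way such a test could look, sketched here: an adapter that fails on its first execute call and succeeds afterwards, so a retry count of 2 should still yield a pass. The constructor and runner calls follow the signatures quoted elsewhere in this review and are assumptions; the assertion only checks that execute was invoked more than once when retries are configured:

```python
from agentunit.adapters.base import AdapterOutcome, BaseAdapter
from agentunit.core.runner import Runner
from agentunit.core.scenario import Scenario
from agentunit.datasets.base import DatasetCase, DatasetSource


class FlakyAdapter(BaseAdapter):
    """Fails on the first call, then succeeds, to exercise retry handling."""

    name = "flaky"

    def __init__(self) -> None:
        self.calls = 0

    def prepare(self) -> None:
        pass

    def execute(self, case, trace) -> AdapterOutcome:
        self.calls += 1
        if self.calls == 1:
            return AdapterOutcome(success=False, output=None, error="transient failure")
        return AdapterOutcome(success=True, output=case.expected_output)

    def cleanup(self) -> None:
        pass


def test_retry_logic_is_invoked():
    adapter = FlakyAdapter()
    dataset = DatasetSource.from_list(
        [DatasetCase(id="retry-case", query="hi", expected_output="hi")]
    )
    scenario = Scenario(name="flaky-retry", adapter=adapter, dataset=dataset)
    scenario.retries = 2
    Runner([scenario]).run()
    # If retries are honoured, the adapter must have been called more than once
    assert adapter.calls >= 2
```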

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ecd431 and fcd60d9.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (30)
  • README.md
  • docs/pytest-plugin.md
  • examples/basic_evaluation.py
  • pyproject.toml
  • src/agentunit/adapters/base.py
  • src/agentunit/core/__init__.py
  • src/agentunit/core/exceptions.py
  • src/agentunit/core/replay.py
  • src/agentunit/core/runner.py
  • src/agentunit/core/scenario.py
  • src/agentunit/core/trace.py
  • src/agentunit/examples/basic_evaluation.py
  • src/agentunit/py.typed
  • src/agentunit/pytest/__init__.py
  • src/agentunit/pytest/cache.py
  • src/agentunit/pytest/cli.py
  • src/agentunit/pytest/plugin.py
  • tests/__init__.py
  • tests/eval/__init__.py
  • tests/eval/example_scenarios.py
  • tests/eval/failing_scenario.py
  • tests/integration/IMPLEMENTATION_SUMMARY.md
  • tests/integration/README.md
  • tests/integration/__init__.py
  • tests/integration/ci-example.yml
  • tests/integration/conftest.py
  • tests/integration/simple_langgraph_agent.py
  • tests/integration/test_integration_basic.py
  • tests/integration/test_langgraph_integration.py
  • tests/test_pytest_plugin.py
💤 Files with no reviewable changes (1)
  • src/agentunit/py.typed
✅ Files skipped from review due to trivial changes (8)
  • tests/eval/__init__.py
  • tests/integration/__init__.py
  • tests/__init__.py
  • src/agentunit/core/exceptions.py
  • src/agentunit/core/__init__.py
  • src/agentunit/core/replay.py
  • src/agentunit/core/runner.py
  • tests/integration/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (8)
src/agentunit/pytest/__init__.py (1)
src/agentunit/pytest/plugin.py (3)
  • pytest_addoption (28-44)
  • pytest_collect_file (66-76)
  • pytest_configure (47-63)
src/agentunit/adapters/base.py (2)
src/agentunit/datasets/base.py (1)
  • DatasetCase (19-27)
src/agentunit/core/trace.py (1)
  • TraceLog (26-78)
examples/basic_evaluation.py (1)
src/agentunit/adapters/base.py (1)
  • BaseAdapter (28-69)
tests/integration/test_integration_basic.py (1)
tests/integration/simple_langgraph_agent.py (1)
  • invoke_agent (95-129)
src/agentunit/pytest/cache.py (2)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (1)
  • name (38-39)
tests/eval/failing_scenario.py (3)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/adapters/base.py (2)
  • AdapterOutcome (18-25)
  • BaseAdapter (28-69)
src/agentunit/datasets/base.py (3)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • name (38-39)
tests/integration/simple_langgraph_agent.py (1)
tests/test_framework_adapters.py (1)
  • agent (90-92)
src/agentunit/pytest/plugin.py (4)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/core/runner.py (1)
  • run_suite (101-111)
src/agentunit/core/exceptions.py (1)
  • AgentUnitError (8-11)
src/agentunit/pytest/cache.py (3)
  • ScenarioCache (33-192)
  • clear (178-192)
  • get (107-143)
🪛 LanguageTool
docs/pytest-plugin.md

[grammar] ~29-~29: Ensure spelling is correct
Context: ...l, .json`): Loads scenarios using the nocode module ### Python Scenario Files Crea...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (38)
src/agentunit/core/trace.py (1)

1-3: LGTM! Docstring formatting improvements.

The conversion from single-line to multi-line docstrings improves readability and aligns with PEP 257 conventions for class and module documentation.

Also applies to: 16-18, 27-29

src/agentunit/adapters/base.py (1)

35-42: LGTM! Enhanced docstrings improve adapter implementation guidance.

The expanded docstrings clearly document method contracts, parameter types, and return values, making it easier for developers to implement custom adapters.

Also applies to: 46-55, 58-66

README.md (1)

203-220: LGTM! Expanded test documentation improves developer experience.

The expanded test command examples and integration test documentation provide clear guidance for contributors on running different test suites.

tests/integration/simple_langgraph_agent.py (1)

8-34: LGTM! Well-structured integration test helper with proper fallback handling.

The implementation provides excellent defensive programming:

  • Graceful degradation when LangGraph is unavailable
  • Clear mock responses for testing without dependencies
  • Proper type stubs for import fallback scenarios

This pattern enables running tests without requiring all optional dependencies installed, which aligns well with the integration test design mentioned in the README.

Also applies to: 46-93, 95-129

src/agentunit/core/scenario.py (1)

1-3: LGTM! Consistent docstring formatting improvements.

The multi-line docstring formatting aligns with the changes across other core modules and improves documentation readability.

Also applies to: 24-26, 82-84, 100-102

src/agentunit/pytest/__init__.py (1)

1-6: LGTM! Clean pytest plugin entry point.

The module correctly exposes pytest hooks for plugin registration. The structure follows pytest plugin conventions with explicit __all__ exports.

tests/integration/ci-example.yml (1)

30-30: Pytest markers are properly registered. Both integration and langgraph markers are defined in pyproject.toml under [tool.pytest.ini_options].markers (lines 65-66), and the configuration includes --strict-markers which enforces marker registration to prevent warnings.

examples/basic_evaluation.py (1)

10-11: The imports in basic_evaluation.py are already correct and no changes are needed.

The current file imports from agentunit directly (from agentunit import DatasetCase, Runner, Scenario), which correctly uses the exported classes from the top-level package. The problematic imports mentioned in the review (agentunit.core.adapters and agentunit.core.evaluator) do not exist in the current version of this file. The script runs without errors as-is.

Likely an incorrect or invalid review comment.

tests/test_pytest_plugin.py (1)

236-446: Excellent cache test coverage!

The cache tests comprehensively cover creation, hits, disabled mode, invalidation on source changes, clearing, failure storage, and key uniqueness. This thorough testing ensures the caching functionality is reliable.

tests/integration/test_integration_basic.py (2)

10-20: LGTM - Clear directory structure validation.

The test verifies that all required integration test files are present, which is useful for catching incomplete test setups.


23-47: LGTM - Comprehensive fallback testing.

The test properly validates both the success and fallback paths for the LangGraph integration, ensuring the agent works with and without the optional LangGraph dependency.

src/agentunit/pytest/cli.py (1)

10-43: LGTM - Clean CLI command structure.

The init_eval command is well-structured with clear options and helpful user guidance. The implementation correctly creates the directory structure and provides next steps.

pyproject.toml (4)

37-41: LGTM - Well-structured optional dependencies.

The addition of langgraph as an optional dependency with a dedicated integration-tests extra follows best practices. This allows users to install the package without LangGraph unless they specifically need integration testing capabilities.


50-53: LGTM - Proper pytest plugin registration.

The script entry point and pytest plugin registration are correctly configured:

  • agentunit-init-eval script points to the CLI command
  • pytest11 plugin entry point properly registers the AgentUnit plugin

This follows the standard pytest plugin registration pattern.


63-78: Excellent pytest configuration!

The pytest configuration is comprehensive and follows best practices:

  • Clear marker definitions with descriptions
  • Appropriate testpaths configuration
  • Standard test discovery patterns
  • Strict markers and config enforcement
  • Useful reporting options with -ra

This configuration provides a solid foundation for the test infrastructure.


27-27: The codebase does not directly use the langchain package. It imports langsmith (a separate LangChain-maintained package) for monitoring and tracing. The langchain dependency appears to be a transitive dependency through crewai. Since there are no direct langchain APIs in the codebase, the widened version constraint poses no compatibility risk.

Likely an incorrect or invalid review comment.

tests/integration/IMPLEMENTATION_SUMMARY.md (1)

1-131: Excellent implementation documentation!

This summary provides a clear and comprehensive overview of the integration test implementation, including completed tasks, acceptance criteria, file structure, test coverage, and usage examples. This will be valuable for maintainers and contributors.

docs/pytest-plugin.md (1)

69-82: Remove duplicate MyDataset definition.

The MyDataset class definition is duplicated with inconsistent formatting, making the example confusing.

🔎 Proposed fix to remove duplication
 class MyDataset(DatasetSource):
     def __init__(self):
         super().__init__(name="my-dataset", loader=self._generate_cases)
-
     
     def _generate_cases(self):
         return [

Likely an incorrect or invalid review comment.

src/agentunit/pytest/plugin.py (4)

1-26: LGTM! Clean module setup with proper type annotations.

The imports are well-organized with TYPE_CHECKING guard for type-only imports, and the logging setup follows best practices.


28-44: LGTM! Command-line options are well-defined.

The cache options (--no-cache, --clear-cache) are appropriately grouped and documented.


199-245: Good implementation of scenario execution with caching.

The runtest method correctly handles cached results, runs fresh scenarios when needed, collects failures, and caches results. The flow is clear and well-structured.


247-255: LGTM! Standard pytest item methods.

repr_failure and reportinfo are correctly implemented for proper test reporting.

tests/eval/example_scenarios.py (5)

1-6: LGTM! Clean imports for example scenarios.

The imports correctly reference the agentunit package components needed for the examples.


8-26: Well-structured adapter implementation.

The SimpleAdapter correctly wraps function-based agents, handles execution results, and properly catches exceptions to return failure outcomes.


29-49: Good example dataset with clear test cases.

The ExampleDataset provides meaningful test cases with appropriate metadata. The structure is easy to understand for newcomers.


52-69: Example agent and scenario are well-documented.

The example_agent demonstrates a simple pattern, and the example_scenario is properly configured for auto-discovery by the pytest plugin.


72-105: Good factory function pattern demonstration.

The scenario_math_focused factory shows how to create scenarios dynamically with inner classes and functions. This is a useful pattern for the pytest plugin's discovery mechanism.

src/agentunit/pytest/cache.py (6)

1-21: LGTM! Well-organized module setup.

Good use of constants for cache directory name and version, enabling future cache format migrations.


23-30: LGTM! Clean dataclass definition.

CachedResult is properly defined with appropriate fields and optional source_hash.


41-48: Good practice adding .gitignore to cache directory.

This prevents accidental commits of cache files. The implementation is clean.


50-90: Potential side effect: iter_cases() may consume generator or trigger I/O.

Line 72 calls list(scenario.dataset.iter_cases()) which could trigger side effects (e.g., file reads, API calls) just for computing a cache key. If the dataset is expensive to iterate or is a one-time-use generator, this could cause issues.

Consider whether caching the case count or a dataset fingerprint separately would be safer, or document this behavior.
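
A possible direction, sketched under the assumption that the dataset exposes a stable name (as suggested by the datasets/base.py name property referenced in this review): fold the dataset name and the source hash this cache module already computes into the key, instead of iterating cases. The helper name and exact inputs are illustrative:

```python
import hashlib


def _dataset_fingerprint(scenario, source_hash: str | None) -> str:
    # Avoid calling iter_cases(): combine the scenario name, the dataset's
    # declared name, and the already-computed source-file hash.
    parts = [scenario.name, getattr(scenario.dataset, "name", ""), source_hash or ""]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```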


145-176: LGTM! Robust cache writing with proper error handling.

The set method correctly handles disabled cache, ensures directory exists, and catches write errors without crashing.


178-192: LGTM! Clean cache clearing implementation.

The clear method handles missing directory gracefully and reports the count of removed files.

tests/integration/test_langgraph_integration.py (5)

1-15: LGTM! Proper conditional test skipping.

Using pytest.importorskip for LangGraph ensures tests only run when the dependency is available, which is the correct pattern for optional integration tests.


17-51: LGTM! Well-structured test dataset.

The SimpleDataset provides diverse test cases with appropriate metadata and difficulty levels for comprehensive testing.


59-91: Good coverage of scenario creation and execution.

The test verifies scenario creation from a callable, adapter preparation, and execution with assertion on output content.


188-204: LGTM! Good test for immutability of clone.

The test correctly verifies that cloning creates a new scenario with modified attributes while preserving the original.


206-239: LGTM! Good coverage of adapter registry and multiple scenarios.

The module-level tests verify adapter registration and the ability to run multiple scenarios together.

Comment on lines 41 to 67
class SimpleAdapter(BaseAdapter):
    name = "simple"

    def __init__(self, agent_func):
        self.agent_func = agent_func

    def prepare(self):
        pass

    """Simple adapter for function-based agents."""

    name = "simple"

    def __init__(self, agent_func):
        self.agent_func = agent_func

    def prepare(self):
        pass

    def execute(self, case, trace):
        try:
            result = self.agent_func({"query": case.query})
            output = result.get("result", "")
            success = output == case.expected_output
            return AdapterOutcome(success=success, output=output)
        except Exception as e:
            return AdapterOutcome(success=False, output=None, error=str(e))

⚠️ Potential issue | 🟠 Major

Remove duplicate SimpleAdapter definition.

The SimpleAdapter class is defined twice in the same code example:

  • Lines 41-50: First definition (incomplete)
  • Lines 50-67: Second definition (complete)

This duplication makes the documentation confusing and harder to maintain.

🔎 Proposed fix to remove duplication
 class SimpleAdapter(BaseAdapter):
+    """Simple adapter for function-based agents."""
+    
     name = "simple"
-
+    
     def __init__(self, agent_func):
         self.agent_func = agent_func
-
+    
     def prepare(self):
         pass
-
-    """Simple adapter for function-based agents."""
-    
-    name = "simple"
-    
-    def __init__(self, agent_func):
-        self.agent_func = agent_func
-    
-    def prepare(self):
-        pass
     
     def execute(self, case, trace):
🤖 Prompt for AI Agents
In docs/pytest-plugin.md around lines 41 to 67 there is a duplicated
SimpleAdapter class (first incomplete definition at ~41–50 and a full definition
at ~50–67); remove the earlier incomplete duplicate so only the complete class
definition remains (keep the docstring, name, __init__, prepare, execute and
exception handling from the full definition), and ensure there are no leftover
duplicate imports/comments after deletion.

Comment on lines 236 to 266
## Configuration

Add pytest configuration in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
markers = [
"agentunit: marks test as an AgentUnit scenario evaluation",
"scenario(name): marks test with specific scenario name",
]
testpaths = ["tests", "tests/eval"]
```

## Example Directory Structure

```
project/
├── tests/
│ ├── eval/ # AgentUnit scenarios
│ │ ├── __init__.py
│ │ ├── basic_scenarios.py
│ │ └── advanced_scenarios.py
│ └── test_regular.py
├── src/
│ └── myproject/
└── pyproject.toml
```

# Run with coverage
pytest tests/eval/ --cov=myproject
```

⚠️ Potential issue | 🟠 Major

Remove duplicate sections.

The "Configuration" and "Example Directory Structure" sections (lines 236-262) are duplicated from earlier in the document (lines 148-175). Additionally, there's an incomplete code block at the end (lines 264-266) that appears to be a fragment from another section.

🔎 Proposed fix

Remove lines 236-266 entirely, as these sections are already covered earlier in the documentation.

🤖 Prompt for AI Agents
In docs/pytest-plugin.md around lines 236 to 266, the "Configuration" and
"Example Directory Structure" sections are duplicates of earlier content and the
trailing code block fragment is incomplete; delete lines 236–266 entirely
(removing both duplicated sections and the stray code block), ensure surrounding
spacing/markdown remains valid after removal, and confirm the original
occurrences earlier in the file still contain the intended content.

Comment on lines 14 to 22
class FakeAdapter(BaseAdapter):
    """
    A simple mock adapter used only for demonstration.
    It returns a predictable output so evaluation is easy to understand.
    """

    def generate(self, prompt: str) -> str:
        # Always returns the same answer for simplicity
        return "Hello, this is a fake response!"

⚠️ Potential issue | 🔴 Critical

FakeAdapter does not implement the BaseAdapter interface correctly.

FakeAdapter must implement the abstract methods required by BaseAdapter:

  1. prepare(self) -> None - for lazy setup
  2. execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome - to process dataset cases
  3. cleanup(self) -> None - for resource cleanup (optional, has default no-op)

The current implementation only defines a generate(prompt: str) method, which is not part of the BaseAdapter interface. This will cause a TypeError at instantiation since abstract methods are not implemented.

🔎 Proposed fix: Implement required BaseAdapter methods
+from agentunit.adapters.base import BaseAdapter, AdapterOutcome
+from agentunit.core.trace import TraceLog
+from agentunit.datasets.base import DatasetCase
+

 class FakeAdapter(BaseAdapter):
     """
     A simple mock adapter used only for demonstration.
     It returns a predictable output so evaluation is easy to understand.
     """
+
+    name = "fake"
+
+    def prepare(self) -> None:
+        """No setup required for the fake adapter."""
+        pass
+
+    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
+        """Execute with a fake response."""
+        response = "Hello, this is a fake response!"
+        trace.record("agent_response", content=response)
+        return AdapterOutcome(
+            success=True,
+            output=response,
+            tool_calls=[],
+        )
+
+    def cleanup(self) -> None:
+        """No cleanup required for the fake adapter."""
+        pass
-
-    def generate(self, prompt: str) -> str:
-        # Always returns the same answer for simplicity
-        return "Hello, this is a fake response!"
🤖 Prompt for AI Agents
In examples/basic_evaluation.py around lines 14 to 22, FakeAdapter currently
defines only generate(prompt: str) which does not satisfy the BaseAdapter
abstract interface; implement the required methods by adding prepare(self) ->
None for lazy setup, execute(self, case: DatasetCase, trace: TraceLog) ->
AdapterOutcome that uses the existing generate logic to produce an
AdapterOutcome for the given DatasetCase (populate outputs, scores, and trace as
appropriate), and optionally cleanup(self) -> None (can be a no-op); ensure
method signatures and return types exactly match BaseAdapter so the class can be
instantiated without TypeError.

Comment on lines 155 to 170
def _import_module(self) -> Any:
    """Import Python module from file path."""
    try:
        import importlib.util
        import sys

        spec = importlib.util.spec_from_file_location(self.path.stem, self.path)
        if spec is None or spec.loader is None:
            return None

        module = importlib.util.module_from_spec(spec)
        sys.modules[self.path.stem] = module
        spec.loader.exec_module(module)
        return module
    except Exception:
        return None

⚠️ Potential issue | 🟡 Minor

Module name collision risk when importing multiple files with the same stem.

Using self.path.stem as the module name in sys.modules could cause collisions if two files have the same filename in different directories (e.g., tests/eval/foo/scenarios.py and tests/eval/bar/scenarios.py).

🔎 Proposed fix using unique module names
     def _import_module(self) -> Any:
         """Import Python module from file path."""
         try:
             import importlib.util
             import sys

-            spec = importlib.util.spec_from_file_location(self.path.stem, self.path)
+            # Use full path as module name to avoid collisions
+            module_name = f"agentunit_eval.{self.path.stem}_{hash(str(self.path))}"
+            spec = importlib.util.spec_from_file_location(module_name, self.path)
             if spec is None or spec.loader is None:
                 return None

             module = importlib.util.module_from_spec(spec)
-            sys.modules[self.path.stem] = module
+            sys.modules[module_name] = module
             spec.loader.exec_module(module)
             return module
         except Exception:
             return None
🤖 Prompt for AI Agents
In src/agentunit/pytest/plugin.py around lines 155 to 170, the code uses
self.path.stem as the module name which can collide when different files share
the same filename; change the import to use a unique module name (for example a
hash of the absolute path, the full path string, or append a UUID) when calling
spec_from_file_location and when inserting into sys.modules so each file gets a
distinct module identity; also ensure you use that same unique name for
module_from_spec and consider removing the entry from sys.modules on exceptions
or after use to avoid leaking stale entries.

Comment on lines 50 to 59
def always_wrong_agent(payload):
    """Agent that can answer the meaning of life question."""
    query = payload.get("query", "").lower()

    # Handle the meaning of life question
    if "meaning of life" in query:
        return {"result": "42"}

    # Default response for other queries
    return {"result": "I don't know"}

⚠️ Potential issue | 🔴 Critical

Critical logic error: scenario will pass, not fail.

The function name, file purpose, dataset name, and scenario name all indicate this should be a failing scenario for testing. However, the agent returns "42" when the query contains "meaning of life" (line 56), which exactly matches the expected output "42" (line 44). This will cause the scenario to PASS, not fail.

To create a proper failing scenario, the agent should return an incorrect answer for this test case.

🔎 Proposed fix to make the scenario actually fail
 def always_wrong_agent(payload):
-    """Agent that can answer the meaning of life question."""
+    """Agent that always gives wrong answers for testing failures."""
     query = payload.get("query", "").lower()
 
-    # Handle the meaning of life question
+    # Intentionally return wrong answer for the meaning of life question
     if "meaning of life" in query:
-        return {"result": "42"}
+        return {"result": "I don't know"}
 
     # Default response for other queries
-    return {"result": "I don't know"}
+    return {"result": "42"}
🤖 Prompt for AI Agents
In tests/eval/failing_scenario.py around lines 50 to 59, the agent currently
returns the correct answer "42" for queries containing "meaning of life", which
makes the scenario pass; change the handler so that when the query contains
"meaning of life" it returns an incorrect result (e.g., any value other than
"42" or a clearly wrong string) to ensure the test actually fails; keep the rest
of the default behavior unchanged.

Comment on lines 50 to 55
@pytest.mark.integration
def test_pytest_markers_configured():
    """Test that pytest markers are properly configured."""

    # This test itself should have the integration marker
    # We can't easily test this programmatically, but the test runner will validate it

⚠️ Potential issue | 🟡 Minor

Remove or enhance placeholder test.

This test contains no assertions and the comment admits it can't validate anything programmatically. A test that doesn't assert anything provides no value and can give false confidence in test coverage. Either remove it or add actual validation logic.

💡 Suggestion

Consider removing this test entirely, or enhance it to programmatically check for marker presence:

@pytest.mark.integration
def test_pytest_markers_configured():
    """Test that pytest markers are properly configured."""
    import inspect
    
    # Get the current test function
    current_test = inspect.currentframe()
    
    # This test should have the integration marker
    # If it runs, the marker is working (though we can't easily assert it here)
    # Instead, rely on the marker tests in test_pytest_plugin.py
    assert True  # Placeholder - marker validation happens via pytest collection

Or simply remove it and rely on tests/test_pytest_plugin.py::TestPytestPlugin::test_pytest_markers for marker validation.

🤖 Prompt for AI Agents
In tests/integration/test_integration_basic.py around lines 50-55 the test is a
placeholder with no assertions; either delete the test or replace it with a
programmatic check that the current test has the integration marker by accepting
the pytest "request" fixture and asserting that any(marker.name == "integration"
for marker in request.node.iter_markers()); update the docstring accordingly if
you keep it.

Owner

@aviralgarg05 aviralgarg05 left a comment


Fix the issues pls

@dharapandya85 dharapandya85 force-pushed the add-basic-evaluation-example branch from fcd60d9 to aa46eb6 Compare December 24, 2025 19:09
Author

dharapandya85 commented Dec 24, 2025

Fix the issues pls

I had many extra commits redirected to this branch during git fetch upstream, so I hard-reset the branch; now only my commit is visible.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
src/agentunit/examples/basic_evaluation.py (1)

12-36: Enhance comments for newcomer understanding.

The PR objectives specify that the example should "include explanatory comments for each step to aid newcomers." While the current comments mark each step, they could be more explanatory to help users who are new to AgentUnit understand what each component does and why.

Consider expanding the comments to explain:

  • What FakeAdapter is and why it's useful for testing (a simple mock that returns a fixed response)
  • The purpose and structure of DatasetCase fields (id, query, expected_output)
  • What the Runner does (orchestrates evaluation of scenarios)
  • What information the evaluation result provides
🔎 Suggested enhancement for explanatory comments
 def main() -> None:
-    # define simple dataset
+    # Define a simple dataset with test cases
+    # Each DatasetCase represents a single test with an input query and expected output
     cases = [
         DatasetCase(
             id="echo-task",
             query="Say hello",
             expected_output="hello",
         )
     ]
-    # create a scenario using the fake adapter
+    
+    # Create a scenario using FakeAdapter (a simple mock adapter for testing)
+    # FakeAdapter always returns the same fixed response, useful for demonstrations
     scenario = Scenario(
         name="Basic Evaluation Example",
         adapter=FakeAdapter(response="hello"),
         dataset=cases,
     )

-    # run evaluation
+    # Run the evaluation
+    # Runner orchestrates the execution of all test cases in the scenario
     runner = Runner(scenarios=[scenario])
     result = runner.run()

-    # print summary
+    # Print evaluation summary
+    # The result contains metrics like success rate and individual test outcomes
     print("=== Evaluation Summary ===")
     print(f"Scenario: {scenario.name}")
     print(f"Success rate: {result.success_rate:.0%}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa46eb6 and 119ac8c.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (3)
src/agentunit/datasets/base.py (2)
  • DatasetCase (19-27)
  • name (38-39)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
🔇 Additional comments (2)
src/agentunit/examples/basic_evaluation.py (2)

1-1: LGTM!

The imports are correctly structured for the example.


38-39: LGTM!

The main guard is correctly positioned at module level, allowing the script to execute when run directly. The indentation issue from the previous review has been properly addressed.

Owner

@aviralgarg05 aviralgarg05 left a comment


Please fix the issues

@dharapandya85 dharapandya85 force-pushed the add-basic-evaluation-example branch from 119ac8c to 3a427d0 Compare January 2, 2026 18:46

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a427d0 and 158dfdf.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (3)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (4)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • from_list (49-50)
  • name (38-39)
🔇 Additional comments (3)
src/agentunit/examples/basic_evaluation.py (3)

26-26: LGTM!

Correctly uses DatasetSource.from_list(cases) to wrap the dataset, matching the API shown in the relevant code snippets.


29-37: LGTM!

The Scenario and Runner setup correctly matches the API contracts from the relevant code snippets:

  • Scenario receives the required name, adapter, and dataset parameters
  • Runner is initialized with a list of scenarios
  • runner.run() is called with no arguments as expected

39-47: LGTM!

The result access pattern and main guard are correctly implemented:

  • Line 39 extracts the first scenario result from the suite
  • Line 43 accesses success_rate on the scenario result (matches the resolved pattern from past reviews)
  • Lines 46-47 place the main guard at module level with proper formatting

Comment on lines 21 to 24
        DatasetCase(
            input="hello",
            expected="hello",
        )

⚠️ Potential issue | 🔴 Critical

Fix DatasetCase field names to match the API.

The DatasetCase constructor arguments don't match the actual field names defined in the class. From the relevant code snippets, DatasetCase has:

  • id: str (required)
  • query: str (required, not input)
  • expected_output: str | None (optional, not expected)

This will cause a TypeError at runtime, violating the acceptance criteria that the script must run without errors.

🔎 Proposed fix for DatasetCase construction
     cases = [
         DatasetCase(
-            input="hello",
-            expected="hello",
+            id="case_1",
+            query="hello",
+            expected_output="hello",
         )
     ]
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around lines 21 to 24, the
DatasetCase constructor is using incorrect field names; replace the kwargs to
match the class API by adding a required id (e.g., "case-1"), rename input to
query, and rename expected to expected_output (or omit expected_output if you
want None). Ensure the resulting call is DatasetCase(id="...", query="hello",
expected_output="hello") or DatasetCase(id="...", query="hello") if no expected
output is needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/agentunit/examples/basic_evaluation.py (1)

22-48: Fix incorrect DatasetCase field name and add explanatory comments.

Two issues prevent this example from meeting the acceptance criteria:

  1. Critical: Line 27 uses input="hello", but DatasetCase expects query (not input). This will cause a TypeError at runtime.

  2. Essential: The acceptance criteria require "explanatory comments for each step aimed at newcomers," but the main function lacks these comments.

🔎 Proposed fix
 def main() -> None:
-    # define simple dataset
+    # Step 1: Define a simple dataset with one test case
+    # Each case has an id, a query (input), and expected output
     cases = [
         DatasetCase(
             id="case_1",
-            input="hello",
+            query="hello",
             expected_output="hello",
         )
     ]
     
-    # create a scenario using the fake adapter
+    # Step 2: Create a scenario that combines the dataset with an adapter
+    # The FakeAdapter will return "hello" for any input
     scenario = Scenario(
         name="Basic Evaluation Example",
         adapter=FakeAdapter(response="hello"),
         dataset=DatasetSource.from_list(cases),
     )
 
-    # run evaluation
+    # Step 3: Run the evaluation using the Runner
+    # The runner executes all scenarios and collects results
     runner = Runner([scenario])
     result = runner.run()
 
+    # Step 4: Extract and print the results
     scenario_result = result.scenarios[0]
-    # print summary
     print("=== Evaluation Summary ===")
     print(f"Scenario: {scenario.name}")
     print(f"Success rate: {scenario_result.success_rate:.0%}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 158dfdf and 9e67660.

📒 Files selected for processing (1)
  • src/agentunit/examples/basic_evaluation.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/agentunit/examples/basic_evaluation.py (5)
src/agentunit/adapters/base.py (2)
  • BaseAdapter (28-69)
  • AdapterOutcome (18-25)
src/agentunit/core/trace.py (1)
  • TraceLog (26-78)
src/agentunit/core/runner.py (1)
  • Runner (38-86)
src/agentunit/core/scenario.py (1)
  • Scenario (23-278)
src/agentunit/datasets/base.py (4)
  • DatasetCase (19-27)
  • DatasetSource (30-57)
  • name (38-39)
  • from_list (49-50)

@@ -0,0 +1,51 @@
from agentunit.adapters.base import BaseAdapter, AdapterOutcome
from agentunit.core.tracelog import TraceLog

⚠️ Potential issue | 🔴 Critical

Fix the incorrect import path for TraceLog.

The import references agentunit.core.tracelog, but based on the codebase structure, TraceLog is defined in agentunit.core.trace (not tracelog). This will cause an ImportError at runtime, preventing the script from executing.

🔎 Proposed fix
-from agentunit.core.tracelog import TraceLog
+from agentunit.core.trace import TraceLog
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around line 2, the import path is
incorrect: it imports TraceLog from agentunit.core.tracelog but TraceLog is
defined in agentunit.core.trace; update the import to import TraceLog from
agentunit.core.trace so the module path matches the codebase and prevents
ImportError at runtime.

Comment on lines +8 to +19
class FakeAdapter(BaseAdapter):
    def __init__(self, response: str):
        self.response = response

    def prepare(self):
        pass

    def execute(self, case, trace_log: TraceLog) -> AdapterOutcome:
        return AdapterOutcome(success=True, output=self.response, error=None)

    def cleanup(self):
        pass

🛠️ Refactor suggestion | 🟠 Major

Add explanatory comments and type hints for newcomers.

The acceptance criteria require that the code be "well-commented and understandable for newcomers." The FakeAdapter class lacks explanatory comments describing its purpose and how it implements the adapter pattern. Additionally, the case parameter in execute() is missing its type hint.

🔎 Proposed improvements
 class FakeAdapter(BaseAdapter):
+    """
+    A minimal adapter that returns a fixed response for any input.
+    Used for demonstration purposes in this example.
+    """
+    
     def __init__(self, response: str):
         self.response = response
 
     def prepare(self):
+        """Set up the adapter before execution (no-op for this example)."""
         pass
 
-    def execute(self, case, trace_log: TraceLog) -> AdapterOutcome:
+    def execute(self, case: DatasetCase, trace_log: TraceLog) -> AdapterOutcome:
+        """Execute a test case by returning the fixed response."""
         return AdapterOutcome(success=True, output=self.response, error=None)
 
     def cleanup(self):
+        """Clean up resources after execution (no-op for this example)."""
         pass
🤖 Prompt for AI Agents
In src/agentunit/examples/basic_evaluation.py around lines 8 to 19, the
FakeAdapter class lacks explanatory comments and the execute() method is missing
a type hint for the case parameter; add a brief docstring for the class
describing its purpose (a simple testing adapter that returns a fixed response),
inline comments for each lifecycle method (prepare, execute, cleanup) explaining
when they're called, and annotate execute(self, case: <appropriate_case_type>,
trace_log: TraceLog) -> AdapterOutcome using the correct Case type from the
project (import it if needed) so newcomers can understand usage and types.

Owner

@aviralgarg05 aviralgarg05 left a comment


Pls fix the issues

