Conversation

@xingyaoww (Collaborator) commented Nov 25, 2025

Summary

This PR introduces an experimental API-based critic system that enables real-time evaluation of agent actions and messages using an external LLM. The critic provides quality scores and feedback to help assess agent performance.

Example: uv run examples/01_standalone_sdk/34_critic_example.py

Key Features

1. API-Based Critic Implementation

  • APIBasedCritic: New critic that calls an external LLM API to evaluate agent actions
  • CriticClient: HTTP client for secure communication with critic server
  • ChatTemplateRenderer: Renders conversation history into Qwen3-4B-Instruct-2507 chat template format
  • Supports custom critic server URL, API key, and model configuration

2. Agent Integration

  • Added critic parameter to AgentBase for optional critic integration
  • Two evaluation modes (see the sketch after this list):
    • finish_and_message (default): Evaluate on FinishAction and agent MessageEvent
    • all_actions: Evaluate after every agent action
  • Critic results attached to ActionEvent and MessageEvent for visualization
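
A hedged sketch of how mode selection might look. This summary does not show the actual keyword for choosing a mode, so `evaluation_mode` below is an assumed parameter name, not confirmed by the PR:

# Sketch only: `evaluation_mode` is a hypothetical name for the mode-selection parameter
from openhands.sdk.critic import APIBasedCritic

critic = APIBasedCritic(
    server_url="https://your-critic-server.com/vllm",
    api_key="your-api-key",
    model_name="critic",
    evaluation_mode="all_actions",  # assumed keyword; the default mode is finish_and_message
)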

3. Enhanced Result Visualization

  • CriticResult now includes rich text formatting with color-coded probability displays
  • Supports both JSON probability messages and plain text feedback
  • Automatic sorting of probabilities by value for easier interpretation

4. Example and Documentation

  • Added examples/01_standalone_sdk/34_critic_example.py demonstrating critic usage
  • Auto-configuration support for All-Hands LLM proxy with /vllm endpoint
  • Comprehensive test coverage for template rendering, client validation, and display formatting

Usage Example

from openhands.sdk import Agent, LLM
from openhands.sdk.critic import APIBasedCritic

# Configure critic
critic = APIBasedCritic(
    server_url="https://your-critic-server.com/vllm",
    api_key="your-api-key",
    model_name="critic",
)

# Create agent with critic
agent = Agent(
    llm=LLM(model="anthropic/claude-sonnet-4-5", api_key="..."),
    tools=[...],
    critic=critic,  # Enable real-time evaluation
)
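
For context, a hedged continuation showing how the agent might then be driven. The Conversation constructor may take additional or different arguments; the point is that critic results then appear attached to the emitted events:

import os

from openhands.sdk import Conversation

# Sketch: run a task; each agent ActionEvent/MessageEvent then carries an attached CriticResult
conversation = Conversation(agent=agent, workspace=os.getcwd())
conversation.send_message("Write a script that prints the first 10 Fibonacci numbers.")
conversation.run()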

Technical Details

Chat Template Rendering

  • Implements Qwen3-4B-Instruct-2507 template format
  • Supports system messages, multi-turn conversations, and tool calls
  • Handles edge cases like empty messages and special characters
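
For orientation, a simplified illustration of the ChatML-style markup that Qwen chat templates produce; the real Qwen3-4B-Instruct-2507 template also covers tool calls and the edge cases listed above:

# Simplified illustration of Qwen's ChatML-style markup (<|im_start|>/<|im_end|> tags);
# the real template also handles tool calls, empty messages, and special characters.
messages = [
    {"role": "system", "content": "You are a critic model."},
    {"role": "user", "content": "Evaluate the agent's last action."},
]
# Rendered output looks roughly like:
# <|im_start|>system
# You are a critic model.<|im_end|>
# <|im_start|>user
# Evaluate the agent's last action.<|im_end|>
# <|im_start|>assistant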

Security

  • API keys stored as SecretStr to prevent accidental exposure
  • Validation ensures non-empty API keys
  • Keys not exposed in string representations
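
A minimal sketch of the SecretStr pattern described above, using an illustrative model rather than the SDK's actual class:

from pydantic import BaseModel, SecretStr, field_validator

class CriticClientConfig(BaseModel):  # hypothetical name, for illustration only
    api_key: SecretStr

    @field_validator("api_key")
    @classmethod
    def _non_empty(cls, v: SecretStr) -> SecretStr:
        # Reject empty or whitespace-only keys at construction time
        if not v.get_secret_value().strip():
            raise ValueError("API key must be non-empty")
        return v

cfg = CriticClientConfig(api_key="sk-real-key")
print(cfg)  # api_key=SecretStr('**********') -- the key is masked in str/repr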

Testing

  • 507 test cases for chat template rendering (regression tests against transformers)
  • Client API key validation tests
  • Display formatting and color highlighting tests

⚠️ Experimental Status

This feature is highly experimental and subject to change:

  • API and configuration may evolve based on feedback
  • Performance impact of real-time evaluation needs further assessment
  • Critic model quality and reliability still being evaluated
  • Not recommended for production use without thorough testing

Files Changed

  • New: openhands/sdk/critic/impl/api/ - API-based critic implementation
  • New: examples/01_standalone_sdk/34_critic_example.py - Usage example
  • Modified: openhands/sdk/agent/agent.py - Critic integration
  • Modified: openhands/sdk/agent/base.py - Add critic parameter
  • Modified: openhands/sdk/critic/base.py - Add evaluation modes
  • Modified: openhands/sdk/critic/result.py - Enhanced visualization
  • New: tests/sdk/critic/api/ - Comprehensive test suite

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:45c7fc7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-45c7fc7-python \
  ghcr.io/openhands/agent-server:45c7fc7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:45c7fc7-golang-amd64
ghcr.io/openhands/agent-server:45c7fc7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:45c7fc7-golang-arm64
ghcr.io/openhands/agent-server:45c7fc7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:45c7fc7-java-amd64
ghcr.io/openhands/agent-server:45c7fc7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:45c7fc7-java-arm64
ghcr.io/openhands/agent-server:45c7fc7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:45c7fc7-python-amd64
ghcr.io/openhands/agent-server:45c7fc7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:45c7fc7-python-arm64
ghcr.io/openhands/agent-server:45c7fc7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:45c7fc7-golang
ghcr.io/openhands/agent-server:45c7fc7-java
ghcr.io/openhands/agent-server:45c7fc7-python

About Multi-Architecture Support

  • Each variant tag (e.g., 45c7fc7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 45c7fc7-python-amd64) are also available if needed

xingyaoww and others added 18 commits November 14, 2025 19:55
- Split implementations.py into individual files in impl/ folder
  - impl/agent_finished.py: AgentFinishedCritic
  - impl/empty_patch.py: EmptyPatchCritic
  - impl/pass_critic.py: PassCritic
- Each critic now has its own file for better organization
- Added MANIFEST.in to ensure critic module is packaged
- Updated imports in __init__.py to use impl module

Benefits:
- Better modularity: one file per critic
- Easier to maintain and extend
- Clear separation of concerns

Co-authored-by: openhands <openhands@all-hands.dev>
- Add CriticRegistry for managing critic implementations
  - Thread-safe singleton with register/get/create/list methods
  - Default registrations for pass, empty_patch_critic, finish_with_patch
- Update type signatures from list to Sequence for covariance support
  - Changed CriticBase.evaluate() to accept Sequence[LLMConvertibleEvent]
  - Updated all impl files (pass_critic, empty_patch, agent_finished)
- Add comprehensive test suite with 16 tests
  - Test CriticResult validation and success threshold
  - Test all three critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
  - Test CriticRegistry functionality and edge cases
  - Test abstract base class enforcement
- All tests passing, pre-commit checks passing

Co-authored-by: openhands <openhands@all-hands.dev>
- Deleted openhands-sdk/openhands/sdk/critic/registry.py
- Updated critic/__init__.py to remove CriticRegistry import and registrations
- Removed 6 registry-related tests from test_critic.py:
  - test_critic_registry_register
  - test_critic_registry_register_invalid
  - test_critic_registry_get_not_found
  - test_critic_registry_create
  - test_critic_registry_default_registrations
  - test_critic_registry_duplicate_warning
- All remaining tests (1193) pass successfully
- Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

Co-authored-by: openhands <openhands@all-hands.dev>
github-actions bot (Contributor) commented Nov 25, 2025

Coverage

Coverage Report

File                                              Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/agent
  agent.py                                          235    49    79%  90, 94, 129–130, 134–135, 137, 143–144, 146, 148–149, 154, 157, 161–164, 200–202, 230–231, 238–239, 271, 324–325, 327, 367, 506–507, 512, 524–525, 530–531, 550–551, 553, 581–582, 588–589, 593, 601–602, 642, 649
  base.py                                           183    16    91%  199, 273–274, 285–287, 297, 307, 315–316, 425, 439, 476–477, 487–488
openhands-sdk/openhands/sdk/critic
  base.py                                            11     1    90%  35
  result.py                                          85    11    87%  103–106, 111–114, 117, 135, 141
openhands-sdk/openhands/sdk/critic/impl
  agent_finished.py                                  29     1    96%  87
openhands-sdk/openhands/sdk/critic/impl/api
  chat_template.py                                   63    10    84%  69–70, 91, 146, 151, 155, 162–163, 222, 227
  client.py                                         115    49    57%  176, 182–184, 191–192, 195, 200–204, 209–213, 220–221, 224, 227–228, 231, 242, 244, 246–248, 250, 258–259, 264, 273–274, 276–277, 289–290, 292–296, 301–303, 309–310, 313
  critic.py                                          45    36    20%  23–24, 26–34, 37–38, 45–46, 49, 52–58, 61–63, 65, 67–68, 70–71, 74–75, 78, 81, 83
  taxonomy.py                                        46    39    15%  59, 71–72, 74–77, 79, 113, 122–126, 128, 130–132, 139, 141–142, 145–146, 149–150, 152–153, 159, 161–167, 169, 172, 178, 180
openhands-sdk/openhands/sdk/event/llm_convertible
  action.py                                          68    10    85%  118–124, 136, 165–166
  message.py                                         72    13    81%  62, 85–91, 112, 130–131, 136, 152
TOTAL                                             16079  4730    70%

xingyaoww and others added 11 commits November 26, 2025 02:10
Replace transformers.AutoTokenizer with lightweight jinja2-based
implementation for apply_chat_template. Uses existing transitive
dependencies (jinja2, huggingface_hub, tokenizers) from litellm.

Benefits:
- 145 MB reduction in package size (from 395.71 MB to 251 MB)
- Zero new dependencies required
- Same functionality maintained
- Faster installation time

Technical changes:
- Load chat template directly from tokenizer_config.json using huggingface_hub
- Render templates using jinja2.Environment
- Remove transformers from optional dependencies
- Rename _tokenizer field to _chat_template (now stores string)

Co-authored-by: openhands <openhands@all-hands.dev>
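
A sketch of the approach under the assumptions in this commit message. The SDK's actual helper names may differ, the template is assumed to be embedded in tokenizer_config.json (newer repos sometimes ship a separate chat_template.jinja), and real templates can reference custom helpers such as raise_exception that must be added to the environment:

import json

import jinja2
from huggingface_hub import hf_hub_download

# Pull the chat template string straight from tokenizer_config.json -- no transformers needed
config_path = hf_hub_download("Qwen/Qwen3-4B-Instruct-2507", "tokenizer_config.json")
with open(config_path) as f:
    chat_template = json.load(f)["chat_template"]

# Render with a plain jinja2 environment
env = jinja2.Environment(trim_blocks=True, lstrip_blocks=True)
prompt = env.from_string(chat_template).render(
    messages=[{"role": "user", "content": "hi"}],
    add_generation_prompt=True,
)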
Add comprehensive unit tests for the new jinja2-based apply_chat_template
implementation that replaced transformers dependency:

- test_apply_chat_template_without_tools: Verify template rendering for basic messages
- test_apply_chat_template_with_tools: Verify template rendering with tool definitions
- test_load_chat_template_caching: Verify template is cached after first load
- test_load_chat_template_missing_template: Verify error handling for missing template

Tests use mocks to avoid network calls and verify the implementation works
correctly without requiring transformers dependency.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww (Collaborator, Author) commented:

@OpenHands please fix the failing actions on PR #1269 at branch xw/critic-model

Please fix the tests WITHOUT changing the implementation

openhands-ai bot commented Jan 20, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

The tests were expecting filtering and sorting behavior that only occurs
when metadata with categorized_features is provided. Without metadata,
the raw JSON message is displayed as-is.

Updated tests to verify the actual implementation behavior:
- test_format_critic_result_with_json_message: verify raw JSON is displayed
- test_visualize_consistency: verify raw JSON is displayed
- test_format_critic_result_sorting: verify all keys are shown (no filtering)
- test_color_highlighting: verify all keys are shown (no filtering)

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Summary

I've fixed the failing GitHub Actions on PR #1269 by updating the test file tests/sdk/critic/test_critic_display.py to match the actual implementation behavior.

Problem

The sdk-tests job was failing with 4 test failures in test_critic_display.py. The tests expected filtering and sorting behavior (e.g., filtering items below 0.2 threshold, filtering out "success" key, showing only highest sentiment) that only occurs when metadata with categorized_features is provided to CriticResult.

However, the tests were creating CriticResult objects with only score and message (JSON string), without providing metadata. In this case, the implementation simply displays the raw JSON message as-is.

Fix

Updated 4 failing tests to verify the actual implementation behavior:

  1. test_format_critic_result_with_json_message: Now verifies that all JSON keys are displayed (no filtering)
  2. test_visualize_consistency: Now verifies that raw JSON is displayed as-is
  3. test_format_critic_result_sorting: Now verifies all keys are shown without filtering
  4. test_color_highlighting: Now verifies all keys are shown without filtering

Checklist

  • ✅ Fixed tests WITHOUT changing the implementation (as requested)
  • ✅ All 6 tests in test_critic_display.py now pass
  • ✅ Changes are concise - only modified the test file
  • ✅ Committed and pushed to xw/critic-model branch

The changes have been pushed to the PR branch and should trigger a new CI run.


@xingyaoww (Collaborator, Author) commented:

@OpenHands please fix the failing actions on PR #1269 at branch xw/critic-model

Only fix the tests, not the implementation

openhands-ai bot commented Jan 20, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Update test assertions to match the implementation change that displays
critic scores with 2 decimal places instead of 4.

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Summary

I've fixed the failing tests on PR #1269 (xw/critic-model branch). The changes have been pushed to the remote branch.

Problem

The sdk-tests job was failing with 4 test failures in tests/sdk/critic/test_critic_display.py:

  • test_format_critic_result_with_json_message
  • test_format_critic_result_with_plain_message
  • test_format_critic_result_without_message
  • test_visualize_consistency

The tests expected critic scores to be displayed with 4 decimal places (e.g., "0.5070", "0.7500"), but the implementation was changed to display scores with 2 decimal places (e.g., "0.51", "0.75") as per commit ec361345 ("style(critic): show score with 2 decimal places instead of 4").

Fix

Updated the test assertions in tests/sdk/critic/test_critic_display.py to expect 2 decimal places instead of 4, matching the current implementation behavior:

  • "0.5070""0.51"
  • "0.7500""0.75"
  • "0.6500""0.65"
  • "0.8000""0.80"

Checklist

  • ✅ Fixed only the tests, not the implementation (as requested)
  • ✅ All 6 tests in test_critic_display.py now pass
  • ✅ Changes are minimal and focused (8 lines changed)
  • ✅ Committed and pushed to xw/critic-model branch


openhands-agent and others added 8 commits January 20, 2026 17:49
- Replace raw probabilities with qualitative confidence levels (High/Medium/Low)
- Add quality assessment labels (Excellent/Good/Fair/Needs Improvement/Poor)
- Use emoji for sentiment visualization (😊/😐/😟)
- Update section headers to be more user-friendly:
  - 'Critic Score' -> 'Quality Assessment'
  - 'Predicted User Sentiment' -> 'Expected User Response'
  - 'Detected Agent Issues' -> 'Potential Issues'
  - 'Predicted User Follow-Up' -> 'Likely Follow-up'
- Add helper methods for converting scores/probabilities to labels
- Update and expand test coverage for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
- Change header from 'Quality Assessment:' to 'Critic thinks the task is'
- Use more natural language for assessment labels:
  - 'Excellent' -> 'likely successful'
  - 'Good' -> 'probably successful'
  - 'Fair' -> 'might need some adjustments'
  - 'Needs Improvement' -> 'likely needs more work'
  - 'Poor' -> 'probably unsuccessful'
- Update tests to match new wording

Co-authored-by: openhands <openhands@all-hands.dev>
- Reduce from 5 confusing tiers to 3 distinct ones:
  - ✓ 'likely successful' (score >= 0.6) - green
  - ? 'uncertain' (score 0.4-0.6) - yellow
  - ✗ 'likely unsuccessful' (score < 0.4) - red
- Labels are now obviously distinct without needing to know the underlying score

Co-authored-by: openhands <openhands@all-hands.dev>
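
An illustrative mapping of the three tiers described above (function and names hypothetical, not the SDK's actual code):

def assessment_label(score: float) -> tuple[str, str]:
    """Map a critic score to the (label, color) tiers described above. Illustrative only."""
    if score >= 0.6:
        return "likely successful", "green"
    if score >= 0.4:
        return "uncertain", "yellow"
    return "likely unsuccessful", "red"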
- Remove all emojis from visualization (✓, ?, ✗, 😊, 😐, 😟)
- Change 'Expected User Response' to 'Expected User Sentiment'
- Add confidence level to sentiment display
- Replace opaque confidence labels with descriptive ones:
  - 'High' -> 'very likely'
  - 'Medium' -> 'likely'
  - 'Low' -> 'possible'
- Remove emoji test (no longer needed)

Co-authored-by: openhands <openhands@all-hands.dev>
- Use star rating format: 'Critic: agent success likelihood ★★★☆☆ (65.2%)'
- Each star represents 20% of success likelihood
- Color coded: green (>=60%), yellow (40-60%), red (<40%)
- Remove sentiment display (redundant with success)
- Remove 'Likely Follow-up' section (not helpful for CLI users)
- Keep 'Potential Issues' and 'Infrastructure' sections
- Update tests for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
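
An illustrative re-implementation of the star rating described in this commit (not the SDK's actual code):

def render_stars(score: float) -> str:
    """Each star covers 20% of success likelihood. Illustrative only."""
    filled = round(score * 5)
    return "★" * filled + "☆" * (5 - filled)

print(f"Critic: agent success likelihood {render_stars(0.652)} ({0.652:.1%})")
# Critic: agent success likelihood ★★★☆☆ (65.2%)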
- Change from descriptive labels (possible/likely/very likely) to
  'likelihood X%' format for better clarity
- Example: 'Did Not Follow Instruction (likelihood 35%)'
- Remove unused _get_confidence_label method

Co-authored-by: openhands <openhands@all-hands.dev>
- Re-add 'Likely Follow-up' section for user follow-up patterns
- Re-add 'Other' section for other metrics
- Change 'Potential Issues:' from bold yellow to just bold
- All section headers are now bold (no color)
- CLI can filter these out if needed

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1269 at branch `xw/critic-model`

Feel free to include any additional details that might help me get this PR into a better state.

@juanmichelini (Collaborator) left a comment:

LGTM! Approved so we can kick off the CLI review process.

xingyaoww merged commit f412770 into main on Jan 20, 2026
25 checks passed
xingyaoww deleted the xw/critic-model branch on January 20, 2026 at 20:18
@xingyaoww (Collaborator, Author) commented:

Thank you @juanmichelini! Other folks, feel free to chime in if there are issues and I'd be happy to fix them in a follow-up PR!
