Conversation

@xingyaoww (Collaborator) commented Nov 25, 2025

Summary

This PR introduces an experimental API-based critic system that enables real-time evaluation of agent actions and messages using an external LLM. The critic provides quality scores and feedback to help assess agent performance.

Example: uv run examples/01_standalone_sdk/34_critic_example.py

Key Features

1. API-Based Critic Implementation

  • APIBasedCritic: New critic that calls an external LLM API to evaluate agent actions
  • CriticClient: HTTP client for secure communication with critic server
  • ChatTemplateRenderer: Renders conversation history into Qwen3-4B-Instruct-2507 chat template format
  • Supports custom critic server URL, API key, and model configuration

2. Agent Integration

  • Added critic parameter to AgentBase for optional critic integration
  • Two evaluation modes (see the sketch after this list):
    • finish_and_message (default): Evaluate on FinishAction and agent MessageEvent
    • all_actions: Evaluate after every agent action
  • Critic results attached to ActionEvent and MessageEvent for visualization
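
A hedged sketch of how mode selection might look. This summary does not show the actual keyword for choosing a mode, so `evaluation_mode` below is an assumed parameter name, not confirmed by the PR:

# Sketch only: `evaluation_mode` is a hypothetical name for the mode-selection parameter
from openhands.sdk.critic import APIBasedCritic

critic = APIBasedCritic(
    server_url="https://your-critic-server.com/vllm",
    api_key="your-api-key",
    model_name="critic",
    evaluation_mode="all_actions",  # assumed keyword; the default mode is finish_and_message
)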

3. Enhanced Result Visualization

  • CriticResult now includes rich text formatting with color-coded probability displays
  • Supports both JSON probability messages and plain text feedback
  • Automatic sorting of probabilities by value for easier interpretation

4. Example and Documentation

  • Added examples/01_standalone_sdk/34_critic_example.py demonstrating critic usage
  • Auto-configuration support for All-Hands LLM proxy with /vllm endpoint
  • Comprehensive test coverage for template rendering, client validation, and display formatting

Usage Example

from openhands.sdk import Agent, LLM
from openhands.sdk.critic import APIBasedCritic

# Configure critic
critic = APIBasedCritic(
    server_url="https://your-critic-server.com/vllm",
    api_key="your-api-key",
    model_name="critic",
)

# Create agent with critic
agent = Agent(
    llm=LLM(model="anthropic/claude-sonnet-4-5", api_key="..."),
    tools=[...],
    critic=critic,  # Enable real-time evaluation
)
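
For context, a hedged continuation showing how the agent might then be driven. The Conversation constructor may take additional or different arguments; the point is that critic results then appear attached to the emitted events:

import os

from openhands.sdk import Conversation

# Sketch: run a task; each agent ActionEvent/MessageEvent then carries an attached CriticResult
conversation = Conversation(agent=agent, workspace=os.getcwd())
conversation.send_message("Write a script that prints the first 10 Fibonacci numbers.")
conversation.run()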

Technical Details

Chat Template Rendering

  • Implements Qwen3-4B-Instruct-2507 template format
  • Supports system messages, multi-turn conversations, and tool calls
  • Handles edge cases like empty messages and special characters
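
For orientation, a simplified illustration of the ChatML-style markup that Qwen chat templates produce; the real Qwen3-4B-Instruct-2507 template also covers tool calls and the edge cases listed above:

# Simplified illustration of Qwen's ChatML-style markup (<|im_start|>/<|im_end|> tags);
# the real template also handles tool calls, empty messages, and special characters.
messages = [
    {"role": "system", "content": "You are a critic model."},
    {"role": "user", "content": "Evaluate the agent's last action."},
]
# Rendered output looks roughly like:
# <|im_start|>system
# You are a critic model.<|im_end|>
# <|im_start|>user
# Evaluate the agent's last action.<|im_end|>
# <|im_start|>assistant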

Security

  • API keys stored as SecretStr to prevent accidental exposure
  • Validation ensures non-empty API keys
  • Keys not exposed in string representations
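
A minimal sketch of the SecretStr pattern described above, using an illustrative model rather than the SDK's actual class:

from pydantic import BaseModel, SecretStr, field_validator

class CriticClientConfig(BaseModel):  # hypothetical name, for illustration only
    api_key: SecretStr

    @field_validator("api_key")
    @classmethod
    def _non_empty(cls, v: SecretStr) -> SecretStr:
        # Reject empty or whitespace-only keys at construction time
        if not v.get_secret_value().strip():
            raise ValueError("API key must be non-empty")
        return v

cfg = CriticClientConfig(api_key="sk-real-key")
print(cfg)  # api_key=SecretStr('**********') -- the key is masked in str/repr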

Testing

  • 507 test cases for chat template rendering (regression tests against transformers)
  • Client API key validation tests
  • Display formatting and color highlighting tests

⚠️ Experimental Status

This feature is highly experimental and subject to change:

  • API and configuration may evolve based on feedback
  • Performance impact of real-time evaluation needs further assessment
  • Critic model quality and reliability still being evaluated
  • Not recommended for production use without thorough testing

Files Changed

  • New: openhands/sdk/critic/impl/api/ - API-based critic implementation
  • New: examples/01_standalone_sdk/34_critic_example.py - Usage example
  • Modified: openhands/sdk/agent/agent.py - Critic integration
  • Modified: openhands/sdk/agent/base.py - Add critic parameter
  • Modified: openhands/sdk/critic/base.py - Add evaluation modes
  • Modified: openhands/sdk/critic/result.py - Enhanced visualization
  • New: tests/sdk/critic/api/ - Comprehensive test suite

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:45c7fc7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-45c7fc7-python \
  ghcr.io/openhands/agent-server:45c7fc7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:45c7fc7-golang-amd64
ghcr.io/openhands/agent-server:45c7fc7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:45c7fc7-golang-arm64
ghcr.io/openhands/agent-server:45c7fc7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:45c7fc7-java-amd64
ghcr.io/openhands/agent-server:45c7fc7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:45c7fc7-java-arm64
ghcr.io/openhands/agent-server:45c7fc7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:45c7fc7-python-amd64
ghcr.io/openhands/agent-server:45c7fc7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:45c7fc7-python-arm64
ghcr.io/openhands/agent-server:45c7fc7-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:45c7fc7-golang
ghcr.io/openhands/agent-server:45c7fc7-java
ghcr.io/openhands/agent-server:45c7fc7-python

About Multi-Architecture Support

  • Each variant tag (e.g., 45c7fc7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 45c7fc7-python-amd64) are also available if needed

xingyaoww and others added 18 commits November 14, 2025 19:55
- Split implementations.py into individual files in impl/ folder
  - impl/agent_finished.py: AgentFinishedCritic
  - impl/empty_patch.py: EmptyPatchCritic
  - impl/pass_critic.py: PassCritic
- Each critic now has its own file for better organization
- Added MANIFEST.in to ensure critic module is packaged
- Updated imports in __init__.py to use impl module

Benefits:
- Better modularity: one file per critic
- Easier to maintain and extend
- Clear separation of concerns

Co-authored-by: openhands <openhands@all-hands.dev>
- Add CriticRegistry for managing critic implementations
  - Thread-safe singleton with register/get/create/list methods
  - Default registrations for pass, empty_patch_critic, finish_with_patch
- Update type signatures from list to Sequence for covariance support
  - Changed CriticBase.evaluate() to accept Sequence[LLMConvertibleEvent]
  - Updated all impl files (pass_critic, empty_patch, agent_finished)
- Add comprehensive test suite with 16 tests
  - Test CriticResult validation and success threshold
  - Test all three critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
  - Test CriticRegistry functionality and edge cases
  - Test abstract base class enforcement
- All tests passing, pre-commit checks passing

Co-authored-by: openhands <openhands@all-hands.dev>
- Deleted openhands-sdk/openhands/sdk/critic/registry.py
- Updated critic/__init__.py to remove CriticRegistry import and registrations
- Removed 6 registry-related tests from test_critic.py:
  - test_critic_registry_register
  - test_critic_registry_register_invalid
  - test_critic_registry_get_not_found
  - test_critic_registry_create
  - test_critic_registry_default_registrations
  - test_critic_registry_duplicate_warning
- All remaining tests (1193) pass successfully
- Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

Co-authored-by: openhands <openhands@all-hands.dev>
github-actions bot (Contributor) commented Nov 25, 2025

Coverage

Coverage Report

File                                              Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/agent
  agent.py                                          235    49    79%  90, 94, 129–130, 134–135, 137, 143–144, 146, 148–149, 154, 157, 161–164, 200–202, 230–231, 238–239, 271, 324–325, 327, 367, 506–507, 512, 524–525, 530–531, 550–551, 553, 581–582, 588–589, 593, 601–602, 642, 649
  base.py                                           183    16    91%  199, 273–274, 285–287, 297, 307, 315–316, 425, 439, 476–477, 487–488
openhands-sdk/openhands/sdk/critic
  base.py                                            11     1    90%  35
  result.py                                          85    11    87%  103–106, 111–114, 117, 135, 141
openhands-sdk/openhands/sdk/critic/impl
  agent_finished.py                                  29     1    96%  87
openhands-sdk/openhands/sdk/critic/impl/api
  chat_template.py                                   63    10    84%  69–70, 91, 146, 151, 155, 162–163, 222, 227
  client.py                                         115    49    57%  176, 182–184, 191–192, 195, 200–204, 209–213, 220–221, 224, 227–228, 231, 242, 244, 246–248, 250, 258–259, 264, 273–274, 276–277, 289–290, 292–296, 301–303, 309–310, 313
  critic.py                                          45    36    20%  23–24, 26–34, 37–38, 45–46, 49, 52–58, 61–63, 65, 67–68, 70–71, 74–75, 78, 81, 83
  taxonomy.py                                        46    39    15%  59, 71–72, 74–77, 79, 113, 122–126, 128, 130–132, 139, 141–142, 145–146, 149–150, 152–153, 159, 161–167, 169, 172, 178, 180
openhands-sdk/openhands/sdk/event/llm_convertible
  action.py                                          68    10    85%  118–124, 136, 165–166
  message.py                                         72    13    81%  62, 85–91, 112, 130–131, 136, 152
TOTAL                                             16079  4730    70%

xingyaoww and others added 11 commits November 26, 2025 02:10
Replace transformers.AutoTokenizer with lightweight jinja2-based
implementation for apply_chat_template. Uses existing transitive
dependencies (jinja2, huggingface_hub, tokenizers) from litellm.

Benefits:
- 145 MB reduction in package size (from 395.71 MB to 251 MB)
- Zero new dependencies required
- Same functionality maintained
- Faster installation time

Technical changes:
- Load chat template directly from tokenizer_config.json using huggingface_hub
- Render templates using jinja2.Environment
- Remove transformers from optional dependencies
- Rename _tokenizer field to _chat_template (now stores string)

Co-authored-by: openhands <openhands@all-hands.dev>
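
A sketch of the approach under the assumptions in this commit message. The SDK's actual helper names may differ, the template is assumed to be embedded in tokenizer_config.json (newer repos sometimes ship a separate chat_template.jinja), and real templates can reference custom helpers such as raise_exception that must be added to the environment:

import json

import jinja2
from huggingface_hub import hf_hub_download

# Pull the chat template string straight from tokenizer_config.json -- no transformers needed
config_path = hf_hub_download("Qwen/Qwen3-4B-Instruct-2507", "tokenizer_config.json")
with open(config_path) as f:
    chat_template = json.load(f)["chat_template"]

# Render with a plain jinja2 environment
env = jinja2.Environment(trim_blocks=True, lstrip_blocks=True)
prompt = env.from_string(chat_template).render(
    messages=[{"role": "user", "content": "hi"}],
    add_generation_prompt=True,
)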
Add comprehensive unit tests for the new jinja2-based apply_chat_template
implementation that replaced transformers dependency:

- test_apply_chat_template_without_tools: Verify template rendering for basic messages
- test_apply_chat_template_with_tools: Verify template rendering with tool definitions
- test_load_chat_template_caching: Verify template is cached after first load
- test_load_chat_template_missing_template: Verify error handling for missing template

Tests use mocks to avoid network calls and verify the implementation works
correctly without requiring transformers dependency.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww (Collaborator, Author) commented:

@OpenHands please fix the failing actions on PR #1269 at branch xw/critic-model

Please fix the tests WITHOUT changing the implementation

openhands-ai bot commented Jan 20, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

The tests were expecting filtering and sorting behavior that only occurs
when metadata with categorized_features is provided. Without metadata,
the raw JSON message is displayed as-is.

Updated tests to verify the actual implementation behavior:
- test_format_critic_result_with_json_message: verify raw JSON is displayed
- test_visualize_consistency: verify raw JSON is displayed
- test_format_critic_result_sorting: verify all keys are shown (no filtering)
- test_color_highlighting: verify all keys are shown (no filtering)

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Summary

I've fixed the failing GitHub Actions on PR #1269 by updating the test file tests/sdk/critic/test_critic_display.py to match the actual implementation behavior.

Problem

The sdk-tests job was failing with 4 test failures in test_critic_display.py. The tests expected filtering and sorting behavior (e.g., filtering items below 0.2 threshold, filtering out "success" key, showing only highest sentiment) that only occurs when metadata with categorized_features is provided to CriticResult.

However, the tests were creating CriticResult objects with only score and message (JSON string), without providing metadata. In this case, the implementation simply displays the raw JSON message as-is.

Fix

Updated 4 failing tests to verify the actual implementation behavior:

  1. test_format_critic_result_with_json_message: Now verifies that all JSON keys are displayed (no filtering)
  2. test_visualize_consistency: Now verifies that raw JSON is displayed as-is
  3. test_format_critic_result_sorting: Now verifies all keys are shown without filtering
  4. test_color_highlighting: Now verifies all keys are shown without filtering

Checklist

  • ✅ Fixed tests WITHOUT changing the implementation (as requested)
  • ✅ All 6 tests in test_critic_display.py now pass
  • ✅ Changes are concise - only modified the test file
  • ✅ Committed and pushed to xw/critic-model branch

The changes have been pushed to the PR branch and should trigger a new CI run.


@xingyaoww (Collaborator, Author) commented:

@OpenHands please fix the failing actions on PR #1269 at branch xw/critic-model

Only fix the tests, not the implementation

openhands-ai bot commented Jan 20, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Update test assertions to match the implementation change that displays
critic scores with 2 decimal places instead of 4.

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Summary

I've fixed the failing tests on PR #1269 (xw/critic-model branch). The changes have been pushed to the remote branch.

Problem

The sdk-tests job was failing with 4 test failures in tests/sdk/critic/test_critic_display.py:

  • test_format_critic_result_with_json_message
  • test_format_critic_result_with_plain_message
  • test_format_critic_result_without_message
  • test_visualize_consistency

The tests expected critic scores to be displayed with 4 decimal places (e.g., "0.5070", "0.7500"), but the implementation was changed to display scores with 2 decimal places (e.g., "0.51", "0.75") as per commit ec361345 ("style(critic): show score with 2 decimal places instead of 4").

Fix

Updated the test assertions in tests/sdk/critic/test_critic_display.py to expect 2 decimal places instead of 4, matching the current implementation behavior:

  • "0.5070""0.51"
  • "0.7500""0.75"
  • "0.6500""0.65"
  • "0.8000""0.80"

Checklist

  • ✅ Fixed only the tests, not the implementation (as requested)
  • ✅ All 6 tests in test_critic_display.py now pass
  • ✅ Changes are minimal and focused (8 lines changed)
  • ✅ Committed and pushed to xw/critic-model branch


openhands-agent and others added 8 commits January 20, 2026 17:49
- Replace raw probabilities with qualitative confidence levels (High/Medium/Low)
- Add quality assessment labels (Excellent/Good/Fair/Needs Improvement/Poor)
- Use emoji for sentiment visualization (😊/😐/😟)
- Update section headers to be more user-friendly:
  - 'Critic Score' -> 'Quality Assessment'
  - 'Predicted User Sentiment' -> 'Expected User Response'
  - 'Detected Agent Issues' -> 'Potential Issues'
  - 'Predicted User Follow-Up' -> 'Likely Follow-up'
- Add helper methods for converting scores/probabilities to labels
- Update and expand test coverage for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
- Change header from 'Quality Assessment:' to 'Critic thinks the task is'
- Use more natural language for assessment labels:
  - 'Excellent' -> 'likely successful'
  - 'Good' -> 'probably successful'
  - 'Fair' -> 'might need some adjustments'
  - 'Needs Improvement' -> 'likely needs more work'
  - 'Poor' -> 'probably unsuccessful'
- Update tests to match new wording

Co-authored-by: openhands <openhands@all-hands.dev>
- Reduce from 5 confusing tiers to 3 distinct ones:
  - ✓ 'likely successful' (score >= 0.6) - green
  - ? 'uncertain' (score 0.4-0.6) - yellow
  - ✗ 'likely unsuccessful' (score < 0.4) - red
- Labels are now obviously distinct without needing to know the underlying score

Co-authored-by: openhands <openhands@all-hands.dev>
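
An illustrative mapping of the three tiers described above (function and names hypothetical, not the SDK's actual code):

def assessment_label(score: float) -> tuple[str, str]:
    """Map a critic score to the (label, color) tiers described above. Illustrative only."""
    if score >= 0.6:
        return "likely successful", "green"
    if score >= 0.4:
        return "uncertain", "yellow"
    return "likely unsuccessful", "red"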
- Remove all emojis from visualization (✓, ?, ✗, 😊, 😐, 😟)
- Change 'Expected User Response' to 'Expected User Sentiment'
- Add confidence level to sentiment display
- Replace opaque confidence labels with descriptive ones:
  - 'High' -> 'very likely'
  - 'Medium' -> 'likely'
  - 'Low' -> 'possible'
- Remove emoji test (no longer needed)

Co-authored-by: openhands <openhands@all-hands.dev>
- Use star rating format: 'Critic: agent success likelihood ★★★☆☆ (65.2%)'
- Each star represents 20% of success likelihood
- Color coded: green (>=60%), yellow (40-60%), red (<40%)
- Remove sentiment display (redundant with success)
- Remove 'Likely Follow-up' section (not helpful for CLI users)
- Keep 'Potential Issues' and 'Infrastructure' sections
- Update tests for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
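
An illustrative re-implementation of the star rating described in this commit (not the SDK's actual code):

def render_stars(score: float) -> str:
    """Each star covers 20% of success likelihood. Illustrative only."""
    filled = round(score * 5)
    return "★" * filled + "☆" * (5 - filled)

print(f"Critic: agent success likelihood {render_stars(0.652)} ({0.652:.1%})")
# Critic: agent success likelihood ★★★☆☆ (65.2%)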
- Change from descriptive labels (possible/likely/very likely) to
  'likelihood X%' format for better clarity
- Example: 'Did Not Follow Instruction (likelihood 35%)'
- Remove unused _get_confidence_label method

Co-authored-by: openhands <openhands@all-hands.dev>
- Re-add 'Likely Follow-up' section for user follow-up patterns
- Re-add 'Other' section for other metrics
- Change 'Potential Issues:' from bold yellow to just bold
- All section headers are now bold (no color)
- CLI can filter these out if needed

Co-authored-by: openhands <openhands@all-hands.dev>
openhands-ai bot commented Jan 20, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1269 at branch `xw/critic-model`

Feel free to include any additional details that might help me get this PR into a better state.

@juanmichelini (Collaborator) left a comment:

LGTM! Approved so we can kick off the CLI review process.

xingyaoww merged commit f412770 into main on Jan 20, 2026
25 checks passed
xingyaoww deleted the xw/critic-model branch on January 20, 2026 at 20:18
@xingyaoww (Collaborator, Author) commented:

Thank you @juanmichelini! Other folks, feel free to chime in if there are issues and I'd be happy to fix them in a follow-up PR!
