Add API-Based Critic for Real-Time Agent Action Evaluation (Experimental) #1269
Conversation
Split implementations.py into individual files in impl/ folder:

- impl/agent_finished.py: AgentFinishedCritic
- impl/empty_patch.py: EmptyPatchCritic
- impl/pass_critic.py: PassCritic
- Each critic now has its own file for better organization
- Added MANIFEST.in to ensure critic module is packaged
- Updated imports in __init__.py to use impl module

Benefits:

- Better modularity: one file per critic
- Easier to maintain and extend
- Clear separation of concerns

Co-authored-by: openhands <openhands@all-hands.dev>
Add CriticRegistry for managing critic implementations:

- Thread-safe singleton with register/get/create/list methods
- Default registrations for pass, empty_patch_critic, finish_with_patch

Update type signatures from list to Sequence for covariance support:

- Changed CriticBase.evaluate() to accept Sequence[LLMConvertibleEvent]
- Updated all impl files (pass_critic, empty_patch, agent_finished)

Add comprehensive test suite with 16 tests:

- Test CriticResult validation and success threshold
- Test all three critic implementations (PassCritic, EmptyPatchCritic, AgentFinishedCritic)
- Test CriticRegistry functionality and edge cases
- Test abstract base class enforcement

All tests passing, pre-commit checks passing.

Co-authored-by: openhands <openhands@all-hands.dev>
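The list-to-Sequence change in this commit is about variance: a type checker rejects passing a `list` of a subclass where a `list` of the base class is expected, but accepts it for `Sequence`, which is covariant in its element type. A hedged sketch with toy stand-in classes (not the SDK's actual event types):

```python
from collections.abc import Sequence

# Toy stand-ins for the SDK's event hierarchy (illustrative only)
class LLMConvertibleEvent: ...
class ActionEvent(LLMConvertibleEvent): ...

def evaluate_invariant(events: list[LLMConvertibleEvent]) -> int:
    return len(events)

def evaluate_covariant(events: Sequence[LLMConvertibleEvent]) -> int:
    return len(events)

actions: list[ActionEvent] = [ActionEvent(), ActionEvent()]

# evaluate_invariant(actions)  # flagged by pyright: list is invariant
n = evaluate_covariant(actions)  # accepted: Sequence is covariant
```

Both calls behave identically at runtime; the difference only shows up under a static type checker such as pyright, which the repo's pre-commit checks run.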
Remove CriticRegistry:

- Deleted openhands-sdk/openhands/sdk/critic/registry.py
- Updated critic/__init__.py to remove CriticRegistry import and registrations
- Removed 6 registry-related tests from test_critic.py:
  - test_critic_registry_register
  - test_critic_registry_register_invalid
  - test_critic_registry_get_not_found
  - test_critic_registry_create
  - test_critic_registry_default_registrations
  - test_critic_registry_duplicate_warning
- All remaining tests (1193) pass successfully
- Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)

Co-authored-by: openhands <openhands@all-hands.dev>
Replace transformers.AutoTokenizer with a lightweight jinja2-based implementation for apply_chat_template. Uses existing transitive dependencies (jinja2, huggingface_hub, tokenizers) from litellm.

Benefits:

- 145 MB reduction in package size (from 395.71 MB to 251 MB)
- Zero new dependencies required
- Same functionality maintained
- Faster installation time

Technical changes:

- Load chat template directly from tokenizer_config.json using huggingface_hub
- Render templates using jinja2.Environment
- Remove transformers from optional dependencies
- Rename _tokenizer field to _chat_template (now stores string)

Co-authored-by: openhands <openhands@all-hands.dev>
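A minimal sketch of the approach this commit describes, assuming the `tokenizer_config.json` has already been fetched (via huggingface_hub in the real flow) and parsed into a dict. Function names and the demo template are illustrative, not the SDK's internals:

```python
import jinja2

def load_chat_template(tokenizer_config: dict) -> str:
    # tokenizer_config.json carries the template under the "chat_template" key
    template = tokenizer_config.get("chat_template")
    if not template:
        raise ValueError("tokenizer_config.json does not define a chat_template")
    return template

def apply_chat_template(template_str: str, messages: list[dict],
                        add_generation_prompt: bool = True) -> str:
    # Render the chat template with plain jinja2, no transformers needed
    env = jinja2.Environment(trim_blocks=True, lstrip_blocks=True)
    return env.from_string(template_str).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )

# Tiny illustrative template (real chat templates are far more elaborate)
demo_template = (
    "{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}\n{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)
rendered = apply_chat_template(demo_template, [{"role": "user", "content": "hi"}])
```

Caching the loaded template string on first use (the `_chat_template` field mentioned above) avoids re-fetching the config on every call.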
Add comprehensive unit tests for the new jinja2-based apply_chat_template implementation that replaced the transformers dependency:

- test_apply_chat_template_without_tools: Verify template rendering for basic messages
- test_apply_chat_template_with_tools: Verify template rendering with tool definitions
- test_load_chat_template_caching: Verify template is cached after first load
- test_load_chat_template_missing_template: Verify error handling for missing template

Tests use mocks to avoid network calls and verify the implementation works correctly without requiring the transformers dependency.

Co-authored-by: openhands <openhands@all-hands.dev>
…ation" This reverts commit 7fc6b5c.
This reverts commit d509c15.
@OpenHands please fix the failing actions on PR #1269 at branch. Please fix the tests WITHOUT changing the implementation.

I'm on it! xingyaoww can track my progress at all-hands.dev
The tests were expecting filtering and sorting behavior that only occurs when metadata with categorized_features is provided. Without metadata, the raw JSON message is displayed as-is.

Updated tests to verify the actual implementation behavior:

- test_format_critic_result_with_json_message: verify raw JSON is displayed
- test_visualize_consistency: verify raw JSON is displayed
- test_format_critic_result_sorting: verify all keys are shown (no filtering)
- test_color_highlighting: verify all keys are shown (no filtering)

Co-authored-by: openhands <openhands@all-hands.dev>
Summary

I've fixed the failing GitHub Actions on PR #1269 by updating the test file.

Problem

The tests were expecting filtering and sorting behavior that only occurs when metadata with categorized_features is provided; without metadata, the raw JSON message is displayed as-is.

Fix

Updated 4 failing tests to verify the actual implementation behavior.
The changes have been pushed to the PR branch and should trigger a new CI run.
@OpenHands please fix the failing actions on PR #1269 at branch. Only fix the tests, not the implementation.

I'm on it! xingyaoww can track my progress at all-hands.dev
Update test assertions to match the implementation change that displays critic scores with 2 decimal places instead of 4. Co-authored-by: openhands <openhands@all-hands.dev>
Summary

I've fixed the failing tests on PR #1269.

Problem

The tests expected critic scores to be displayed with 4 decimal places (e.g., "0.5070", "0.7500"), but the implementation was changed to display scores with 2 decimal places (e.g., "0.51", "0.75") in an earlier commit.

Fix

Updated the test assertions to match the 2-decimal-place format.
- Replace raw probabilities with qualitative confidence levels (High/Medium/Low)
- Add quality assessment labels (Excellent/Good/Fair/Needs Improvement/Poor)
- Use emoji for sentiment visualization (😊/😐/😟)
- Update section headers to be more user-friendly:
  - 'Critic Score' -> 'Quality Assessment'
  - 'Predicted User Sentiment' -> 'Expected User Response'
  - 'Detected Agent Issues' -> 'Potential Issues'
  - 'Predicted User Follow-Up' -> 'Likely Follow-up'
- Add helper methods for converting scores/probabilities to labels
- Update and expand test coverage for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
- Change header from 'Quality Assessment:' to 'Critic thinks the task is'
- Use more natural language for assessment labels:
  - 'Excellent' -> 'likely successful'
  - 'Good' -> 'probably successful'
  - 'Fair' -> 'might need some adjustments'
  - 'Needs Improvement' -> 'likely needs more work'
  - 'Poor' -> 'probably unsuccessful'
- Update tests to match new wording

Co-authored-by: openhands <openhands@all-hands.dev>
- Reduce from 5 confusing tiers to 3 distinct ones:
  - ✓ 'likely successful' (score >= 0.6) - green
  - ? 'uncertain' (score 0.4-0.6) - yellow
  - ✗ 'likely unsuccessful' (score < 0.4) - red
- Labels are now obviously distinct without needing to know the underlying score

Co-authored-by: openhands <openhands@all-hands.dev>
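The three tiers in the commit above map directly to a threshold function. A sketch using the stated cutoffs (the helper name is hypothetical):

```python
def assessment_label(score: float) -> tuple[str, str]:
    """Map a critic score to (label, color) per the 3-tier scheme above."""
    # Thresholds from the commit message: >=0.6 green, 0.4-0.6 yellow, <0.4 red
    if score >= 0.6:
        return ("likely successful", "green")
    if score >= 0.4:
        return ("uncertain", "yellow")
    return ("likely unsuccessful", "red")
```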
- Remove all emojis from visualization (✓, ?, ✗, 😊, 😐, 😟)
- Change 'Expected User Response' to 'Expected User Sentiment'
- Add confidence level to sentiment display
- Replace opaque confidence labels with descriptive ones:
  - 'High' -> 'very likely'
  - 'Medium' -> 'likely'
  - 'Low' -> 'possible'
- Remove emoji test (no longer needed)

Co-authored-by: openhands <openhands@all-hands.dev>
- Use star rating format: 'Critic: agent success likelihood ★★★☆☆ (65.2%)'
- Each star represents 20% of success likelihood
- Color coded: green (>=60%), yellow (40-60%), red (<40%)
- Remove sentiment display (redundant with success)
- Remove 'Likely Follow-up' section (not helpful for CLI users)
- Keep 'Potential Issues' and 'Infrastructure' sections
- Update tests for new visualization format

Co-authored-by: openhands <openhands@all-hands.dev>
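A sketch of the star-rating line described above, assuming simple truncation at 20% per star (the commit doesn't state the exact rounding rule, so that part is an assumption):

```python
def star_rating(likelihood: float) -> str:
    """Render 'Critic: agent success likelihood ★★★☆☆ (65.2%)' style output."""
    # One filled star per 20% of success likelihood (truncated, capped at 5)
    filled = min(5, int(likelihood * 5))
    stars = "★" * filled + "☆" * (5 - filled)
    return f"Critic: agent success likelihood {stars} ({likelihood:.1%})"
```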
- Change from descriptive labels (possible/likely/very likely) to 'likelihood X%' format for better clarity
- Example: 'Did Not Follow Instruction (likelihood 35%)'
- Remove unused _get_confidence_label method

Co-authored-by: openhands <openhands@all-hands.dev>
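The 'likelihood X%' label from the commit above reduces to a one-line formatter (the helper name is hypothetical):

```python
def issue_label(issue: str, probability: float) -> str:
    # Produces e.g. 'Did Not Follow Instruction (likelihood 35%)'
    return f"{issue} (likelihood {probability:.0%})"
```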
- Re-add 'Likely Follow-up' section for user follow-up patterns
- Re-add 'Other' section for other metrics
- Change 'Potential Issues:' from bold yellow to just bold
- All section headers are now bold (no color)
- CLI can filter these out if needed

Co-authored-by: openhands <openhands@all-hands.dev>
Looks like there are a few issues preventing this PR from being merged!

If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
juanmichelini left a comment:
LGTM! Approved so we can kick off the CLI review process.
Thank you @juanmichelini! Other folks, feel free to chime in if there are issues and I'd be happy to fix them in a follow-up PR!
Summary
This PR introduces an experimental API-based critic system that enables real-time evaluation of agent actions and messages using an external LLM. The critic provides quality scores and feedback to help assess agent performance.
Example:

    uv run examples/01_standalone_sdk/34_critic_example.py

Key Features
1. API-Based Critic Implementation
- `APIBasedCritic`: New critic that calls an external LLM API to evaluate agent actions
- `CriticClient`: HTTP client for secure communication with critic server
- `ChatTemplateRenderer`: Renders conversation history into Qwen3-4B-Instruct-2507 chat template format

2. Agent Integration
- New `critic` parameter to `AgentBase` for optional critic integration
- `finish_and_message` (default): Evaluate on `FinishAction` and agent `MessageEvent`
- `all_actions`: Evaluate after every agent action
- Critic results are attached to `ActionEvent` and `MessageEvent` for visualization
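The two evaluation modes listed above amount to a gating predicate on event kind. A hedged sketch (the event-kind strings and function name are hypothetical, not the SDK's actual API):

```python
def should_evaluate(event_kind: str, mode: str = "finish_and_message") -> bool:
    """Decide whether the critic runs for a given agent event.

    event_kind: "finish" (FinishAction), "message" (agent MessageEvent),
    or "action" (any other agent ActionEvent).
    """
    if mode == "all_actions":
        return True  # evaluate after every agent action
    if mode == "finish_and_message":
        # default: only evaluate terminal/finish actions and agent messages
        return event_kind in ("finish", "message")
    raise ValueError(f"unknown critic mode: {mode}")
```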
3. Enhanced Result Visualization

- `CriticResult` now includes rich text formatting with color-coded probability displays

4. Example and Documentation
- `examples/01_standalone_sdk/34_critic_example.py` demonstrating critic usage
- Documentation of the `/vllm` endpoint

Usage Example
Technical Details
Chat Template Rendering
Security
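The `SecretStr` approach noted in this section keeps the API key out of reprs and logs. A minimal pydantic sketch (the config class and field names are hypothetical; the SDK's actual fields may differ):

```python
from pydantic import BaseModel, SecretStr

class CriticClientConfig(BaseModel):
    # Hypothetical config object illustrating the SecretStr pattern
    server_url: str
    api_key: SecretStr  # masked in repr()/str(), so it can't leak into logs

cfg = CriticClientConfig(server_url="https://critic.example", api_key="sk-demo")
masked = str(cfg.api_key)              # '**********'
real = cfg.api_key.get_secret_value()  # only explicit access reveals the key
```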
- API keys stored as `SecretStr` to prevent accidental exposure

Testing
This feature is highly experimental and subject to change.
Files Changed
- `openhands/sdk/critic/impl/api/` - API-based critic implementation
- `examples/01_standalone_sdk/34_critic_example.py` - Usage example
- `openhands/sdk/agent/agent.py` - Critic integration
- `openhands/sdk/agent/base.py` - Add critic parameter
- `openhands/sdk/critic/base.py` - Add evaluation modes
- `openhands/sdk/critic/result.py` - Enhanced visualization
- `tests/sdk/critic/api/` - Comprehensive test suite

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.12-nodejs22`
- `golang:1.21-bookworm`

Pull (multi-arch manifest)
    # Each variant is a multi-arch manifest supporting both amd64 and arm64
    docker pull ghcr.io/openhands/agent-server:45c7fc7-python

Run
All tags pushed for this build
About Multi-Architecture Support
- Each versioned tag (e.g., `45c7fc7-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `45c7fc7-python-amd64`) are also available if needed