fix: improve vision detection for proxy model names #1795
Conversation
When using models through a proxy (e.g., `litellm_proxy/openai/gpt-4o`), vision detection was failing because `litellm.supports_vision()` returns `False` for the full proxy path but `True` for the provider/model format. This fix tries multiple model name variants:

1. Full model name (`litellm_proxy/openai/gpt-4o`)
2. Provider/model format (`openai/gpt-4o`)
3. Just the model name (`gpt-4o`)

This ensures vision support is correctly detected for models accessed through evaluation proxies like litellm_proxy. Added test cases for the `litellm_proxy/openai/*` and `litellm_proxy/gemini/*` formats that are commonly used in evaluations.

Co-authored-by: openhands <openhands@all-hands.dev>
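A minimal sketch of that variant-trying approach, assuming litellm's public `litellm.supports_vision()` helper; the function and helper names here are illustrative, not the PR's exact code:

```python
import litellm


def _vision_name_variants(model: str) -> list[str]:
    # Hypothetical helper: derive candidate names from a proxy-style
    # path such as 'litellm_proxy/openai/gpt-4o'.
    parts = model.split("/")
    variants = [model]                        # 1. full model name
    if len(parts) >= 3:
        variants.append("/".join(parts[1:]))  # 2. provider/model
    if len(parts) >= 2:
        variants.append(parts[-1])            # 3. bare model name
    return variants


def supports_vision(model: str) -> bool:
    # True if litellm recognizes any variant as vision-capable.
    return any(litellm.supports_vision(model=v) for v in _vision_name_variants(model))
```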
all-hands-bot left a comment
Overall Assessment: This PR correctly fixes the vision detection bug for proxy model names. The implementation is clean, the logic is sound, and test coverage is appropriate. I have one minor suggestion for comment clarity below.
Does not seem to work; testing this on https://storage.googleapis.com/openhands-evaluation-results/eval-21259253300-gemini-3-p_litellm_proxy-gemini-gemini-3-pro-preview_26-01-22-18-22.tar.gz

Result:
enyst left a comment
Thank you for looking into this, I was sure there was scope to improve!

I do have a little question, though: is it possible to use `canonical_name` from the LLM class? I believe we added it relatively recently to account for the fact that litellm proxies might be configured with different model names.

Or is that a bad idea? If so, maybe we could clean it up if it's useless?
```python
# remove when litellm is updated to fix https://github.com/BerriAI/litellm/issues/5608  # noqa: E501
# Check both the full model name and the name after proxy prefix for vision support  # noqa: E501
# Check multiple formats for vision support to handle proxy prefixes like 'litellm_proxy/provider/model'  # noqa: E501
model_for_caps = self._model_name_for_capabilities()
```
I believe `_model_name_for_capabilities()` was intended to figure out the "real" model, so that vision ability can be detected correctly.

Sorry, I might not fully understand the problem here; I just wonder, is there a reason why we can't use it?
This adds logging to `format_messages_for_llm` and `format_messages_for_responses` to explicitly show when `vision_enabled=True` and images are being included. This helps diagnose the perceived `vision_enabled=false` issue in CI logs, which actually shows the `Message` default value before formatting, not the actual state at LLM call time.

Co-authored-by: openhands <openhands@all-hands.dev>
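As a rough sketch of what such logging could look like (the method and field names follow the commit message above, but the class shape, implementation, and log format here are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


class LLM:
    # Simplified stand-in for the SDK's LLM class.
    def vision_is_active(self) -> bool: ...

    def format_messages_for_llm(self, messages):
        vision_active = self.vision_is_active()
        for message in messages:
            # Set the real vision state at call time; the value persisted
            # in events earlier may still show the Message default (False).
            message.vision_enabled = vision_active
        if vision_active:
            logger.info(
                "format_messages_for_llm: vision_enabled=True, "
                "including images for %d message(s)",
                len(messages),
            )
        return messages
```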
Investigation Results

After thoroughly analyzing the codebase, I believe I have found the root cause of the perceived `vision_enabled=false` issue.

Key Findings
Fix Applied

I've added logging to `format_messages_for_llm` and `format_messages_for_responses`.
This will appear in the logs and should help confirm in CI that vision IS being used, even if the persisted event shows the default value.

To Further Verify

When running the next CI eval with this commit, look for the new log messages that explicitly confirm vision usage at LLM call time.
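To illustrate the scenario, here is a hypothetical, self-contained sketch; the SDK's actual `Message` class and event persistence are more involved:

```python
from dataclasses import dataclass, asdict


@dataclass
class Message:
    # Simplified stand-in for the SDK's Message; vision is off by default.
    content: str
    vision_enabled: bool = False


# The event is persisted when the message is created...
msg = Message(content="describe this screenshot")
persisted_event = asdict(msg)  # {'content': ..., 'vision_enabled': False}

# ...but the flag is only flipped later, when formatting for the LLM call.
msg.vision_enabled = True  # e.g. inside format_messages_for_llm

print(persisted_event["vision_enabled"])  # False <- what the CI logs show
print(msg.vision_enabled)                 # True  <- actual state at call time
```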
@OpenHands Read all the comments, install yourself. I'm curious about this part of the last comment:

Messages are stored in events? Investigate the code and show me the code that proves the scenario described here.
I'm on it! enyst can track my progress at all-hands.dev
Summary of work (no code changes made):

Checklist against your request:

Since there were no repository changes, there is no git diff to review and nothing to push.
Ahh. Now I understand the origin of the problem. Totally agree that's wrong; I think there's no point in persisting information that turns out to be false when actually computed. I think the other PR is fixing it?
Summary
This PR fixes a critical bug in vision detection that was causing images to be stripped from multimodal evaluations when using proxy model names.
Problem
For proxy model names like `litellm_proxy/openai/gpt-4o` or `litellm_proxy/anthropic/claude-opus-4-5-20251101`, the previous code only tried the full path and the last segment (model name only), but missed the `provider/model` format, which is what litellm recognizes for many models.

This was causing `vision_is_active()` to return `False` for vision-capable models accessed through evaluation proxies, resulting in images being stripped from messages in multimodal benchmarks like SWE-bench Multimodal.

Solution
Updated `_supports_vision()` to try multiple model name variants:

1. `litellm_proxy/anthropic/claude-opus-4-5-20251101`
2. `anthropic/claude-opus-4-5-20251101` ← This was missing
3. `claude-opus-4-5-20251101`
Testing

Added test cases for the `litellm_proxy/openai/*` and `litellm_proxy/gemini/*` formats.
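A hypothetical pytest sketch of what such cases might look like, reusing the illustrative `supports_vision()` helper from the earlier sketch (the actual tests in the PR may be named and structured differently):

```python
import pytest


@pytest.mark.parametrize(
    "model",
    [
        "litellm_proxy/openai/gpt-4o",
        "litellm_proxy/gemini/gemini-1.5-pro",
    ],
)
def test_vision_detected_for_proxy_models(model):
    # Both underlying models are vision-capable; detection should
    # survive the 'litellm_proxy/' prefix.
    assert supports_vision(model)
```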
Impact

This fix should improve SWE-bench Multimodal scores by ensuring images are actually sent to the model when using proxy configurations in evaluations.
Related
Found during investigation of low SWE-bench Multimodal evaluation scores.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
| Variant | Base Image |
| --- | --- |
| java | `eclipse-temurin:17-jdk` |
| python | `nikolaik/python-nodejs:python3.12-nodejs22` |
| golang | `golang:1.21-bookworm` |

Pull (multi-arch manifest)
```bash
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:602b6d6-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
• Each variant tag (e.g., `602b6d6-python`) is a multi-arch manifest supporting both amd64 and arm64
• Arch-specific tags (e.g., `602b6d6-python-amd64`) are also available if needed