
Conversation

@neubig neubig commented Jan 22, 2026

Summary

This PR fixes a critical bug in vision detection that was causing images to be stripped from multimodal evaluations when using proxy model names.

Problem

For proxy model names like litellm_proxy/openai/gpt-4o or litellm_proxy/anthropic/claude-opus-4-5-20251101:

litellm.supports_vision("litellm_proxy/openai/gpt-4o")  # → False ❌
litellm.supports_vision("openai/gpt-4o")                # → True ✅
litellm.supports_vision("gpt-4o")                        # → True ✅

The previous code only tried the full path and the last segment (the model name alone), but missed the provider/model format, which is what litellm recognizes for many models.

This was causing vision_is_active() to return False for vision-capable models accessed through evaluation proxies, resulting in images being stripped from messages in multimodal benchmarks like SWE-bench Multimodal.

Solution

Updated _supports_vision() to try multiple model name variants (a minimal sketch follows the list):

  1. Full model name: litellm_proxy/anthropic/claude-opus-4-5-20251101
  2. Provider/model format: anthropic/claude-opus-4-5-20251101 (this was the missing variant)
  3. Just model name: claude-opus-4-5-20251101
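
A minimal sketch of this fallback order, assuming litellm.supports_vision() as the underlying check (illustrative only; the actual SDK helper may be structured differently):

import litellm

def _supports_vision(model: str) -> bool:
    # Sketch of the three variants described above; not the exact SDK code.
    parts = model.split("/")
    candidates = [model]                         # 1. full name, e.g. litellm_proxy/openai/gpt-4o
    if len(parts) >= 3:
        candidates.append("/".join(parts[-2:]))  # 2. provider/model, e.g. openai/gpt-4o
    if len(parts) >= 2:
        candidates.append(parts[-1])             # 3. bare model name, e.g. gpt-4o
    for candidate in candidates:
        try:
            if litellm.supports_vision(candidate):
                return True
        except Exception:
            # Unknown model strings may raise; fall through to the next variant.
            continue
    return False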

Testing

  • All 18 existing vision tests pass
  • Added 3 new test cases for litellm_proxy/openai/* and litellm_proxy/gemini/* formats

Impact

This fix should improve SWE-bench Multimodal scores by ensuring images are actually sent to the model when using proxy configurations in evaluations.

Related

Found during investigation of low SWE-bench Multimodal evaluation scores.



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant  Architectures  Base Image
java     amd64, arm64   eclipse-temurin:17-jdk
python   amd64, arm64   nikolaik/python-nodejs:python3.12-nodejs22
golang   amd64, arm64   golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:602b6d6-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-602b6d6-python \
  ghcr.io/openhands/agent-server:602b6d6-python

All tags pushed for this build

ghcr.io/openhands/agent-server:602b6d6-golang-amd64
ghcr.io/openhands/agent-server:602b6d6-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:602b6d6-golang-arm64
ghcr.io/openhands/agent-server:602b6d6-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:602b6d6-java-amd64
ghcr.io/openhands/agent-server:602b6d6-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:602b6d6-java-arm64
ghcr.io/openhands/agent-server:602b6d6-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:602b6d6-python-amd64
ghcr.io/openhands/agent-server:602b6d6-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:602b6d6-python-arm64
ghcr.io/openhands/agent-server:602b6d6-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:602b6d6-golang
ghcr.io/openhands/agent-server:602b6d6-java
ghcr.io/openhands/agent-server:602b6d6-python

About Multi-Architecture Support

  • Each variant tag (e.g., 602b6d6-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 602b6d6-python-amd64) are also available if needed

When using models through a proxy (e.g., litellm_proxy/openai/gpt-4o),
the vision detection was failing because litellm.supports_vision() returns
False for the full proxy path but True for the provider/model format.

This fix tries multiple model name variants:
1. Full model name (litellm_proxy/openai/gpt-4o)
2. Provider/model format (openai/gpt-4o)
3. Just the model name (gpt-4o)

This ensures vision support is correctly detected for models accessed
through evaluation proxies like litellm_proxy.

Added test cases for litellm_proxy/openai/* and litellm_proxy/gemini/*
formats that are commonly used in evaluations.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review January 22, 2026 17:43

github-actions bot commented Jan 22, 2026

Coverage

Coverage Report

File                                      Stmts  Miss  Cover  Missing
openhands-sdk/openhands/sdk/llm/llm.py      427    65    84%  350, 371–372, 408, 572, 673, 701, 775–780, 900, 903–906, 952, 1075, 1108–1109, 1118, 1131, 1133–1138, 1140–1157, 1160–1164, 1166–1167, 1173–1182
TOTAL                                     16339  4789    70%

@all-hands-bot (Collaborator) left a comment

Overall Assessment: This PR correctly fixes the vision detection bug for proxy model names. The implementation is clean, the logic is sound, and test coverage is appropriate. I have one minor suggestion for comment clarity below.

@juanmichelini juanmichelini self-requested a review January 22, 2026 18:02
@juanmichelini (Collaborator) commented

This does not seem to work; I tested it on https://storage.googleapis.com/openhands-evaluation-results/eval-21259253300-gemini-3-p_litellm_proxy-gemini-gemini-3-pro-preview_26-01-22-18-22.tar.gz

Result

  • vision_enabled: false
  • Images present: Yes (1 image in messages)
  • Model: litellm_proxy/gemini/gemini-3-pro-preview (supports vision)
  • Benchmark: swebenchmultimodal
  • SDK commit: e4e6a1e
  • Test size: Only 1 instance (eval_limit: 1)

@enyst (Collaborator) left a comment

Thank you for looking into this; I was sure there was scope to improve!

I do have a little question though: is it possible to use canonical_name from the LLM class? I believe we added it relatively recently to account for the fact that litellm proxies might be configured with different model names.

Or is that a bad idea? If so, maybe we could clean it up if it turns out to be useless.

# remove when litellm is updated to fix https://github.com/BerriAI/litellm/issues/5608 # noqa: E501
# Check both the full model name and the name after proxy prefix for vision support # noqa: E501
# Check multiple formats for vision support to handle proxy prefixes like 'litellm_proxy/provider/model' # noqa: E501
model_for_caps = self._model_name_for_capabilities()
Collaborator

I believe _model_name_for_capabilities() is intended to figure out the "real" model, so that vision ability can be detected correctly.

Sorry, I might not fully understand the problem here; I just wonder, is there a reason why we can't use it?

This adds logging to format_messages_for_llm and format_messages_for_responses
to explicitly show when vision_enabled=True and images are being included.

This helps diagnose the perceived vision_enabled=false issue in CI logs,
which actually shows the Message default value before formatting, not the
actual state at LLM call time.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig (Contributor, Author) commented Jan 23, 2026

Investigation Results

After thoroughly analyzing the codebase, I believe I have found the root cause of the perceived vision_enabled=false issue in Datadog logs.

Key Findings

  1. Vision detection IS working correctly for proxy model names like litellm_proxy/gemini/gemini-3-pro-preview. Local tests confirm that:

    • llm.vision_is_active() returns True
    • Formatted messages include images when sent to the LLM ✓
    • The PR's prefix-stripping logic works as expected ✓
  2. The vision_enabled=false in logs is expected behavior:

    • The Message class has vision_enabled: bool = False as the default
    • When messages are created in benchmarks, they have vision_enabled=False
    • Messages are stored in events and persisted with this default value
    • The format_messages_for_llm() method sets vision_enabled=True on a DEEP COPY of messages right before the LLM API call (see the sketch after this list)
    • The original messages in events retain vision_enabled=False
    • When these events are logged to Datadog, they show the pre-formatted state
  3. Images ARE being sent to the LLM. The formatted messages (what's actually sent to the LLM API) correctly include the images when vision is supported.
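
A minimal sketch of finding 2, with simplified, assumed class shapes (not the SDK code; Message is reduced to the one relevant field):

import copy
from pydantic import BaseModel

class Message(BaseModel):
    vision_enabled: bool = False   # default that gets persisted with events
    # ... content fields omitted

def format_messages_for_llm(messages: list[Message], vision_active: bool) -> list[Message]:
    # The copies sent to the LLM get the flag set; the originals stored in
    # events keep the default value.
    formatted = copy.deepcopy(messages)
    for message in formatted:
        message.vision_enabled = vision_active
    return formatted

stored = [Message()]                        # what gets persisted and logged
sent = format_messages_for_llm(stored, vision_active=True)
assert stored[0].vision_enabled is False    # logs show the pre-format default
assert sent[0].vision_enabled is True       # the actual LLM call sees vision enabled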

Fix Applied

I've added logging to format_messages_for_llm() and format_messages_for_responses() that explicitly shows:

  • vision_enabled=True/False at format time
  • The number of images being included
  • The model name

This will appear in logs like:

Formatting messages with vision_enabled=True, including 1 image(s) for model litellm_proxy/gemini/gemini-3-pro-preview

This should help confirm in CI logs that vision IS being used, even if the persisted event shows the default vision_enabled=False.
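
A hypothetical sketch of such a log call (the exact wording, placement, and variable names in the SDK may differ):

import logging

logger = logging.getLogger(__name__)

def log_vision_formatting(vision_active: bool, image_count: int, model: str) -> None:
    # Illustrative helper mirroring the log line quoted above; names here are assumptions.
    logger.info(
        "Formatting messages with vision_enabled=%s, including %d image(s) for model %s",
        vision_active, image_count, model,
    )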

To Further Verify

When running the next CI eval with this commit, look for the new log messages that explicitly confirm vision usage at LLM call time.

@enyst
Copy link
Collaborator

enyst commented Jan 23, 2026

@OpenHands Read all the comments; install gh for yourself if you don’t have it, so that it’s easier.

I’m curious about this part of the last comment:

Messages are stored in events and persisted with this default value
The format_messages_for_llm() method sets vision_enabled=True on a DEEP COPY of messages right before the LLM API call
The original messages in events retain vision_enabled=False

Messages are stored in events? Investigate the code and show me the code that proves the scenario described here.

@openhands-ai
Copy link

openhands-ai bot commented Jan 23, 2026

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Jan 23, 2026

Summary of work (no code changes made):

  • Installed GitHub CLI (gh) in the environment via apt so it’s available for repo/PR inspection.
  • Investigated the SDK code to validate (or refute) the claim that “messages are stored in events and persisted with vision_enabled=False, while format_messages_for_llm sets vision_enabled=True only on a deep copy right before the LLM call.”
  • Located and cited the exact code paths proving this behavior (sketched after this list):
    • MessageEvent persists an llm_message: Message in the event stream (openhands-sdk/openhands/sdk/event/llm_convertible/message.py).
    • Message.vision_enabled defaults to False (openhands-sdk/openhands/sdk/llm/message.py).
    • LLM.format_messages_for_llm() performs copy.deepcopy(messages) and then sets message.vision_enabled = vision_active on the copied messages (openhands-sdk/openhands/sdk/llm/llm.py).
    • MessageEvent.to_llm_message() returns a deep copy of the stored llm_message, preventing mutation of the persisted event payload (openhands-sdk/openhands/sdk/event/llm_convertible/message.py).
    • Event→message conversion uses LLMConvertibleEvent.events_to_messages(...) which calls event.to_llm_message() (openhands-sdk/openhands/sdk/event/base.py).
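
A rough sketch of the cited MessageEvent path, with simplified, assumed shapes (see the files listed above for the real definitions):

import copy
from pydantic import BaseModel

class Message(BaseModel):
    vision_enabled: bool = False

class MessageEvent(BaseModel):
    llm_message: Message              # persisted as part of the event stream

    def to_llm_message(self) -> Message:
        # Returning a deep copy means later mutation (e.g. enabling vision at
        # format time) never writes back into the persisted event payload.
        return copy.deepcopy(self.llm_message)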

Checklist against your request:

  • Read/investigated relevant code paths in the repo
  • Installed gh when it wasn’t present
  • Provided the specific code that proves the described scenario
  • Did not modify repository code or comment on the PR

Since there were no repository changes, there is no git diff to review and nothing to push.

View full conversation

@enyst
Copy link
Collaborator

enyst commented Jan 23, 2026

MessageEvent persists an llm_message: Message in the event stream (openhands-sdk/openhands/sdk/event/llm_convertible/message.py).

Ahh, now I understand the origin of the problem. I totally agree that’s wrong; there’s no point in persisting information that becomes false when it is actually computed. I think the other PR is fixing it?

