Issues with reproducing results #2

I am testing Qwen3-VL-235B-A22B-Instruct as a replacement for GPT-5. Could you share the specific experimental parameters you used for your GPT-5 sampling?

My current configuration is as follows:

Args: Namespace(task='shopping_admin', task_ids=None, exp='qwen235B-VL-Instruct', rerun=True, retry=False, model_name='Qwen/Qwen3-VL-235B-A22B-Instruct', visual_effects=True, use_html=False, use_axtree=False, use_screenshot=True, use_som=True, mode='bid', tips=False, headless=True, use_full_action_history=True)
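To make the comparison easier, here is a minimal sketch of how I would diff my arguments against a reference configuration. The `diff_args` helper and the `reference` values are my own placeholders for illustration; they are not part of the repository and do not reflect your actual GPT-5 settings.

```python
from argparse import Namespace

MISSING = "<missing>"  # marker for a key present in only one config


def diff_args(mine: Namespace, theirs: Namespace) -> dict:
    """Return {key: (my_value, their_value)} for every key whose values differ."""
    a, b = vars(mine), vars(theirs)
    diff = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, MISSING), b.get(key, MISSING)
        if va != vb:
            diff[key] = (va, vb)
    return diff


# My configuration, copied from the Args line above.
mine = Namespace(
    task="shopping_admin", task_ids=None, exp="qwen235B-VL-Instruct",
    rerun=True, retry=False, model_name="Qwen/Qwen3-VL-235B-A22B-Instruct",
    visual_effects=True, use_html=False, use_axtree=False,
    use_screenshot=True, use_som=True, mode="bid", tips=False,
    headless=True, use_full_action_history=True,
)

# Hypothetical reference: a made-up text-only variant, NOT the authors' settings.
reference = Namespace(**{**vars(mine), "use_axtree": True,
                         "use_screenshot": False, "use_som": False})

for key, (my_value, ref_value) in diff_args(mine, reference).items():
    print(f"{key}: mine={my_value!r} reference={ref_value!r}")
```

If you can post the Namespace printed by your GPT-5 run, I can substitute it for `reference` and report exactly which flags differ.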

Are these key parameters consistent with those used in your experiments? Also, did you use vision-based or text-based input for GPT-5? Regarding the observation space, does use_screenshot together with use_som (Set-of-Mark) tend to yield better results than use_axtree?

Finally, are there any models you would recommend besides GPT-5? For instance, would DeepSeek-V3.1 combined with use_axtree be a viable alternative?
