Add Qwen3-VL export support for multimodal text-to-text pipeline #214

Open

seyeong-han wants to merge 1 commit into huggingface:main from seyeong-han:main

Conversation

seyeong-han commented Feb 19, 2026

Overview

Enables exporting Qwen3-VL vision-language models through the multimodal-text-to-text task. Qwen3-VL uses M-RoPE (Multimodal Rotary Position Embedding) and a Conv3d-based vision encoder, both of which require special handling during torch.export.

Changes

Three changes to the export pipeline:

  1. M-RoPE vision encoder positions — The Qwen3-VL visual encoder computes position embeddings via data-dependent ops (torch.linspace, repeat_interleave on image_grid_thw) that torch.export cannot trace. VisionExportableModule now pre-computes pos_embeds, rotary_pos_emb, and cu_seqlens eagerly and stores them as buffers so they become constants in the exported graph (first sketch below).

  2. M-RoPE text decoder hook — During text decoder export, only inputs_embeds and cache_position are provided (no input_ids). M-RoPE models call get_rope_index, which requires input_ids, and crash. A forward pre-hook injects position_ids derived from cache_position so the model skips that code path (second sketch below).

  3. Model loading / modality detection — AutoModelForPreTraining doesn't resolve Qwen3-VL, so we fall back to AutoModelForImageTextToText. Modality detection now handles models that report more than two modalities (Qwen3-VL reports ("image", "video", "text")) by picking the first supported non-text modality (third sketch below).
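
For change 1, the rough shape of the approach is sketched below. This is a minimal illustration, not the actual VisionExportableModule: the compute_vision_positions argument stands in for the Qwen3-VL-specific math (torch.linspace, repeat_interleave over image_grid_thw) that must run eagerly, and the keyword arguments passed to the wrapped encoder in forward are assumptions rather than the real encoder signature.

```python
import torch

class PrecomputedPositionsVisionWrapper(torch.nn.Module):
    """Minimal sketch: run the data-dependent position math once in eager mode,
    register the results as buffers, and let torch.export treat them as constants."""

    def __init__(self, visual_encoder, image_grid_thw, compute_vision_positions):
        super().__init__()
        self.visual = visual_encoder
        # compute_vision_positions is a hypothetical helper standing in for the
        # model-specific code that torch.export cannot trace.
        pos_embeds, rotary_pos_emb, cu_seqlens = compute_vision_positions(
            visual_encoder, image_grid_thw
        )
        self.register_buffer("pos_embeds", pos_embeds)
        self.register_buffer("rotary_pos_emb", rotary_pos_emb)
        self.register_buffer("cu_seqlens", cu_seqlens)

    def forward(self, pixel_values):
        # Assumed call signature for illustration: the real wrapper wires the
        # pre-computed buffers into Qwen3-VL's own vision forward.
        return self.visual(
            pixel_values,
            pos_embeds=self.pos_embeds,
            rotary_pos_emb=self.rotary_pos_emb,
            cu_seqlens=self.cu_seqlens,
        )
```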

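For change 2, a minimal sketch of the pre-hook idea follows, assuming the text decoder is invoked with keyword arguments during export; the hook name and the exact way position_ids are broadcast to M-RoPE's (3, batch, seq_len) layout are illustrative, not the PR's literal code.

```python
import torch

def register_mrope_position_ids_hook(text_decoder: torch.nn.Module):
    """Inject position_ids derived from cache_position so the decoder never
    reaches get_rope_index (which would require the absent input_ids)."""

    def inject_position_ids(module, args, kwargs):
        cache_position = kwargs.get("cache_position")
        if cache_position is not None and kwargs.get("position_ids") is None:
            # M-RoPE uses position_ids of shape (3, batch, seq_len); for pure text
            # decoding all three rotary sections share the same positions.
            kwargs["position_ids"] = cache_position.unsqueeze(0).expand(3, 1, -1)
        return args, kwargs

    return text_decoder.register_forward_pre_hook(inject_position_ids, with_kwargs=True)
```

The handle returned by register_forward_pre_hook can be removed with handle.remove() once export is done, so the hook does not leak into normal inference.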
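
For change 3, the loading fallback and modality selection could look roughly like the following; AutoModelForPreTraining and AutoModelForImageTextToText are the actual transformers classes involved, but the function names, the exception types caught, and the supported-modality list are assumptions for illustration.

```python
from transformers import AutoModelForImageTextToText, AutoModelForPreTraining

def load_multimodal_model(model_id: str, **kwargs):
    try:
        return AutoModelForPreTraining.from_pretrained(model_id, **kwargs)
    except (ValueError, KeyError):
        # Qwen3-VL is not registered for AutoModelForPreTraining, so fall back
        # to the image-text-to-text auto class.
        return AutoModelForImageTextToText.from_pretrained(model_id, **kwargs)

def pick_encoder_modality(modalities, supported=("image", "audio")):
    # Qwen3-VL reports ("image", "video", "text"); take the first supported
    # non-text modality instead of assuming exactly two modalities.
    for modality in modalities:
        if modality != "text" and modality in supported:
            return modality
    raise ValueError(f"No supported non-text modality in {modalities}")
```
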
```bash
optimum-cli export executorch \
  --model "Qwen/Qwen3-VL-2B-Instruct" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qlinear_group_size 32 \
  --qlinear_encoder "8da4w,8da8w" \
  --qlinear_encoder_group_size 32 \
  --qembedding "8w" \
  --qembedding_encoder "8w" \
  --dtype "float32" \
  --output_dir="qwen3/Qwen3-VL-2B-Instruct-xnnpack"
```

The quantized model is ~1.4 GB, down from ~4.4 GB in bf16.
The text decoder runs at ~25-29 tok/s on Apple Silicon via XNNPACK.

Add support for M-RoPE vision encoders and improve multimodal model loading

- Introduced pre-computation of position-related values for M-RoPE vision encoders in VisionExportableModule so the vision encoder can be traced by torch.export.
- Added a method to register a forward pre-hook for M-RoPE models to inject position_ids during text decoder export, preventing crashes due to missing input_ids.
- Updated load_multimodal_text_to_text_model to fall back to AutoModelForImageTextToText when AutoModelForPreTraining fails, ensuring compatibility with models such as Qwen3-VL that the default auto class cannot resolve.
- Enhanced modality detection logic to correctly identify the primary non-text modality for multimodal models, improving robustness in model loading.
seyeong-han changed the title from "Add support for M-RoPE vision encoders and improve multimodal model loading" to "Add Qwen3-VL export support for multimodal text-to-text pipeline" on Feb 19, 2026