Support Flashinfer FP8 ViT Attn #41
Merged
Summary

- Add FP8 attention support to `MMEncoderAttention` (through FlashInfer and cuDNN)
- Plumb `q_scale`/`k_scale`/`v_scale`/`o_data_type` through the FlashInfer cuDNN wrapper to support FP8 input tensors
- Update `compute_flashinfer_cu_seqlens` in Qwen3-VL to produce correct element offsets for the padded FP8 tensor layout

How to enable
FP8 ViT Attn is only enabled with `--mm-encoder-attn-backend=FLASHINFER`.

The scales JSON referenced by `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH` should map layer names to per-tensor Q/K/V scales, e.g.:

`{"visual.blocks.0.attn": {"q": 0.0123, "k": 0.0456, "v": 0.0789}, ...}`

If `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH` is not set, scale=1.0 is used for all layers (cast-only, no scaling).
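For illustration, a minimal sketch of the scale-loading behavior described above, assuming a hypothetical `load_mm_encoder_fp8_scales` helper (the actual loading lives in `mm_encoder_attention.py`):

```python
# Sketch only: illustrates the scales-JSON contract and the scale=1.0 fallback.
# `load_mm_encoder_fp8_scales` is a hypothetical helper, not code from this PR.
import json
import os


def load_mm_encoder_fp8_scales(layer_name: str) -> tuple[float, float, float]:
    """Return (q_scale, k_scale, v_scale) for one ViT attention layer."""
    path = os.environ.get("VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH")
    if not path:
        # No scales file: cast-only FP8, every scale defaults to 1.0.
        return 1.0, 1.0, 1.0
    with open(path) as f:
        scales = json.load(f)
    layer = scales[layer_name]  # e.g. "visual.blocks.0.attn"
    return layer["q"], layer["k"], layer["v"]
```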
Dependency

Requires a FlashInfer version that supports the `q_scale`/`k_scale`/`v_scale`/`o_data_type` parameters in `cudnn_batch_prefill_with_kv_cache`.
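As a rough sketch under the assumption above, the ViT attention wrapper only needs to forward four extra keyword arguments when FP8 is enabled; `build_fp8_attn_kwargs` and `fp8_enabled` are illustrative names, not the PR's actual code:

```python
# Sketch only: how the new FP8 parameters might be threaded into the cuDNN
# prefill call in vit_attn_wrappers.py. Only the four keyword names below come
# from the FlashInfer dependency; everything else here is a placeholder.
import torch


def build_fp8_attn_kwargs(fp8_enabled: bool, q_scale: float, k_scale: float,
                          v_scale: float, out_dtype: torch.dtype) -> dict:
    """Extra kwargs to pass to cudnn_batch_prefill_with_kv_cache for FP8 inputs."""
    if not fp8_enabled:
        return {}
    return {
        "q_scale": q_scale,        # dequantization scale for FP8 Q
        "k_scale": k_scale,        # dequantization scale for FP8 K
        "v_scale": v_scale,        # dequantization scale for FP8 V
        "o_data_type": out_dtype,  # dtype of the attention output (e.g. bf16)
    }
```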
Files changed

- `vllm/envs.py` -- register `VLLM_MM_ENCODER_FP8_ATTN` and `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH`
- `vllm/model_executor/layers/attention/mm_encoder_attention.py` -- FP8 init, scale loading, quantization, FlashInfer integration
- `vllm/model_executor/layers/quantization/input_quant_fp8.py` -- stride-aware Triton kernel for padded FP8 quantization (numerics sketched below)
- `vllm/model_executor/models/qwen3_vl.py` -- FP8-aware cu_seqlens computation
- `vllm/v1/attention/ops/vit_attn_wrappers.py` -- pass FP8 scale/dtype params to cuDNN
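For reference, a plain-PyTorch stand-in for the per-tensor FP8 (e4m3) quantization that the stride-aware Triton kernel in `input_quant_fp8.py` performs; `quant_fp8_per_tensor` is illustrative only and ignores the padded-layout (stride) handling that the real kernel adds:

```python
# Reference numerics only, not the Triton kernel itself: per-tensor FP8 (e4m3)
# quantization with a given scale, clamped to the representable FP8 range.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quant_fp8_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize x with `x / scale`, clamp to [-FP8_MAX, FP8_MAX], cast to FP8."""
    x_scaled = (x.float() / scale).clamp(min=-FP8_MAX, max=FP8_MAX)
    return x_scaled.to(torch.float8_e4m3fn)
```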