zhandaz commented on Feb 8, 2026

Summary

  • Add FP8 quantization support for ViT attention in MMEncoderAttention (through FlashInfer and cuDNN)
  • Implement a stride-aware Triton kernel for FP8 quantization that handles non-contiguous QKV views and pads head_dim to a multiple of 16 (e.g., 72 -> 80) without extra copies (a sketch follows this list)
  • Plumb q_scale/k_scale/v_scale/o_data_type through the FlashInfer cuDNN wrapper to support FP8 input tensors
  • Adjust compute_flashinfer_cu_seqlens in Qwen3-VL to produce correct element offsets for the padded FP8 tensor layout
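A minimal torch-level sketch of the padding-and-quantization step, assuming a per-tensor scale and a (num_tokens, num_heads, head_dim) QKV view; the function name and signature are illustrative, not the actual Triton kernel in input_quant_fp8.py:

import torch

def quantize_fp8_padded(x: torch.Tensor, scale: float, pad_multiple: int = 16) -> torch.Tensor:
    # Illustrative only: quantize a (num_tokens, num_heads, head_dim) view to FP8
    # with a per-tensor scale and zero-pad head_dim up to the next multiple of
    # pad_multiple (e.g. 72 -> 80).
    num_tokens, num_heads, head_dim = x.shape
    padded_dim = ((head_dim + pad_multiple - 1) // pad_multiple) * pad_multiple
    out = torch.zeros(num_tokens, num_heads, padded_dim,
                      dtype=torch.float8_e4m3fn, device=x.device)
    finfo = torch.finfo(torch.float8_e4m3fn)
    scaled = (x.float() / scale).clamp(finfo.min, finfo.max)
    out[..., :head_dim] = scaled.to(torch.float8_e4m3fn)
    return out

The actual kernel reads the non-contiguous QKV view through its strides and writes the padded FP8 output directly, avoiding the intermediate float copy shown above.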

How to enable

FP8 ViT attention is only enabled with --mm-encoder-attn-backend=FLASHINFER:

export VLLM_MM_ENCODER_FP8_ATTN=1
# Scales of 1.0 are fine in our case; if necessary, scales can be provided via:
# export VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH=/path/to/scales.json

The scales JSON should map layer names to per-tensor Q/K/V scales, e.g.:

{"visual.blocks.0.attn": {"q": 0.0123, "k": 0.0456, "v": 0.0789}, ...}

If VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH is not set, scale=1.0 is used for all layers (cast-only, no scaling).
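A sketch of how the scale lookup and the 1.0 fallback could work; load_vit_attn_scales is a hypothetical helper name, not the actual code path in mm_encoder_attention.py:

import json
from typing import Optional, Tuple

def load_vit_attn_scales(path: Optional[str], layer_name: str) -> Tuple[float, float, float]:
    # Hypothetical helper: resolve per-tensor Q/K/V scales for one ViT attention
    # layer, falling back to 1.0 (cast-only) when no scale file is configured.
    if path is None:
        return 1.0, 1.0, 1.0
    with open(path) as f:
        scales = json.load(f)
    entry = scales[layer_name]  # e.g. "visual.blocks.0.attn"
    return float(entry["q"]), float(entry["k"]), float(entry["v"])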

Dependency

Files changed

  • vllm/envs.py -- register VLLM_MM_ENCODER_FP8_ATTN and VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH
  • vllm/model_executor/layers/attention/mm_encoder_attention.py -- FP8 init, scale loading, quantization, FlashInfer integration
  • vllm/model_executor/layers/quantization/input_quant_fp8.py -- stride-aware Triton kernel for padded FP8 quantization
  • vllm/model_executor/models/qwen3_vl.py -- FP8-aware cu_seqlens computation (see the sketch after this list)
  • vllm/v1/attention/ops/vit_attn_wrappers.py -- pass FP8 scale/dtype params to cuDNN
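For the cu_seqlens adjustment, the idea is that the ragged offsets handed to the cuDNN/FlashInfer path must be expressed against the padded FP8 buffer rather than the logical head_dim. A hypothetical illustration of that conversion; the actual compute_flashinfer_cu_seqlens logic in qwen3_vl.py may differ:

import torch

def cu_seqlens_to_element_offsets(cu_seqlens: torch.Tensor,
                                  num_heads: int,
                                  padded_head_dim: int) -> torch.Tensor:
    # Hypothetical illustration: token-based cu_seqlens become element offsets
    # into a (total_tokens, num_heads, padded_head_dim) FP8 buffer, so the
    # ragged layout steps by the padded head_dim (e.g. 80) rather than the
    # logical one (e.g. 72).
    return cu_seqlens * num_heads * padded_head_dim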

zhandaz and others added 10 commits February 7, 2026 19:54
wangshangsam merged commit 6336bec into mlperf-inf-mm-q3vl-v6.0 on Feb 8, 2026
1 check passed