Support Flashinfer FP8 ViT Attn #41
Merged
Summary

- Add FP8 attention support to `MMEncoderAttention` (through FlashInfer and cuDNN)
- Plumb `q_scale`/`k_scale`/`v_scale`/`o_data_type` through the FlashInfer cuDNN wrapper to support FP8 input tensors
- Update `compute_flashinfer_cu_seqlens` in Qwen3-VL to produce correct element offsets for the padded FP8 tensor layout

How to enable
FP8 ViT Attn is only enabled with `--mm-encoder-attn-backend=FLASHINFER`.

The scales JSON referenced by `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH` should map layer names to per-tensor Q/K/V scales, e.g.:

`{"visual.blocks.0.attn": {"q": 0.0123, "k": 0.0456, "v": 0.0789}, ...}`

If `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH` is not set, scale=1.0 is used for all layers (cast-only, no scaling).
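For illustration, a minimal sketch of the scale-loading behavior described above, assuming a hypothetical `load_mm_encoder_fp8_scales` helper (the actual loading lives in `mm_encoder_attention.py`):

```python
# Sketch only: illustrates the scales-JSON contract and the scale=1.0 fallback.
# `load_mm_encoder_fp8_scales` is a hypothetical helper, not code from this PR.
import json
import os


def load_mm_encoder_fp8_scales(layer_name: str) -> tuple[float, float, float]:
    """Return (q_scale, k_scale, v_scale) for one ViT attention layer."""
    path = os.environ.get("VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH")
    if not path:
        # No scales file: cast-only FP8, every scale defaults to 1.0.
        return 1.0, 1.0, 1.0
    with open(path) as f:
        scales = json.load(f)
    layer = scales[layer_name]  # e.g. "visual.blocks.0.attn"
    return layer["q"], layer["k"], layer["v"]
```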
Dependency

Requires a FlashInfer version that supports the `q_scale`/`k_scale`/`v_scale`/`o_data_type` parameters in `cudnn_batch_prefill_with_kv_cache`.
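As a rough sketch under the assumption above, the ViT attention wrapper only needs to forward four extra keyword arguments when FP8 is enabled; `build_fp8_attn_kwargs` and `fp8_enabled` are illustrative names, not the PR's actual code:

```python
# Sketch only: how the new FP8 parameters might be threaded into the cuDNN
# prefill call in vit_attn_wrappers.py. Only the four keyword names below come
# from the FlashInfer dependency; everything else here is a placeholder.
import torch


def build_fp8_attn_kwargs(fp8_enabled: bool, q_scale: float, k_scale: float,
                          v_scale: float, out_dtype: torch.dtype) -> dict:
    """Extra kwargs to pass to cudnn_batch_prefill_with_kv_cache for FP8 inputs."""
    if not fp8_enabled:
        return {}
    return {
        "q_scale": q_scale,        # dequantization scale for FP8 Q
        "k_scale": k_scale,        # dequantization scale for FP8 K
        "v_scale": v_scale,        # dequantization scale for FP8 V
        "o_data_type": out_dtype,  # dtype of the attention output (e.g. bf16)
    }
```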
Files changed

- `vllm/envs.py` -- register `VLLM_MM_ENCODER_FP8_ATTN` and `VLLM_MM_ENCODER_FP8_ATTN_SCALE_PATH`
- `vllm/model_executor/layers/attention/mm_encoder_attention.py` -- FP8 init, scale loading, quantization, FlashInfer integration
- `vllm/model_executor/layers/quantization/input_quant_fp8.py` -- stride-aware Triton kernel for padded FP8 quantization (numerics sketched below)
- `vllm/model_executor/models/qwen3_vl.py` -- FP8-aware cu_seqlens computation
- `vllm/v1/attention/ops/vit_attn_wrappers.py` -- pass FP8 scale/dtype params to cuDNN
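For reference, a plain-PyTorch stand-in for the per-tensor FP8 (e4m3) quantization that the stride-aware Triton kernel in `input_quant_fp8.py` performs; `quant_fp8_per_tensor` is illustrative only and ignores the padded-layout (stride) handling that the real kernel adds:

```python
# Reference numerics only, not the Triton kernel itself: per-tensor FP8 (e4m3)
# quantization with a given scale, clamped to the representable FP8 range.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quant_fp8_per_tensor(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize x with `x / scale`, clamp to [-FP8_MAX, FP8_MAX], cast to FP8."""
    x_scaled = (x.float() / scale).clamp(min=-FP8_MAX, max=FP8_MAX)
    return x_scaled.to(torch.float8_e4m3fn)
```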