@zhandaz commented Feb 8, 2026

Summary

  • Add a fused Triton kernel for bilinear position-embedding interpolation that replaces ~25 small eager-mode CUDA kernel launches with a single launch (a sketch of the idea follows this list).
  • Add a position-embedding cache with pre-warming for the top 100 most common grid configurations (~71% MLPerf VLM dataset coverage, ~0.9 GB at BF16).
  • The cache is bounded to the pre-defined warmup set to prevent unbounded memory growth; cache misses are computed on the fly and are not inserted into the cache.
  • Pre-warming runs automatically after model weight loading.
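
The sketch below illustrates the fused-kernel idea only: one Triton program per destination grid cell gathers the four neighboring source embeddings and blends them, so the whole interpolation is a single launch. This is not the PR's actual kernel; all names (`_bilinear_pos_embed_kernel`, `interpolate_pos_embed`, `BLOCK_D`) and the align-corners mapping are assumptions for illustration.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _bilinear_pos_embed_kernel(
    src_ptr, dst_ptr,            # contiguous (H_src, W_src, D) and (H_dst, W_dst, D)
    H_SRC, W_SRC, W_DST, DIM,
    scale_y, scale_x,            # precomputed (H_src-1)/(H_dst-1), (W_src-1)/(W_dst-1)
    BLOCK_D: tl.constexpr,
):
    # One program per destination grid cell; loop over the embedding dim in blocks.
    pid = tl.program_id(0)
    y = pid // W_DST
    x = pid % W_DST
    # Map the destination cell into source coordinates (align-corners style).
    sy = y.to(tl.float32) * scale_y
    sx = x.to(tl.float32) * scale_x
    y0 = sy.to(tl.int32)                      # truncation == floor for sy >= 0
    x0 = sx.to(tl.int32)
    y1 = tl.minimum(y0 + 1, H_SRC - 1)
    x1 = tl.minimum(x0 + 1, W_SRC - 1)
    wy = sy - y0.to(tl.float32)
    wx = sx - x0.to(tl.float32)
    offs = tl.arange(0, BLOCK_D)
    for d in range(0, DIM, BLOCK_D):
        mask = d + offs < DIM
        # Gather the four neighboring source embeddings and blend in fp32.
        p00 = tl.load(src_ptr + (y0 * W_SRC + x0) * DIM + d + offs, mask=mask, other=0.0).to(tl.float32)
        p01 = tl.load(src_ptr + (y0 * W_SRC + x1) * DIM + d + offs, mask=mask, other=0.0).to(tl.float32)
        p10 = tl.load(src_ptr + (y1 * W_SRC + x0) * DIM + d + offs, mask=mask, other=0.0).to(tl.float32)
        p11 = tl.load(src_ptr + (y1 * W_SRC + x1) * DIM + d + offs, mask=mask, other=0.0).to(tl.float32)
        top = p00 + (p01 - p00) * wx
        bot = p10 + (p11 - p10) * wx
        out = top + (bot - top) * wy
        tl.store(dst_ptr + (y * W_DST + x) * DIM + d + offs,
                 out.to(dst_ptr.dtype.element_ty), mask=mask)


def interpolate_pos_embed(src: torch.Tensor, h_dst: int, w_dst: int) -> torch.Tensor:
    """Resize a (H_src, W_src, D) position-embedding grid to (h_dst, w_dst, D)."""
    h_src, w_src, dim = src.shape
    dst = torch.empty((h_dst, w_dst, dim), device=src.device, dtype=src.dtype)
    scale_y = (h_src - 1) / max(h_dst - 1, 1)
    scale_x = (w_src - 1) / max(w_dst - 1, 1)
    _bilinear_pos_embed_kernel[(h_dst * w_dst,)](
        src, dst, h_src, w_src, w_dst, dim, scale_y, scale_x, BLOCK_D=128,
    )
    return dst
```

Fusing the gather and blend into one launch is what removes the ~25 tiny eager-mode kernels mentioned in the first bullet, since eager bilinear interpolation otherwise decomposes into many indexing, multiply, and add ops.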

The caching mechanism is borrowed from @b-mu's PR #33. A minimal sketch of the bounded-cache behavior follows.
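
As a rough illustration of that mechanism, here is a sketch of a bounded, pre-warmed cache whose misses are served without insertion. The class and method names are hypothetical; only the `VLLM_POS_EMBED_CACHE_SIZE` variable name comes from this PR.

```python
import os
import torch


class PosEmbedCache:
    """Bounded cache keyed by (grid_h, grid_w); misses are never inserted."""

    def __init__(self, max_entries: int | None = None):
        if max_entries is None:
            max_entries = int(os.getenv("VLLM_POS_EMBED_CACHE_SIZE", "100"))
        self._max_entries = max_entries
        self._cache: dict[tuple[int, int], torch.Tensor] = {}

    def prewarm(self, grids, compute):
        # Runs once after model weight loading; bounded by max_entries.
        for h, w in list(grids)[: self._max_entries]:
            self._cache[(h, w)] = compute(h, w)

    def get(self, h: int, w: int, compute) -> torch.Tensor:
        # Warmup-set hits are returned from memory; any other grid is computed
        # on the fly and deliberately not cached, keeping memory bounded.
        hit = self._cache.get((h, w))
        return hit if hit is not None else compute(h, w)
```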

How to enable

The Triton kernel is off by default. To enable:

```bash
export VLLM_USE_TRITON_POS_EMBED=1
```

The position-embedding cache size can be controlled with:

```bash
# Defaults to 100
export VLLM_POS_EMBED_CACHE_SIZE=0
```

Files changed

  • vllm/envs.py -- register VLLM_USE_TRITON_POS_EMBED (a sketch of the registration pattern follows this list)
  • vllm/model_executor/models/qwen3_vl.py -- Triton kernel, cache infrastructure, and the warmup grid list
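
For context, vLLM's envs.py registers environment variables as lazy getters in a module-level dict. The new entries would look roughly like the sketch below; the exact defaults and parsing in the PR may differ.

```python
import os

# Hypothetical sketch of the new vllm/envs.py entries, following vLLM's
# pattern of mapping variable names to lazy getter lambdas.
environment_variables = {
    "VLLM_USE_TRITON_POS_EMBED":
        lambda: bool(int(os.getenv("VLLM_USE_TRITON_POS_EMBED", "0"))),
    "VLLM_POS_EMBED_CACHE_SIZE":
        lambda: int(os.getenv("VLLM_POS_EMBED_CACHE_SIZE", "100")),
}
```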

Please review after the upcoming FP8 attention PR is posted; validation will be run for the two PRs together.

Signed-off-by: Zhanda <zhandazhu@gmail.com>
@wangshangsam wangshangsam merged commit 0c41c65 into mlperf-inf-mm-q3vl-v6.0 Feb 8, 2026
3 checks passed