Enable ViT torch.compile + CUDA Graph #33

b-mu · 2026-01-30T06:18:10Z

Purpose

After integrating high-performance kernels for ViT attention, kernel launch overhead became a visible cost. To reduce it, this PR adds two features:

torch.compile(): fuses native kernels, e.g. layernorm and elementwise ops.
CUDA graph for the ViT: we sort images by size in ascending order and greedily pack as many images as possible within the largest token budget, then check whether a smaller budget would also fit, to avoid wasted padding. We pad cu_seqlens so that a captured graph can serve batches with a varying number of images.
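The batching steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `pack_images`, `pad_cu_seqlens`, and their parameters are hypothetical names, images larger than the largest budget simply raise an error, and cu_seqlens is padded by repeating the final offset so that unused slots become zero-length sequences (the actual padding scheme may differ).

```python
from typing import List, Tuple


def pack_images(
    token_counts: List[int], budgets: List[int], max_images: int
) -> List[Tuple[List[int], int]]:
    """Greedy budget-based batching (hypothetical sketch).

    Returns a list of (image_indices, chosen_budget) pairs.
    """
    budgets = sorted(budgets)
    largest = budgets[-1]
    # Visit images from smallest to largest token count.
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i])

    batches: List[Tuple[List[int], int]] = []
    current: List[int] = []
    used = 0

    def close_batch() -> None:
        # Pick the smallest budget that still fits, to limit padding waste.
        budget = next(b for b in budgets if b >= used)
        batches.append((list(current), budget))

    for i in order:
        n = token_counts[i]
        if n > largest:
            # Assumption: oversized images are handled elsewhere.
            raise ValueError(f"image {i} needs {n} tokens, above {largest}")
        if current and (used + n > largest or len(current) == max_images):
            close_batch()
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        close_batch()
    return batches


def pad_cu_seqlens(token_counts: List[int], max_images: int) -> List[int]:
    """Build cumulative sequence offsets with a fixed length of max_images + 1.

    Assumption: unused slots repeat the last offset (zero-length sequences).
    """
    cu = [0]
    for n in token_counts:
        cu.append(cu[-1] + n)
    cu.extend([cu[-1]] * (max_images + 1 - len(cu)))
    return cu
```

With the budgets from the test plan, four images of 1000, 300, 5000, and 700 tokens total 7000 tokens, so they fit in one batch and the 8192 budget is selected over 13824.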

We also warm up an embedding cache for the ViT so that embeddings for commonly seen grid sizes are reused, reducing the embedding computation done on the fly.
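As a rough illustration of the warmup idea, the sketch below precomputes embeddings for a fixed set of grids and falls back to computing on demand for anything else. The class name `GridEmbeddingCache`, its methods, and the stand-in compute function are hypothetical; following the commit note about caching only the specified common embeddings, this sketch assumes that on-the-fly results are not stored.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Grid = Tuple[int, int, int]  # (t, h, w), mirroring grid_thw


class GridEmbeddingCache:
    """Cache of per-grid embeddings, filled only during warmup (sketch)."""

    def __init__(self, compute_fn: Callable[[Grid], List[float]]) -> None:
        self._compute = compute_fn
        self._cache: Dict[Grid, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def warmup(self, grids: Iterable[Grid]) -> None:
        # Precompute embeddings for the grids expected to be common.
        for grid in grids:
            self._cache[grid] = self._compute(grid)

    def get(self, grid: Grid) -> List[float]:
        if grid in self._cache:
            self.hits += 1
            return self._cache[grid]
        # Assumption: uncommon grids are computed each time and not cached.
        self.misses += 1
        return self._compute(grid)


def fake_embed(grid: Grid) -> List[float]:
    """Stand-in for the real position/rotary embedding computation."""
    t, h, w = grid
    return [float(t * h * w)]


cache = GridEmbeddingCache(fake_embed)
cache.warmup([(1, 32, 32), (1, 64, 64)])
common = cache.get((1, 32, 32))    # served from the warmed-up cache
uncommon = cache.get((1, 48, 80))  # computed on the fly
```

Keeping the cache read-only after warmup bounds its memory use, at the cost of recomputing embeddings for grids outside the warmed-up set.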

Test Plan

Compilation Configs:

    --vllm.cli=--compilation-config='{
      "compile_mm_encoder": true,
      "cudagraph_mm_encoder": true,
      "encoder_cudagraph_verbose": true,
      "encoder_cudagraph_token_budgets": [2048, 4096, 8192, 13824],
      "encoder_cudagraph_max_images_per_batch": 16,
      ...
    }'
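For anyone scripting this configuration, the sketch below builds the same JSON from a Python dict and runs a sanity check on it before launch. Only the keys shown above are included; the elided entries are left out. The helper `indivisible_budgets` is hypothetical, and treating divisibility of each token budget by the max images per batch as a requirement is an assumption drawn from the commit titled "check if token budget is divisible by max bs".

```python
import json
from typing import Any, Dict, List

compilation_config: Dict[str, Any] = {
    "compile_mm_encoder": True,
    "cudagraph_mm_encoder": True,
    "encoder_cudagraph_verbose": True,
    "encoder_cudagraph_token_budgets": [2048, 4096, 8192, 13824],
    "encoder_cudagraph_max_images_per_batch": 16,
}


def indivisible_budgets(cfg: Dict[str, Any]) -> List[int]:
    """Return the token budgets not divisible by the max images per batch."""
    max_images = cfg["encoder_cudagraph_max_images_per_batch"]
    return [
        b for b in cfg["encoder_cudagraph_token_budgets"] if b % max_images != 0
    ]


bad = indivisible_budgets(compilation_config)
if bad:
    raise ValueError(f"budgets not divisible by max images per batch: {bad}")

# JSON string suitable for passing as the value of --compilation-config.
config_json = json.dumps(compilation_config)
```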

Test Result

Tested end-to-end accuracy using the above configuration and the FA4 encoder backend.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Max Hu <hyoung2991@gmail.com>

b-mu self-assigned this Jan 30, 2026

b-mu changed the title ~~WIP: Reduce Gaps between Kernels in ViT~~ WIP: Enable ViT torch.compile + CUDA Graph Jan 30, 2026

b-mu changed the title ~~WIP: Enable ViT torch.compile + CUDA Graph~~ Enable ViT torch.compile + CUDA Graph Feb 1, 2026

b-mu requested review from maxyanghu and wangshangsam February 1, 2026 21:41

maxyanghu and others added 25 commits February 1, 2026 17:22

transfer impl

a997f97

Signed-off-by: Max Hu <hyoung2991@gmail.com>

add compilation configs for mm encoder cudagraph.

b5886e9

add mm encoder cudagraph manager (exact | bucket).

ccbeba9

add capture mm encoder cudagraph option.

062ceea

implement mm encoder cudagraph manager (exact | bucket).

7d70346

precompute pos_embeds, rotary embeddings, cu_seqlens.

e1019e3

use forward_cudagraph() with precomputed tensors.

b31bca9

add more grids.

d0d63e3

add encoder cudagraph manager in v1.

bb32c23

keep assertion.

bcc72a4

distinguish encoder and lm for graph capture range.

35becaa

fix ambiguous tensor.

9a7e47b

add log for grid_thw.

f7af48a

disable assertion for now, but warn.

a516b15

compute embeddings with exact, unpadded grid thw.

3490b83

log vit cudagraph mode.

6aea329

add custom grid config.

4c1f1a0

update custom grid config.

7ca3136

update comment for cudagraph related compilation config.

cf04736

clean up.

545e478

clean up.

a2b474d

update comment for encoder cudagraph.

c3a025f

log encoder cudagraph stats.

722ff9d

get bucket size from config for padding mode.

c981904

eliminate dead code.

b479627

b-mu and others added 28 commits February 5, 2026 21:37

add token budget and max bs configs.

41f6a4f

read budget, max bs from configs.

1d0fd70

capture for token budget, max bs.

23c003b

check num images against max bs.

5c84f2e

update cache hit.

37a0680

pad cu_seqlens.

772dbbb

update logs for batching and padding.

dcef68a

execute budget based batching, greedy pack.

da17210

check if token budget is divisible by max bs.

5b2dd9e

rename var.

c32327d

format.

047e490

clean up grid based batching.

96500f8

format.

c16c9ae

execute budget batching.

c37908a

clean up grid based batching.

9595e20

clean up exact and padded.

49d04f3

clean up grid based batching.

bab71d0

cache specified common embeds only, not only the fly.

444bc6b

remove grid configs and bucket sizes.

3efffeb

clean up legacy compilation configs.

fa4da3c

remove embed buffer tracking for padded.

b3cc204

clean up grid matching.

26d44ff

remove buffer for exact match.

e0d3af8

add graph budgets in log.

26b028d

format.

37869ee

fix cu_seqlen to base on inputs.

258f763

clean up.

7b03baf

fix fi

c87c22f

Signed-off-by: Max Hu <hyoung2991@gmail.com>

zhandaz mentioned this pull request Feb 8, 2026

Add bilinear_pos_embed triton kernel and cache #40

Merged

b-mu closed this Feb 8, 2026
