
Enable torch.compile and CUDA graphs for vLLM inference #4

Draft
Lucaskabela wants to merge 1 commit into lucaskabela/compile_logic_changes from lucaskabela/compile_infra

Conversation


@Lucaskabela Lucaskabela commented Feb 18, 2026

Summary

We now enable using support_torch_compile in the vLLM wrapper definition in order to improve our end-to-end training runtime.

The particular changes in this PR are:

  • Add build_compilation_config() to parallelism_utils with TP-aware cudagraph_mode selection (piecewise for TP > 1, full_and_piecewise otherwise). Note: this form of TP cannot be used with full CUDA graphs
  • Add @support_torch_compile decorator to TorchTitanVLLMModelWrapper
  • Add vllm_compile_and_cudagraph flag to Generator and VLLMRolloutEngine in both unified and vllm_compat paths
  • Add --disable-vllm-compile-and-cudagraph CLI arg to infer.py
  • Wire compilation_config and enforce_eager through all LLM instantiation sites
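The TP-aware cudagraph_mode selection described in the first bullet can be sketched as follows. This is a minimal sketch: the helper names (`get_cudagraph_mode`, `build_compilation_config`) come from this PR's description, but the mode strings and the dict-based return value are illustrative assumptions, not the actual vLLM CompilationConfig API.

```python
def get_cudagraph_mode(tp_size: int) -> str:
    # Full CUDA graphs are incompatible with this form of TP, so fall back
    # to piecewise capture when tp_size > 1 (per the PR description).
    return "piecewise" if tp_size > 1 else "full_and_piecewise"


def build_compilation_config(cudagraph_mode: str) -> dict:
    # Returned as a plain dict here for illustration; the real helper
    # presumably constructs a vLLM compilation config object and this is
    # where other compilation knobs would be wired in.
    return {"cudagraph_mode": cudagraph_mode}


# Callers derive the mode from tp_size and pass the string through:
config = build_compilation_config(get_cudagraph_mode(tp_size=1))
```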

Test Plan

Execute

VLLM_BATCH_INVARIANT=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN python3 torchtitan/experiments/rl/unified/simple_rl_multiprocess.py

Baseline (losses/rewards)

| Step | Loss |
|---|---|
| 0 | -0.0196 |
| 1 | 7.4252 |
| 2 | 3.4577 |
| 3 | 1.2154 |
| 4 | 0.4386 |
| 5 | 0.2471 |
| 6 | 0.2507 |
| 7 | 0.2722 |
| 8 | 0.2140 |
| 9 | 0.1330 |

This PR (losses/rewards)

| Step | Loss |
|---|---|
| 0 | -0.0196 |
| 1 | 7.4252 |
| 2 | 3.4577 |
| 3 | 1.2154 |
| 4 | 0.4386 |
| 5 | 0.2471 |
| 6 | 0.2507 |
| 7 | 0.2722 |
| 8 | 0.2140 |
| 9 | 0.1330 |

These match exactly, showing that enabling compilation preserves training stability.
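The exact match can be checked mechanically from the two loss columns above (with VLLM_BATCH_INVARIANT=1 set, the runs should agree bit-for-bit at this print precision):

```python
# Per-step losses copied from the baseline and this-PR tables above.
baseline = [-0.0196, 7.4252, 3.4577, 1.2154, 0.4386,
            0.2471, 0.2507, 0.2722, 0.2140, 0.1330]
this_pr = [-0.0196, 7.4252, 3.4577, 1.2154, 0.4386,
           0.2471, 0.2507, 0.2722, 0.2140, 0.1330]

# Identical losses at every step => compilation did not perturb training.
assert baseline == this_pr
```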

Metric comparison

On top of pytorch#2398 we get the following metrics:

| Metric | Baseline | % | After Changes | % |
|---|---|---|---|---|
| Total wall-clock | 135.15s | | 125.34s | |
| Cumul. rollout | 9.45s | 7.0% | 2.60s | 2.1% |
| Cumul. train | 56.23s | 41.6% | 46.56s | 37.1% |
| Cumul. optimizer | 0.31s | 0.2% | 0.30s | 0.2% |
| Cumul. weight_sync | 69.47s | 51.4% | 76.18s | 60.8% |
| Peak mem (rollout) | 4.09 GiB | | 4.12 GiB | |
| Peak mem (train) | 9.58 GiB | | 9.58 GiB | |
| Peak mem (optimizer) | 8.44 GiB | | 8.44 GiB | |

Of note: runtime improves significantly, with cumulative rollout time dropping from 9.45s to 2.60s (roughly 3.6x). There is no significant change in memory usage.

Authored with Claude

- Add build_compilation_config() and get_cudagraph_mode() to
  parallelism_utils; callers derive cudagraph_mode from tp_size via
  the helper and pass the string to build_compilation_config
- Add @support_torch_compile decorator to TorchTitanVLLMModelWrapper
- Add vllm_compile_and_cudagraph flag to Generator and VLLMRolloutEngine
  in both unified and vllm_compat paths
- Add --disable-vllm-compile-and-cudagraph CLI arg to infer.py
- Wire compilation_config and enforce_eager through all LLM instantiation sites
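The CLI wiring for the new flag can be sketched as below. The flag name matches the PR; the default behavior (compile and CUDA graphs enabled unless explicitly disabled) and the mapping onto enforce_eager are assumptions about how infer.py consumes the flag, not code from the PR itself.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--disable-vllm-compile-and-cudagraph",
    action="store_true",
    help="Disable torch.compile and CUDA graph capture for vLLM (run eagerly).",
)

# Simulate invoking infer.py with the flag set.
args = parser.parse_args(["--disable-vllm-compile-and-cudagraph"])

# The boolean flag threaded through Generator / VLLMRolloutEngine; when
# compilation is off, eager execution is enforced at LLM instantiation sites.
vllm_compile_and_cudagraph = not args.disable_vllm_compile_and_cudagraph
enforce_eager = not vllm_compile_and_cudagraph
```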