Move torch.cond predicate non-persistent buffer to CPU #16378

larryliu0820 · 2025-12-23T20:28:16Z

Avoid device-to-host memory copies when evaluating `torch.cond` predicates.

When a GPU buffer (e.g., a KV cache `initialized` flag) is used as a predicate for `torch.cond`, the runtime must synchronize and copy the predicate value from GPU to CPU on every forward pass to evaluate the condition. This adds latency and synchronization overhead.
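
The cost is easy to see in a toy model. The sketch below uses plain Python rather than PyTorch; `Buffer`, `ToyRuntime`, and `count_copies` are hypothetical names invented for illustration, and the counter only mimics the point where a real runtime would synchronize and copy.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical stand-in for a scalar tensor and the device it lives on."""
    value: bool
    device: str  # "cuda" or "cpu"


class ToyRuntime:
    """Toy host-side runtime that counts device-to-host (D2H) copies."""

    def __init__(self) -> None:
        self.d2h_copies = 0

    def read_predicate(self, pred: Buffer) -> bool:
        # The host picks the branch, so a predicate stored on the GPU has to
        # be synchronized and copied to the host before it can be inspected.
        if pred.device != "cpu":
            self.d2h_copies += 1
        return pred.value

    def forward(self, pred: Buffer, x: int) -> int:
        # Models torch.cond(pred, true_fn, false_fn, ...) at a high level.
        return x + 1 if self.read_predicate(pred) else x - 1


def count_copies(device: str, steps: int) -> int:
    """Run `steps` forward passes and report how many D2H copies occurred."""
    runtime = ToyRuntime()
    pred = Buffer(value=True, device=device)
    for step in range(steps):
        runtime.forward(pred, step)
    return runtime.d2h_copies
```

In this model, `count_copies("cuda", 100)` performs one copy per forward pass, while `count_copies("cpu", 100)` performs none.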

`MoveCondPredicateToCpuPass` moves non-persistent buffer predicates to CPU at export time, eliminating per-inference device-to-host (D2H) transfers. The predicate is typically a small scalar (e.g., a boolean flag), so keeping it on CPU has negligible memory impact.

- Add `MoveCondPredicateToCpuPass` in `backends/cuda/passes/`
- Add unit tests covering:
  - GPU buffer predicates moved to CPU
  - CPU buffer predicates unchanged
  - Computed predicates unaffected
  - Multiple `torch.cond` calls
  - Cross-attention cache pattern
  - Persistent buffers (state_dict) not moved
- Add Python tests to `unittest-cuda` CI job in `cuda.yml`
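
The selection rule implied by these test cases can be sketched without PyTorch. This is a toy model, not the code in `backends/cuda/passes/`: `BufferInfo`, `ToyProgram`, and `move_cond_predicates_to_cpu` are hypothetical names, and devices are plain strings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BufferInfo:
    """Hypothetical record for one registered buffer."""
    device: str       # "cuda" or "cpu"
    persistent: bool  # True if the buffer is part of the state_dict


@dataclass
class ToyProgram:
    """Hypothetical exported program: its buffers and its cond predicates."""
    buffers: Dict[str, BufferInfo]
    # One entry per torch.cond call; a name missing from `buffers` stands
    # for a predicate computed inside the graph.
    cond_predicates: List[str]


def move_cond_predicates_to_cpu(program: ToyProgram) -> List[str]:
    """Move eligible predicate buffers to CPU; return the names that moved."""
    moved = []
    for name in program.cond_predicates:
        info = program.buffers.get(name)
        if info is None:
            continue  # computed predicate: unaffected
        if info.persistent:
            continue  # persistent (state_dict) buffer: not moved
        if info.device == "cpu":
            continue  # already on CPU (or moved by an earlier cond call)
        info.device = "cpu"
        moved.append(name)
    return moved


program = ToyProgram(
    buffers={
        "initialized": BufferInfo(device="cuda", persistent=False),
        "cpu_flag": BufferInfo(device="cpu", persistent=False),
        "saved_flag": BufferInfo(device="cuda", persistent=True),
        "k_cache": BufferInfo(device="cuda", persistent=False),
    },
    # "x_sum_gt_0" is a computed predicate; "initialized" is used by two conds.
    cond_predicates=["initialized", "cpu_flag", "saved_flag", "x_sum_gt_0", "initialized"],
)
moved = move_cond_predicates_to_cpu(program)
```

Here only `initialized` moves. `k_cache` stays on the GPU because it is never used as a predicate, and the second `torch.cond` call that shares `initialized` finds it already on CPU.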

[ghstack-poisoned]

larryliu0820 · 2025-12-23T20:28:16Z

Stack from ghstack (oldest at bottom):

-> Move torch.cond predicate non-persistent buffer to CPU #16378

pytorch-bot · 2025-12-23T20:28:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16378

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7e0e692 with merge base c5d66a5:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / android / run-emulator (gh) (#16137)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Avoid device-to-host memory copies when evaluating `torch.cond` predicates.

When a GPU buffer (e.g., a KV cache `initialized` flag) is used as a predicate for `torch.cond`, the runtime must synchronize and copy the predicate value from GPU to CPU on every forward pass to evaluate the condition. This adds latency and synchronization overhead.

`MoveCondPredicateToCpuPass` moves non-persistent buffer predicates to CPU at export time, eliminating per-inference D2H transfers. The predicate is typically a small scalar (e.g., a boolean flag), so keeping it on CPU has negligible memory impact.

- Add `MoveCondPredicateToCpuPass` in `backends/cuda/passes/`
- Add unit tests covering:
  - GPU buffer predicates moved to CPU
  - CPU buffer predicates unchanged
  - Computed predicates unaffected
  - Multiple `torch.cond` calls
  - Cross-attention cache pattern
  - Persistent buffers (state_dict) not moved
- Add Python tests to `unittest-cuda` CI job in `cuda.yml`

ghstack-source-id: ff22758
ghstack-comment-id: 3687889864
Pull-Request: #16378

Gasoonjia

gogogo!

.github/workflows/cuda.yml

backends/cuda/passes/move_cond_predicate_to_cpu.py

[ghstack-poisoned]

Later pushes repeat the commit message above verbatim, changing only the ghstack-source-id: 8d724ef, 4714546, d813c68, efe08be, 58e9268, b439eb3.

larryliu0820 added 21 commits December 19, 2025 11:21

- 63a2766 Update [ghstack-poisoned]
- f02dbe1 Update [ghstack-poisoned]
- 9a7aa91 Update [ghstack-poisoned]
- bc07a7b Update [ghstack-poisoned]
- a97933b Update [ghstack-poisoned]
- 99ca698 Update [ghstack-poisoned]
- e1bb6c2 Update [ghstack-poisoned]
- 395ab4f Update [ghstack-poisoned]
- 2a7a9f0 Update [ghstack-poisoned]
- a86ab6e Update [ghstack-poisoned]
- ca3ac6d Update [ghstack-poisoned]
- 8b94087 Update [ghstack-poisoned]
- 5f755f9 Update [ghstack-poisoned]
- 690546b Update [ghstack-poisoned]
- 73efe12 Update [ghstack-poisoned]
- d96dec8 Update [ghstack-poisoned]
- eb6a7e6 Update [ghstack-poisoned]
- d5c53ec Update [ghstack-poisoned]
- 8b8580d Update [ghstack-poisoned]
- ba6fdff Update [ghstack-poisoned]
- b103b7f Update [ghstack-poisoned]

larryliu0820 requested review from JacobSzwejbka, SS-JIA, cccclai, digantdesai, lucylq and mergennachin as code owners December 23, 2025 20:28

meta-cla bot added the CLA Signed label Dec 23, 2025

larryliu0820 mentioned this pull request Dec 23, 2025

Custom op to update cache for torch.cond #16366

Merged

Gasoonjia approved these changes Dec 23, 2025

Review threads:

- .github/workflows/cuda.yml (outdated, resolved)
- backends/cuda/passes/move_cond_predicate_to_cpu.py (resolved)
- backends/cuda/passes/move_cond_predicate_to_cpu.py (resolved)

larryliu0820 added 2 commits December 23, 2025 23:41

- a8b20f5 Update [ghstack-poisoned]
- 016adb3 Update [ghstack-poisoned]

Base automatically changed from gh/larryliu0820/85/head to main December 24, 2025 00:41

- 3fc3117 Update [ghstack-poisoned]

larryliu0820 added the "release notes: desktop" label Dec 24, 2025

- e8349a7 Update [ghstack-poisoned]
- 5897ba4 Update [ghstack-poisoned]
- ab861b9 Update [ghstack-poisoned]
- 7e0e692 Update [ghstack-poisoned]

larryliu0820 merged commit 40a18e7 into main Dec 25, 2025
163 of 164 checks passed

larryliu0820 deleted the gh/larryliu0820/86/head branch December 25, 2025 22:18
