Integrate CUDA sampler into ASR runner and enable skip_copy for decoder #16388

larryliu0820 · 2025-12-24T01:47:00Z

cuda_backend.cpp: Support comma-separated method names in
skip_copy_output_to_cpu_for_method backend option.
runner.cpp: Use CudaSampler for argmax when CUDA is available and
temperature==0. Skip copy to CPU for both encoder and decoder methods.
CMakeLists.txt updates: Link against extension_llm_sampler_cuda library
and include the sampler subdirectory.
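
As an illustration of the comma-separated option described above, here is a minimal sketch of how a method name could be checked against such a value. The helper name `matches_comma_separated` is hypothetical and this is not the code in `cuda_backend.cpp`; it assumes entries are separated by bare commas with no surrounding whitespace.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not the PR's implementation): returns true when
// `method_name` exactly equals one entry of the comma-separated `option_value`.
// Assumption: entries contain no surrounding whitespace.
bool matches_comma_separated(std::string_view option_value,
                             std::string_view method_name) {
  if (option_value.empty() || method_name.empty()) {
    return false;
  }
  std::size_t start = 0;
  while (start <= option_value.size()) {
    std::size_t end = option_value.find(',', start);
    if (end == std::string_view::npos) {
      end = option_value.size();
    }
    // Compare the whole entry so "enc" does not match "encoder".
    if (option_value.substr(start, end - start) == method_name) {
      return true;
    }
    start = end + 1;
  }
  return false;
}
```

With a value such as `"encoder,decoder"`, both `encoder` and `decoder` would match, while a prefix such as `enc` would not.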

This optimization keeps decoder logits on GPU and performs argmax directly
on GPU memory, avoiding unnecessary device-to-host copies in the decode loop.
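
For reference, with temperature==0 sampling is greedy, so choosing the next token reduces to an argmax over the logits. The sketch below is a plain CPU version of that reduction, written only to show what is being computed; the function name `greedy_argmax` is hypothetical and it does not reflect the CudaSampler API or its kernel.

```cpp
#include <cstddef>
#include <vector>

// CPU reference only (hypothetical name): index of the largest logit.
// The first index wins on ties; an empty input returns 0.
std::size_t greedy_argmax(const std::vector<float>& logits) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) {
      best = i;
    }
  }
  return best;
}
```

When the same reduction runs on the GPU, presumably only the selected token index has to reach the host instead of the full logits tensor.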

[ghstack-poisoned]

larryliu0820 · 2025-12-24T01:47:01Z

Stack from ghstack (oldest at bottom):

- cuda_backend.cpp: Support comma-separated method names in skip_copy_output_to_cpu_for_method backend option.
- runner.cpp: Use CudaSampler for argmax when CUDA is available and temperature==0. Skip copy to CPU for both encoder and decoder methods.
- CMakeLists.txt updates: Link against extension_llm_sampler_cuda library and include the sampler subdirectory.

This optimization keeps decoder logits on GPU and performs argmax directly on GPU memory, avoiding unnecessary device-to-host copies in the decode loop.

ghstack-source-id: e3e85c9
ghstack-comment-id: 3688378475
Pull-Request: #16388

pytorch-bot · 2025-12-24T01:47:04Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16388

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 14 New Failures, 2 Unrelated Failures

As of commit 1637ec1 with merge base c5d66a5:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for extension/llm/runner/CMakeLists.txt:
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
RuntimeError: Command docker exec -t 4466472f45a0b195db5e372715a237cc95d589061175d6bb474a95673f9bbac0 /exec failed with exit code 139
Test CUDA Builds / test-cuda-shims / linux-job (gh)
RuntimeError: Command docker exec -t ac2acada201032fcfa28a448f30239f80f8d56cd6f444e4b6f1fa8954029d86a /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (google, gemma-3-4b-it, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t 42cc3d4baf40e21817fc34e2c4b7de2c28d76dcdbef427a9d5066df17a7d7683 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (google, gemma-3-4b-it, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 98ce809db8f343dce7bfe1f3435d7a3fd78d61988073ddfb2ccd28f56adf7224 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (mistralai, Voxtral-Mini-3B-2507, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t 159d6f51c05833debfef29c964c94eeb9872bdddb6867993b71ca52171868ad1 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (mistralai, Voxtral-Mini-3B-2507, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t dcfa35d3dff75b5e3f1516adf19577b960b2324a59e0ef526b2f078519b4d97c /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t a5bb4a490b58993bed9eb0b97b087ef390fb8b8391528e4ac0abb7c718b6bdac /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t 51541149242cbad68000b1caf53b1327838776ba66a66650d7f388f842775a4c /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 5815100b569dbe333addcd54e15cf3dbf583ffd277de780fae5e0698b777dd74 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t 126e9de5984bcb7991384a27156e517f945f9ab362f5373e408939e37d0458bd /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t 8d165dbedd1e24b5fe4567656d62d904dd6dd3cec3b486bc0f15681aebf66798 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t be9f5859fd05da90e00768e8dc94de3afb693a0af847346062ca11cb5ccacca2 /exec failed with exit code 2
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t 81e642c9af3e56071f8c2ead047550a96b163d4da2c750b86f7dcb2dcf63df81 /exec failed with exit code 2

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-samsung-models-linux / linux-job (gh) (trunk failure)
##[error]The operation was canceled.

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / android / run-emulator (gh) (#16137)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Gasoonjia · 2025-12-25T06:49:35Z

backends/cuda/runtime/cuda_backend.cpp

    std::lock_guard<std::mutex> guard(skip_copy_method_mutex_);
-    return method_name == skip_copy_method_;
+    // Support comma-separated list of method names
+    if (skip_copy_method_.empty()) {


Hmm, can we make `skip_copy_method_` an array? Maybe we can avoid the fancy comparison here for supporting multiple method names.
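
One way to read this suggestion, sketched under assumptions: split the option value once when it is set and store the entries in a container, so the per-call check becomes a plain membership test. The class name `SkipCopyMethods` and its methods are hypothetical, and the sketch uses `std::unordered_set` where the comment says "array".

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical container (not the backend's actual code): parses the
// comma-separated option once and answers lookups by set membership.
class SkipCopyMethods {
 public:
  // Replace the stored names with the entries of `value`; empty entries
  // (for example from a trailing comma) are ignored.
  void set_from_option(const std::string& value) {
    std::unordered_set<std::string> parsed;
    std::size_t start = 0;
    while (start <= value.size()) {
      std::size_t end = value.find(',', start);
      if (end == std::string::npos) {
        end = value.size();
      }
      if (end > start) {
        parsed.insert(value.substr(start, end - start));
      }
      start = end + 1;
    }
    std::lock_guard<std::mutex> guard(mutex_);
    methods_ = std::move(parsed);
  }

  // True when `method_name` was listed in the most recent option value.
  bool contains(const std::string& method_name) const {
    std::lock_guard<std::mutex> guard(mutex_);
    return methods_.count(method_name) > 0;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_set<std::string> methods_;
};
```

Parsing at set time moves the string handling out of the lookup path, at the cost of holding a copy of each name.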

Update

1637ec1

[ghstack-poisoned]

larryliu0820 requested a review from kirklandsign as a code owner December 24, 2025 01:47

This was referenced Dec 24, 2025

Add CUDA argmax kernel for LLM sampler #16386

Open

Add CudaSampler class for GPU-based token sampling #16387

Open

meta-cla bot added the CLA Signed label Dec 24, 2025

Gasoonjia approved these changes Dec 25, 2025
