Conversation

@PointKernel (Member) commented Dec 5, 2025

This PR switches from cudaMemcpyAsync to cudaMemcpyBatchAsync to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.
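
For reference, here is a minimal sketch of how a single copy can be routed through the batch API. This is not the exact code in this PR; the `cudaMemcpyAttributes` field and enum names are assumptions based on my reading of the CUDA 12.8 runtime documentation and should be checked against the headers.

```cpp
// Sketch only: route a single copy through cudaMemcpyBatchAsync (CUDA >= 12.8).
// Attribute and enum names below are assumptions from the CUDA 12.8 runtime docs.
#include <cuda_runtime_api.h>

inline cudaError_t single_copy_via_batch(void* dst,
                                         void const* src,
                                         size_t count,
                                         cudaStream_t stream)
{
  cudaMemcpyAttributes attrs{};
  attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // source obeys stream ordering

  void* dsts[]       = {dst};
  void* srcs[]       = {const_cast<void*>(src)};
  size_t sizes[]     = {count};
  size_t attr_idxs[] = {0};  // attribute 0 applies from the first copy onward
  size_t fail_idx    = 0;    // receives the index of the first failing copy, if any

  return cudaMemcpyBatchAsync(dsts, srcs, sizes, 1, &attrs, attr_idxs, 1, &fail_idx, stream);
}
```

Note that, unlike cudaMemcpyAsync, the batch API takes no cudaMemcpyKind; as far as I can tell, the copy direction is inferred from the pointers, with optional location hints in the attributes.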

@PointKernel PointKernel added helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue labels Dec 5, 2025
@PointKernel PointKernel self-assigned this Dec 5, 2025
@PointKernel PointKernel added the Needs Review Awaiting reviews before merging label Dec 6, 2025

#if CUDART_VERSION >= 12080
  if (stream.get() == 0) {
    CUCO_CUDA_TRY(cudaMemcpyAsync(dst, src, count, kind, stream.get()));
Contributor

If you make this return the cudaError_t and do the CUCO_CUDA_TRY at the call site, I think it may be easier to track down where errors come from with file/line?
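
Roughly the shape being suggested (names here are illustrative, not the actual cuco helpers):

```cpp
// Illustrative only (names are made up): the helper returns the raw cudaError_t
// so that CUCO_CUDA_TRY at each call site reports that caller's file and line.
#include <cuda_runtime_api.h>

inline cudaError_t copy_async_impl(
  void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
{
  return cudaMemcpyAsync(dst, src, count, kind, stream);
}

// At the call site:
//   CUCO_CUDA_TRY(copy_async_impl(dst, src, count, kind, stream.get()));
```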

Member Author

Good point. Updated.

 * @param kind Memory copy direction
 * @param stream CUDA stream for the operation
 */
inline void memcpy_async(
Contributor

I’d like to see if we can align on the exact semantics of this between cuCo and libcudf. My libcudf PR has a few subtle differences that we should iron out. I’ll study this PR and tweak mine accordingly but we may have more to share and learn.

For example, should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”? That way we have the ability to do a batched copy if there is a use case in cuCo, and it matches the libcudf design.
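
To make that concrete, one possible shape of the layering (hypothetical signatures, not the actual libcudf or cuco API):

```cpp
#include <cuda_runtime_api.h>

// Hypothetical layering (not the actual libcudf or cuco API): the batched entry
// point is the primitive, and the single-copy overload forwards a batch of one.
inline void memcpy_batch_async(
  void** dsts, void** srcs, size_t* sizes, size_t count, cudaStream_t stream)
{
  // Simplest possible body for illustration; the real implementation would call
  // cudaMemcpyBatchAsync on CUDA >= 12.8 and check errors.
  for (size_t i = 0; i < count; ++i) {
    cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
  }
}

inline void memcpy_async(void* dst, void const* src, size_t size, cudaStream_t stream)
{
  void* srcs[] = {const_cast<void*>(src)};
  memcpy_batch_async(&dst, srcs, &size, 1, stream);
}
```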

Member Author

> I’d like to see if we can align on the exact semantics of this between cuCo and libcudf.

Agreed! Feel free to drop comments on this PR anytime you spot something that needs fixing or improving.

> should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”?

Great question! I actually noticed this difference between your approach and the current PR, and I intentionally hid the fact that we’re using batch async memcpy under the hood. Since cuco doesn’t have any real use cases for batch memcpy, we’re only relying on it as a workaround for the driver locking issue in the legacy async memcpy. I felt it was cleaner to keep that detail internal. From a user’s perspective, it’s just an implementation detail they don’t need to deal with.

In libcudf, though, batch async memcpy is used more broadly IIRC, so exposing both the legacy and batch variants there makes a lot more sense.

@sleeepyjack (Collaborator) left a comment

LGTM
Could you share some details offline on why this fix is needed?

@PointKernel (Member Author)

> LGTM
> Could you share some details offline on why this fix is needed?

I’ve expanded the PR description with more context on the issue. In short, cudaMemcpyAsync incurs a costly driver lock when used across multiple streams, which leads to a noticeable performance hit. Switching to the new batch async API removes this locking behavior and resolves the regression.

@PointKernel PointKernel changed the title Repalce cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking Dec 15, 2025