Conversation

@PointKernel (Member) commented Dec 5, 2025

This PR switches from cudaMemcpyAsync to cudaMemcpyBatchAsync to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.
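
For reference, here is a minimal sketch of how a single copy can be routed through the batch API. This is not the exact code in this PR; the `cudaMemcpyAttributes` field and enum names are assumptions based on my reading of the CUDA 12.8 runtime documentation and should be checked against the headers.

```cpp
// Sketch only: route a single copy through cudaMemcpyBatchAsync (CUDA >= 12.8).
// Attribute and enum names below are assumptions from the CUDA 12.8 runtime docs.
#include <cuda_runtime_api.h>

inline cudaError_t single_copy_via_batch(void* dst,
                                         void const* src,
                                         size_t count,
                                         cudaStream_t stream)
{
  cudaMemcpyAttributes attrs{};
  attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // source obeys stream ordering

  void* dsts[]       = {dst};
  void* srcs[]       = {const_cast<void*>(src)};
  size_t sizes[]     = {count};
  size_t attr_idxs[] = {0};  // attribute 0 applies from the first copy onward
  size_t fail_idx    = 0;    // receives the index of the first failing copy, if any

  return cudaMemcpyBatchAsync(dsts, srcs, sizes, 1, &attrs, attr_idxs, 1, &fail_idx, stream);
}
```

Note that, unlike cudaMemcpyAsync, the batch API takes no cudaMemcpyKind; as far as I can tell, the copy direction is inferred from the pointers, with optional location hints in the attributes.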

@PointKernel PointKernel added helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue labels Dec 5, 2025
@PointKernel PointKernel self-assigned this Dec 5, 2025
@PointKernel PointKernel added the Needs Review Awaiting reviews before merging label Dec 6, 2025

#if CUDART_VERSION >= 12080
  if (stream.get() == 0) {
    CUCO_CUDA_TRY(cudaMemcpyAsync(dst, src, count, kind, stream.get()));
Contributor

If you make this return the cudaError_t and do the CUCO_CUDA_TRY at the call site, I think it may be easier to track down where errors come from with file/line?
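
Roughly the shape being suggested (names here are illustrative, not the actual cuco helpers):

```cpp
// Illustrative only (names are made up): the helper returns the raw cudaError_t
// so that CUCO_CUDA_TRY at each call site reports that caller's file and line.
#include <cuda_runtime_api.h>

inline cudaError_t copy_async_impl(
  void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
{
  return cudaMemcpyAsync(dst, src, count, kind, stream);
}

// At the call site:
//   CUCO_CUDA_TRY(copy_async_impl(dst, src, count, kind, stream.get()));
```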

Member Author

Good point. Updated.

 * @param kind Memory copy direction
 * @param stream CUDA stream for the operation
 */
inline void memcpy_async(
Contributor

I’d like to see if we can align on the exact semantics of this between cuCo and libcudf. My libcudf PR has a few subtle differences that we should iron out. I’ll study this PR and tweak mine accordingly but we may have more to share and learn.

For example, should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”? That way we have the ability to do a batched copy if there is a use case in cuCo, and it matches the libcudf design.
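
To make that concrete, one possible shape of the layering (hypothetical signatures, not the actual libcudf or cuco API):

```cpp
#include <cuda_runtime_api.h>

// Hypothetical layering (not the actual libcudf or cuco API): the batched entry
// point is the primitive, and the single-copy overload forwards a batch of one.
inline void memcpy_batch_async(
  void** dsts, void** srcs, size_t* sizes, size_t count, cudaStream_t stream)
{
  // Simplest possible body for illustration; the real implementation would call
  // cudaMemcpyBatchAsync on CUDA >= 12.8 and check errors.
  for (size_t i = 0; i < count; ++i) {
    cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
  }
}

inline void memcpy_async(void* dst, void const* src, size_t size, cudaStream_t stream)
{
  void* srcs[] = {const_cast<void*>(src)};
  memcpy_batch_async(&dst, srcs, &size, 1, stream);
}
```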

Member Author

> I’d like to see if we can align on the exact semantics of this between cuCo and libcudf.

Agreed! Feel free to drop comments on this PR anytime you spot something that needs fixing or improving.

> should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”?

Great question! I actually noticed this difference between your approach and the current PR, and I intentionally hid the fact that we’re using batch async memcpy under the hood. Since cuco doesn’t have any real use cases for batch memcpy, we’re only relying on it as a workaround for the driver locking issue in the legacy async memcpy. I felt it was cleaner to keep that detail internal. From a user’s perspective, it’s just an implementation detail they don’t need to deal with.

In libcudf, though, batch async memcpy is used more broadly IIRC, so exposing both the legacy and batch variants there makes a lot more sense.

@sleeepyjack (Collaborator) left a comment

LGTM
Could you share some details offline on why this fix is needed?

@PointKernel (Member Author)

> LGTM
> Could you share some details offline on why this fix is needed?

I’ve expanded the PR description with more context on the issue. In short, cudaMemcpyAsync incurs a costly driver lock when used across multiple streams, which leads to a noticeable performance hit. Switching to the new batch async API removes this locking behavior and resolves the regression.

@PointKernel PointKernel changed the title Repalce cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking Dec 15, 2025