Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking #777
base: dev
Conversation
```cpp
#if CUDART_VERSION >= 12080
  if (stream.get() == 0) {
    CUCO_CUDA_TRY(cudaMemcpyAsync(dst, src, count, kind, stream.get()));
```
If you make this return the cudaError_t and do the CUCO_CUDA_TRY at the call site, I think it may be easier to track down where errors come from with file/line?
Good point. Updated.
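For illustration, the revised shape could look roughly like the sketch below. `memcpy_async_impl` is a placeholder name, not the PR's actual helper:

```cpp
#include <cuda_runtime_api.h>

// Sketch of the suggested pattern: the helper returns the raw cudaError_t,
// and the caller wraps it in CUCO_CUDA_TRY so the macro reports the
// caller's __FILE__/__LINE__ rather than the helper's.
[[nodiscard]] inline cudaError_t memcpy_async_impl(
  void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
{
  return cudaMemcpyAsync(dst, src, count, kind, stream);
}

// At the call site:
// CUCO_CUDA_TRY(memcpy_async_impl(dst, src, count, kind, stream.get()));
```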
```cpp
 * @param kind Memory copy direction
 * @param stream CUDA stream for the operation
 */
inline void memcpy_async(
```
I’d like to see if we can align on the exact semantics of this between cuCo and libcudf. My libcudf PR has a few subtle differences that we should iron out. I’ll study this PR and tweak mine accordingly, but we may have more to share and learn.
For example, should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”? That way we have the ability to do a batched copy if there is a use case in cuCo, and it matches the libcudf design.
> I’d like to see if we can align on the exact semantics of this between cuCo and libcudf.
Agreed! Feel free to drop comments on this PR anytime you spot something that needs fixing or improving.
> should we write this as “memcpy_batch_async” and call it from another function “memcpy_async”?
Great question! I actually noticed this difference between your approach and the current PR, and I intentionally hid the fact that we’re using batch async memcpy under the hood. Since cuco doesn’t have any real use cases for batch memcpy, we’re only relying on it as a workaround for the driver locking issue in the legacy async memcpy. I felt it was cleaner to keep that detail internal. From a user’s perspective, it’s just an implementation detail they don’t need to deal with.
In libcudf, though, batch async memcpy is used more broadly IIRC, so exposing both the legacy and batch variants there makes a lot more sense.
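For concreteness, the layered design under discussion might look something like the sketch below. Names and the fallback body are hypothetical, not the actual cuco or libcudf code:

```cpp
#include <cuda_runtime_api.h>

// Hypothetical layering: a batched primitive plus a thin single-copy wrapper.
// Exposing memcpy_batch_async keeps true batched copies available if a use
// case appears, while memcpy_async remains the single-copy entry point.
inline cudaError_t memcpy_batch_async(void* const* dsts,
                                      void const* const* srcs,
                                      size_t const* sizes,
                                      size_t num_copies,
                                      cudaStream_t stream)
{
  // Simple fallback body: one async copy per entry. A real implementation
  // could dispatch to cudaMemcpyBatchAsync when CUDART_VERSION >= 12080.
  for (size_t i = 0; i < num_copies; ++i) {
    auto const err = cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
    if (err != cudaSuccess) { return err; }
  }
  return cudaSuccess;
}

inline cudaError_t memcpy_async(void* dst, void const* src, size_t count, cudaStream_t stream)
{
  return memcpy_batch_async(&dst, &src, &count, 1, stream);  // a batch of one
}
```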
sleeepyjack left a comment
LGTM
Could you share some details offline on why this fix is needed?
I’ve expanded the PR description with more context on the issue. In short:

> This PR switches from `cudaMemcpyAsync` to `cudaMemcpyBatchAsync` to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.
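For readers landing here, a rough sketch of the dispatch the description refers to, assuming the CUDA 12.8 `cudaMemcpyBatchAsync` runtime signature and attribute names; the helper itself is illustrative, not the PR's exact code:

```cpp
#include <cuda_runtime_api.h>

// Sketch only: route a single copy through the batch API on CUDA >= 12.8,
// keeping the legacy path for the default stream as the PR does. The batch
// API takes no cudaMemcpyKind; direction is inferred from the pointers.
inline cudaError_t memcpy_async_sketch(
  void* dst, void const* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream)
{
#if CUDART_VERSION >= 12080
  if (stream != nullptr) {  // non-default stream: use the lock-free batch path
    cudaMemcpyAttributes attrs{};
    attrs.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // normal stream ordering
    size_t attrs_idx = 0;  // the single copy uses attrs[0]
    size_t fail_idx  = 0;  // out-param: index of the first failing copy, if any
    return cudaMemcpyBatchAsync(&dst, const_cast<void**>(&src), &count, 1,
                                &attrs, &attrs_idx, 1, &fail_idx, stream);
  }
#endif
  return cudaMemcpyAsync(dst, src, count, kind, stream);  // legacy default-stream path
}
```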