More tuning improvements #2201
base: develop
Conversation
```cpp
// Create context with threading disabled internally, attach shared pool
ctx = std::make_unique<MLIRContext>(registry,
                                    MLIRContext::Threading::DISABLED);
ctx->setThreadPool(getSharedThreadPool());
```
Does this mean that the compilation is parallelized internally as well? We could be oversubscribing threads because we are already parallelizing at a higher level. Can we control the number of threads in the pool?
Yes, it looks like it would use threading internally as well, but I'm not sure exactly how it works. I copied the logic from MIGraphX's compilation:
https://github.com/ROCm/AMDMIGraphX/blob/4d968f79f02de4de5aa3c36f12a179183c12c04e/src/targets/gpu/mlir.cpp#L286
Threading is enabled by default inside MLIRContext unless it is disabled explicitly:

```cpp
MLIRContextImpl(bool threadingIsEnabled)
```

Passing the thread pool explicitly is meant to reduce the overhead of creating a separate thread pool in every parallel compilation thread. It doesn't seem to actually affect runtime, but it is good practice.
This is a good optimization.
In that case we are for sure oversubscribing. I think we can optimize further by manipulating the thread count: maybe leave 50% of the compile threads for the compilation workers and the other 50% for the thread pool. There's probably a way to tell the thread pool how many threads to use. 50% is arbitrary here; a different distribution may work better. That could also be why you don't see it affect runtime: the CPU is already oversaturated with threads.
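A minimal sketch of that idea, assuming a recent LLVM where `llvm::StdThreadPool` and `MLIRContext::setThreadPool(llvm::ThreadPoolInterface &)` exist; `getSharedThreadPool()` is the hypothetical helper from the diff above, and the 50/50 split is just the arbitrary starting point mentioned:

```cpp
#include <algorithm>
#include <thread>

#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

// Hypothetical shared pool sized to half the hardware threads, leaving the
// other half for the higher-level compilation workers. The 50/50 split is
// arbitrary; a different distribution may work better.
static llvm::StdThreadPool &getSharedThreadPool() {
  static unsigned numThreads =
      std::max(1u, std::thread::hardware_concurrency() / 2);
  static llvm::StdThreadPool pool(llvm::hardware_concurrency(numThreads));
  return pool;
}
```

Each worker would then construct its MLIRContext with `Threading::DISABLED` and attach this pool via `setThreadPool()`, exactly as in the diff above, so all contexts share the capped pool instead of spawning their own.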
```diff
 static bool isThreadingGloballyDisabled() {
-#if MLIR_ENABLE_THREADS != 0
+#if LLVM_ENABLE_THREADS != 0
```
can you create an upstream PR to fix this?
The bug is in rocm/llvm. I'll post a PR there.
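For context, the upstream helper in llvm-project's MLIRContext.cpp looks roughly like this (reconstructed from memory; the exact body may differ by version):

```cpp
static bool isThreadingGloballyDisabled() {
#if LLVM_ENABLE_THREADS != 0
  // Threading can still be turned off at runtime via an environment variable.
  return ::getenv("MLIR_CONTEXT_DISABLE_THREADING") != nullptr;
#else
  return true;
#endif
}
```

If `MLIR_ENABLE_THREADS` is never defined, the preprocessor evaluates the undefined identifier as 0, so the `#if` branch is dead and the function would always report threading as disabled, which is the bug being fixed here.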
```cpp
// Copy host buffers to GPU once (reused across all config benchmarks)
for (size_t i = 0; i < bufferLengths.size(); i++) {
  HIPCHECK(hipMemcpy(gpuBuffers[i], hostBuffers[i], bufferLengths[i],
```
why change from hipMemcpyAsync to hipMemcpy?
We should be careful with this. I believe that host-to-device copies are still async unless the host memory is page-locked (allocated with hipHostMalloc); the copy just gets staged for DMA transfer, and the call does not wait for it to finish.
CUDA behaves like this, and I would suppose that HIP does as well: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html
We probably should be using hipHostMalloc anyway to speed up the memory transfer.
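A sketch of what that could look like, with a stand-in HIPCHECK macro (the real one comes from the surrounding benchmark code) and an arbitrary buffer size:

```cpp
#include <hip/hip_runtime.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Stand-in for the HIPCHECK error-checking macro used in the diff.
#define HIPCHECK(cmd)                                                  \
  do {                                                                 \
    hipError_t err = (cmd);                                            \
    if (err != hipSuccess) {                                           \
      std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err)); \
      std::abort();                                                    \
    }                                                                  \
  } while (0)

int main() {
  constexpr size_t len = 1 << 20; // arbitrary example size
  void *hostBuf = nullptr, *gpuBuf = nullptr;
  // Page-locked host memory: the transfer can DMA directly from the buffer,
  // and hipMemcpy's blocking behavior is well defined for pinned sources.
  HIPCHECK(hipHostMalloc(&hostBuf, len, hipHostMallocDefault));
  HIPCHECK(hipMalloc(&gpuBuf, len));
  std::memset(hostBuf, 1, len); // fill with benchmark input data here
  HIPCHECK(hipMemcpy(gpuBuf, hostBuf, len, hipMemcpyHostToDevice));
  // ... launch and time kernels against gpuBuf ...
  HIPCHECK(hipFree(gpuBuf));
  HIPCHECK(hipHostFree(hostBuf));
  return 0;
}
```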
It was changed from hipMemcpyAsync to hipMemcpy because it doesn't require a stream (it uses the default stream).
Ah, I see. At this point you don't have a stream?
```cpp
  gpuBuffers.push_back(gpuBuffer);
}

// Copy host buffers to GPU once (reused across all config benchmarks)
```
What if we are using atomics? The results would be different in every iteration because we don't initialize the output tensor with the same values. I think that shouldn't affect run time, but asking just in case.
We don't care about the results here, just the benchmarking.
Can we add a comment just in case, so the reader is aware of this?
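Something along these lines, as a hypothetical wording for that comment:

```cpp
// NOTE: Host buffers are copied to the GPU once and reused across all config
// benchmarks. Kernels that accumulate with atomics may therefore start each
// iteration from different output values; that is fine here because we only
// measure run time and never check results.
```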
Can you put this change in a separate commit with "[external]..."?
Motivation
rocm/llvm has it incorrectly. llvm/llvm uses LLVM_ENABLE_THREADS.

Testing
```
rocmlir-gen -m 1500 -n 1152 -k 896 --operation gemm --arch gfx942 --num_cu 80
```