-
Notifications
You must be signed in to change notification settings - Fork 52
[rocmlir-tuning-driver] Add rotating buffers and inline instruction cache flush #2188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…ly in non-benchmark mode (otherwise it kills the performance of small kernels!)
… not being invalidated with this
…ng a separate kernel
| static llvm::cl::opt<unsigned> numRotatingBuffers( | ||
| "num-rotating-buffers", | ||
| llvm::cl::desc("Number of rotating buffers to use for benchmarking"), | ||
| llvm::cl::value_desc("number of buffers"), llvm::cl::init(5)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could set the default to -1, and do this by default:
if bufferSize >= L2Size:
numRotatingBuffers = 2
else
numRotatingBuffers = 2*ceil(L2Size/bufferSize)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where bufferSize is the total size of the tensors we are going to load from global memory.
|
|
||
| // Insert the inline assembly at the beginning of the entry block | ||
| builder.setInsertionPointToStart(entryBlock); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move foundAnyKernel = true to here? so we are sure it's a valid kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in some cases, we might run out of gpu memory. Is there a way to restrict num_buffers by querying the gpu global memory size?
| } | ||
| } | ||
|
|
||
| // 3. Launch the actualkernels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actualkernels -> actual kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the cache management strategy in the rocMLIR tuning driver to improve benchmarking accuracy and reduce overhead. It replaces explicit cache flush kernel launches with two optimizations: rotating buffers for L2 cache management and inline instruction cache flush assembly embedded directly in kernels.
Changes:
- Implements rotating buffer support to avoid L2 cache reuse between iterations (configurable via
--num-rotating-buffers, default 5) - Adds inline instruction cache flush assembly (
s_icache_inv+ nops) directly into kernel entry points to eliminate separate kernel launch overhead - Restructures the compilation pipeline to insert inline assembly at the LLVM IR stage before final binary compilation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| llvm::errs() | ||
| << "numIterations must be greater than or equal to numBufferSets\n"; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message should be more descriptive. It should include the actual values to help with debugging, e.g., "numIterations (N) must be greater than or equal to numBufferSets (M)".
| llvm::errs() | |
| << "numIterations must be greater than or equal to numBufferSets\n"; | |
| llvm::errs() << "numIterations (" << iterations | |
| << ") must be greater than or equal to numBufferSets (" | |
| << numBufferSets << ")\n"; |
| std::vector<void *> &argPointers, | ||
| std::vector<double> &measurements, double &smallKernelCpuMs, | ||
| bool benchmarkMode) { | ||
| // Insert instruction cache flush assembly at the beginning of kernel functions |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment could be more detailed. Consider adding documentation about what the function does, its parameters, and return value. For example: "Inserts inline assembly at the beginning of each kernel function to flush the instruction cache. Returns success() if at least one kernel was found and modified, failure() otherwise."
| // Insert instruction cache flush assembly at the beginning of kernel functions | |
| // Inserts inline assembly at the beginning of each kernel function in the | |
| // given module to flush the instruction cache. | |
| // | |
| // Parameters: | |
| // module - The MLIR module containing LLVM GPU kernel functions. Kernel | |
| // functions are identified by the "gpu.kernel" attribute and by | |
| // having a gpu::GPUModuleOp as a parent. | |
| // | |
| // Returns: | |
| // success() if at least one kernel function was found in the module and | |
| // modified, or failure() if no applicable kernel functions were found. |
| #include "mlir/Dialect/Rock/Tuning/RockTuning.h" | ||
| #include "mlir/Dialect/Rock/utility/fusionUtils.h" | ||
| #include "mlir/Dialect/Rock/utility/loweringUtils.h" | ||
| // #include "mlir/IR/IRBuilder.h" |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commented-out include should be removed. If it's not needed, it shouldn't be in the code.
| // #include "mlir/IR/IRBuilder.h" |
| } | ||
| } | ||
|
|
||
| // 3. Launch the actualkernels. |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo in this comment: "actualkernels" should be "actual kernels" (with a space).
| // 3. Launch the actualkernels. | |
| // 3. Launch the actual kernels. |
| HIPCHECK(hipExtModuleLaunchKernel( | ||
| func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream, | ||
| argPointers.data(), nullptr, nullptr, nullptr)); | ||
| const_cast<void **>(argPointers.data()), nullptr, nullptr, nullptr)); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The const_cast is being added here, but the underlying issue is that argPointers contains pointers to non-const data. Consider whether the function signature should be changed or if there's a better way to handle this without using const_cast, as it can be error-prone and may hide const-correctness issues.
| argPointers.data(), nullptr, startEvent, stopEvent)); | ||
| HIPCHECK(hipExtModuleLaunchKernel(func, gridSize * blockSize, 1, 1, | ||
| blockSize, 1, 1, 0, stream, | ||
| const_cast<void **>(argPointers.data()), |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same const_cast issue appears here as well. Consider refactoring to avoid the need for const_cast throughout the codebase.
| HIPCHECK(hipExtModuleLaunchKernel( | ||
| func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream, | ||
| argPointers.data(), nullptr, startEvent, stopEvent)); | ||
| const_cast<void **>(warmupArgPointers.data()), nullptr, startEvent, |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same const_cast pattern appears here. Consider whether the constness of warmupArgPointers is necessary or if the type should be adjusted to avoid the cast.
| } | ||
|
|
||
| // Insert instruction cache flush assembly at the beginning of kernel | ||
| // functions This must be done after backend pipeline generates LLVM IR |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment has a spacing issue. "functions This" should be "functions. This" (period instead of space before "This").
| // functions This must be done after backend pipeline generates LLVM IR | |
| // functions. This must be done after backend pipeline generates LLVM IR |
Motivation
See https://github.com/ROCm/rocMLIR-internal/issues/2149 for motivation.
Technical Details
This PR removes calls to both
flushL2CacheandflushInstructionCache, replacing them with other approach:flushL2Cache: This PR implements rotating buffers. The idea is to allocate multiple buffers that are rotated on every iteration, with the goal of avoiding cache reuse from one iteration to another. We also have a new optionnum-rotating-buffersto control how many rotating buffers are used (default is 5).flushInstructionCache: This PR implementsinsertInstructionCacheFlush, which inserts the kernel containings_icache_invplusnops into the actual kernel that we want to executing, thus avoiding the launch overhead of theflushInstructionCachekernel. The compilation of the kernel is adapted to make a intermediate step at LLVM IR level, where we insert the assembly, which later get lowered to a binary.Test Plan
No new test was added.
Test Result
All test pass.
Submission Checklist