[rocmlir-tuning-driver] Add rotating buffers and inline instruction cache flush #2188

pabloantoniom · 2025-12-30T13:37:02Z

Motivation

See https://github.com/ROCm/rocMLIR-internal/issues/2149 for motivation.

Technical Details

This PR removes the calls to both flushL2Cache and flushInstructionCache, replacing each with a different approach:

For flushL2Cache: This PR implements rotating buffers. The idea is to allocate multiple buffers and rotate through them on every iteration, so that one iteration cannot reuse data cached by the previous one. A new option, num-rotating-buffers, controls how many rotating buffers are used (default is 5).
For flushInstructionCache: This PR implements insertInstructionCacheFlush, which inserts s_icache_inv plus nops into the actual kernel that we want to execute, thus avoiding the launch overhead of the separate flushInstructionCache kernel. The compilation of the kernel is adapted to add an intermediate step at the LLVM IR level, where we insert the assembly, which is later lowered to a binary.
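
The rotation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the RotatingBuffers struct and its forIteration helper are hypothetical names, and the sketch assumes that each "buffer set" holds one pointer per kernel argument and that at least one set exists.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the PR). Each entry of `sets` holds one
// pointer per kernel argument; the benchmark loop cycles through the sets.
struct RotatingBuffers {
  std::vector<std::vector<void *>> sets;

  // Returns the buffer set to use on the given iteration (round-robin).
  // Precondition (assumed): `sets` is not empty.
  std::vector<void *> &forIteration(std::size_t iteration) {
    return sets[iteration % sets.size()];
  }
};
```

With N sets, a given set is only revisited every N iterations, which is what makes it less likely that its data is still resident in L2 when it is used again.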

Test Plan

No new test was added.

Test Result

All tests pass.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

dhernandez0 · 2026-01-07T12:20:05Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

+static llvm::cl::opt<unsigned> numRotatingBuffers(
+    "num-rotating-buffers",
+    llvm::cl::desc("Number of rotating buffers to use for benchmarking"),
+    llvm::cl::value_desc("number of buffers"), llvm::cl::init(5));


We could set the default to -1 and, when it is left at -1, compute the value like this:

if bufferSize >= L2Size: numRotatingBuffers = 2
else: numRotatingBuffers = 2 * ceil(L2Size / bufferSize)

where bufferSize is the total size of the tensors we are going to load from global memory.
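
A small sketch of this heuristic, under the assumption that both sizes are known in bytes. The function name defaultNumRotatingBuffers and the guard for a zero bufferSize are illustrative additions, not part of the PR.

```cpp
#include <cstddef>

// Sketch of the suggested default; all names are illustrative.
// bufferSize: total bytes of the tensors the kernel loads from global memory.
// l2Size:     L2 cache size in bytes.
unsigned defaultNumRotatingBuffers(std::size_t bufferSize, std::size_t l2Size) {
  // Assumption: treat an empty buffer like the "large buffer" case to avoid
  // dividing by zero.
  if (bufferSize == 0 || bufferSize >= l2Size)
    return 2;
  // Integer form of ceil(l2Size / bufferSize).
  std::size_t ratio = (l2Size + bufferSize - 1) / bufferSize;
  return static_cast<unsigned>(2 * ratio);
}
```

For example, with a 4 MiB L2 and 1 MiB of tensors this gives 8 buffers, while any tensor footprint of 4 MiB or more gives 2.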

dhernandez0 · 2026-01-07T12:29:02Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

+
+    // Insert the inline assembly at the beginning of the entry block
+    builder.setInsertionPointToStart(entryBlock);
+


Move foundAnyKernel = true to here, so we are sure it's a valid kernel?

dhernandez0 · 2026-01-07T12:32:50Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

In some cases, we might run out of GPU memory. Is there a way to restrict num_buffers by querying the GPU global memory size?
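
One possible shape for such a restriction, as a sketch only. The helper clampNumBuffers is hypothetical; it assumes the free device memory has already been queried (in HIP, hipMemGetInfo(&freeBytes, &totalBytes) reports it) and that every buffer set needs the same number of bytes.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: limits the number of rotating buffer sets to what fits
// in the currently free device memory.
// requested:   value of num-rotating-buffers.
// bytesPerSet: bytes needed for one full set of kernel arguments.
// freeBytes:   free device memory, e.g. as reported by hipMemGetInfo.
unsigned clampNumBuffers(unsigned requested, std::size_t bytesPerSet,
                         std::size_t freeBytes) {
  if (bytesPerSet == 0)
    return requested; // nothing to allocate, so no limit applies
  std::size_t maxSets = freeBytes / bytesPerSet;
  std::size_t clamped = std::min<std::size_t>(requested, maxSets);
  // Assumption: always keep at least one set; if even that does not fit, the
  // allocation itself will report the error.
  return static_cast<unsigned>(std::max<std::size_t>(clamped, 1));
}
```

In practice some headroom would probably be left for other allocations rather than using all free memory; that margin is omitted here.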

dhernandez0 · 2026-01-07T12:34:04Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

      }
    }
+
+    // 3. Launch the actualkernels.


actualkernels -> actual kernels

Copilot

Pull request overview

This PR refactors the cache management strategy in the rocMLIR tuning driver to improve benchmarking accuracy and reduce overhead. It replaces explicit cache flush kernel launches with two optimizations: rotating buffers for L2 cache management and inline instruction cache flush assembly embedded directly in kernels.

Changes:

Implements rotating buffer support to avoid L2 cache reuse between iterations (configurable via --num-rotating-buffers, default 5)
Adds inline instruction cache flush assembly (s_icache_inv + nops) directly into kernel entry points to eliminate separate kernel launch overhead
Restructures the compilation pipeline to insert inline assembly at the LLVM IR stage before final binary compilation
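
As a rough illustration of the assembly that gets embedded at the kernel entry point, the sketch below only builds the instruction string. The function name buildICacheFlushAsm, the use of "s_nop 0" as the nop form, and the caller-chosen nop count are assumptions; the PR description only says "s_icache_inv plus nops".

```cpp
#include <string>

// Hypothetical helper: builds the text of the inline assembly, one
// instruction per line. s_icache_inv invalidates the instruction cache; the
// trailing nops pad the instruction stream after the invalidate.
std::string buildICacheFlushAsm(unsigned numNops) {
  std::string asmStr = "s_icache_inv\n";
  for (unsigned i = 0; i < numNops; ++i)
    asmStr += "s_nop 0\n";
  return asmStr;
}
```

A string like this would then be attached as an inline assembly operation at the start of each kernel's entry block in the LLVM IR, before the module is lowered to a binary.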


Copilot · 2026-01-12T09:55:59Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

+    llvm::errs()
+        << "numIterations must be greater than or equal to numBufferSets\n";


The error message should be more descriptive. It should include the actual values to help with debugging, e.g., "numIterations (N) must be greater than or equal to numBufferSets (M)".

Suggested change

-    llvm::errs()
-        << "numIterations must be greater than or equal to numBufferSets\n";
+    llvm::errs() << "numIterations (" << iterations
+                 << ") must be greater than or equal to numBufferSets ("
+                 << numBufferSets << ")\n";

Copilot · 2026-01-12T09:55:59Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

-                   std::vector<void *> &argPointers,
-                   std::vector<double> &measurements, double &smallKernelCpuMs,
-                   bool benchmarkMode) {
+// Insert instruction cache flush assembly at the beginning of kernel functions


This comment could be more detailed. Consider adding documentation about what the function does, its parameters, and return value. For example: "Inserts inline assembly at the beginning of each kernel function to flush the instruction cache. Returns success() if at least one kernel was found and modified, failure() otherwise."

Suggested change

-// Insert instruction cache flush assembly at the beginning of kernel functions
+// Inserts inline assembly at the beginning of each kernel function in the
+// given module to flush the instruction cache.
+//
+// Parameters:
+//   module - The MLIR module containing LLVM GPU kernel functions. Kernel
+//            functions are identified by the "gpu.kernel" attribute and by
+//            having a gpu::GPUModuleOp as a parent.
+//
+// Returns:
+//   success() if at least one kernel function was found in the module and
+//   modified, or failure() if no applicable kernel functions were found.

Copilot · 2026-01-12T09:55:59Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

 #include "mlir/Dialect/Rock/Tuning/RockTuning.h"
 #include "mlir/Dialect/Rock/utility/fusionUtils.h"
 #include "mlir/Dialect/Rock/utility/loweringUtils.h"
+// #include "mlir/IR/IRBuilder.h"


This commented-out include should be removed. If it's not needed, it shouldn't be in the code.

Suggested change

-// #include "mlir/IR/IRBuilder.h"

Copilot · 2026-01-12T09:56:00Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

      }
    }
+
+    // 3. Launch the actualkernels.


There's a typo in this comment: "actualkernels" should be "actual kernels" (with a space).

Suggested change

-    // 3. Launch the actualkernels.
+    // 3. Launch the actual kernels.

Copilot · 2026-01-12T09:56:00Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

      HIPCHECK(hipExtModuleLaunchKernel(
          func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream,
-          argPointers.data(), nullptr, nullptr, nullptr));
+          const_cast<void **>(argPointers.data()), nullptr, nullptr, nullptr));


The const_cast is being added here, but the underlying issue is that argPointers contains pointers to non-const data. Consider whether the function signature should be changed or if there's a better way to handle this without using const_cast, as it can be error-prone and may hide const-correctness issues.
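
Two ways the cast could be avoided are sketched below. This is an illustration under assumptions, not the PR's code: mockLaunch is a stand-in for a launch API whose parameter is `void **kernelParams` (as in hipExtModuleLaunchKernel), and launchWith / launchWithCopy are hypothetical wrappers.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a launch API that takes `void **kernelParams`. Hypothetical;
// it just counts the non-null arguments so the sketch has observable output.
std::size_t mockLaunch(void **kernelParams, std::size_t numParams) {
  std::size_t nonNull = 0;
  for (std::size_t i = 0; i < numParams; ++i)
    if (kernelParams[i] != nullptr)
      ++nonNull;
  return nonNull;
}

// Option A: take the vector by non-const reference, so data() already has
// type `void **` and no cast is needed.
std::size_t launchWith(std::vector<void *> &argPointers) {
  return mockLaunch(argPointers.data(), argPointers.size());
}

// Option B: keep the caller's vector const and pass a local copy of the
// pointer array (cheap: one pointer per kernel argument).
std::size_t launchWithCopy(const std::vector<void *> &argPointers) {
  std::vector<void *> params(argPointers);
  return mockLaunch(params.data(), params.size());
}
```

Option A changes the function signature; Option B keeps it but adds a small copy per launch, which may matter inside a timed loop.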

Copilot · 2026-01-12T09:56:00Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

-          argPointers.data(), nullptr, startEvent, stopEvent));
+      HIPCHECK(hipExtModuleLaunchKernel(func, gridSize * blockSize, 1, 1,
+                                        blockSize, 1, 1, 0, stream,
+                                        const_cast<void **>(argPointers.data()),


The same const_cast issue appears here as well. Consider refactoring to avoid the need for const_cast throughout the codebase.

Copilot · 2026-01-12T09:56:00Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

        HIPCHECK(hipExtModuleLaunchKernel(
            func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream,
-            argPointers.data(), nullptr, startEvent, stopEvent));
+            const_cast<void **>(warmupArgPointers.data()), nullptr, startEvent,


The same const_cast pattern appears here. Consider whether the constness of warmupArgPointers is necessary or if the type should be adjusted to avoid the cast.

Copilot · 2026-01-12T09:56:01Z

mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp

+      }
+
+      // Insert instruction cache flush assembly at the beginning of kernel
+      // functions This must be done after backend pipeline generates LLVM IR


This comment has a spacing issue. "functions This" should be "functions. This" (period instead of space before "This").

Suggested change

-      // functions This must be done after backend pipeline generates LLVM IR
+      // functions. This must be done after backend pipeline generates LLVM IR

pabloantoniom added 7 commits December 29, 2025 15:26

Implement rotating buffers

7bd5ce6

Prepare pointers before benchmarking and run flushInstructionCache on…

9539730

…ly in non-benchmark mode (otherwise it kills the performance of small kernels!)

Tried idea of copying kernel binaries, but results suggest that L1 is…

72126ea

… not being invalidated with this

Undo previous commit

1199685

New idea: insert s_icache_inv inside the real kernel rather than havi…

9899514

…ng a separate kernel

Remove function to dump asm (to check we insert the right thing. Cleanup

a35ea98

Cleanup

61dd193

pabloantoniom requested a review from causten as a code owner December 30, 2025 13:37

pabloantoniom requested review from dhernandez0, justinrosner, stefankoncarevic and umangyadav December 31, 2025 06:18

dhernandez0 reviewed Jan 7, 2026

dhernandez0 requested a review from Copilot January 12, 2026 09:48

Copilot started reviewing on behalf of dhernandez0 January 12, 2026 09:51

Copilot AI reviewed Jan 12, 2026

3 participants