Skip to content

Conversation

@pabloantoniom
Copy link
Contributor

Motivation

See https://github.com/ROCm/rocMLIR-internal/issues/2149 for motivation.

Technical Details

This PR removes calls to both flushL2Cache and flushInstructionCache, replacing them with other approach:

  • For flushL2Cache: This PR implements rotating buffers. The idea is to allocate multiple buffers that are rotated on every iteration, with the goal of avoiding cache reuse from one iteration to another. We also have a new option num-rotating-buffers to control how many rotating buffers are used (default is 5).
  • For flushInstructionCache: This PR implements insertInstructionCacheFlush, which inserts the kernel containing s_icache_inv plus nops into the actual kernel that we want to executing, thus avoiding the launch overhead of the flushInstructionCache kernel. The compilation of the kernel is adapted to make a intermediate step at LLVM IR level, where we insert the assembly, which later get lowered to a binary.

Test Plan

No new test was added.

Test Result

All test pass.

Submission Checklist

static llvm::cl::opt<unsigned> numRotatingBuffers(
"num-rotating-buffers",
llvm::cl::desc("Number of rotating buffers to use for benchmarking"),
llvm::cl::value_desc("number of buffers"), llvm::cl::init(5));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could set the default to -1, and do this by default:

if bufferSize >= L2Size:
 numRotatingBuffers = 2
else
 numRotatingBuffers = 2*ceil(L2Size/bufferSize)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where bufferSize is the total size of the tensors we are going to load from global memory.


// Insert the inline assembly at the beginning of the entry block
builder.setInsertionPointToStart(entryBlock);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move foundAnyKernel = true to here? so we are sure it's a valid kernel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in some cases, we might run out of gpu memory. Is there a way to restrict num_buffers by querying the gpu global memory size?

}
}

// 3. Launch the actualkernels.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actualkernels -> actual kernels

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the cache management strategy in the rocMLIR tuning driver to improve benchmarking accuracy and reduce overhead. It replaces explicit cache flush kernel launches with two optimizations: rotating buffers for L2 cache management and inline instruction cache flush assembly embedded directly in kernels.

Changes:

  • Implements rotating buffer support to avoid L2 cache reuse between iterations (configurable via --num-rotating-buffers, default 5)
  • Adds inline instruction cache flush assembly (s_icache_inv + nops) directly into kernel entry points to eliminate separate kernel launch overhead
  • Restructures the compilation pipeline to insert inline assembly at the LLVM IR stage before final binary compilation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +541 to +542
llvm::errs()
<< "numIterations must be greater than or equal to numBufferSets\n";
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message should be more descriptive. It should include the actual values to help with debugging, e.g., "numIterations (N) must be greater than or equal to numBufferSets (M)".

Suggested change
llvm::errs()
<< "numIterations must be greater than or equal to numBufferSets\n";
llvm::errs() << "numIterations (" << iterations
<< ") must be greater than or equal to numBufferSets ("
<< numBufferSets << ")\n";

Copilot uses AI. Check for mistakes.
std::vector<void *> &argPointers,
std::vector<double> &measurements, double &smallKernelCpuMs,
bool benchmarkMode) {
// Insert instruction cache flush assembly at the beginning of kernel functions
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment could be more detailed. Consider adding documentation about what the function does, its parameters, and return value. For example: "Inserts inline assembly at the beginning of each kernel function to flush the instruction cache. Returns success() if at least one kernel was found and modified, failure() otherwise."

Suggested change
// Insert instruction cache flush assembly at the beginning of kernel functions
// Inserts inline assembly at the beginning of each kernel function in the
// given module to flush the instruction cache.
//
// Parameters:
// module - The MLIR module containing LLVM GPU kernel functions. Kernel
// functions are identified by the "gpu.kernel" attribute and by
// having a gpu::GPUModuleOp as a parent.
//
// Returns:
// success() if at least one kernel function was found in the module and
// modified, or failure() if no applicable kernel functions were found.

Copilot uses AI. Check for mistakes.
#include "mlir/Dialect/Rock/Tuning/RockTuning.h"
#include "mlir/Dialect/Rock/utility/fusionUtils.h"
#include "mlir/Dialect/Rock/utility/loweringUtils.h"
// #include "mlir/IR/IRBuilder.h"
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented-out include should be removed. If it's not needed, it shouldn't be in the code.

Suggested change
// #include "mlir/IR/IRBuilder.h"

Copilot uses AI. Check for mistakes.
}
}

// 3. Launch the actualkernels.
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo in this comment: "actualkernels" should be "actual kernels" (with a space).

Suggested change
// 3. Launch the actualkernels.
// 3. Launch the actual kernels.

Copilot uses AI. Check for mistakes.
HIPCHECK(hipExtModuleLaunchKernel(
func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream,
argPointers.data(), nullptr, nullptr, nullptr));
const_cast<void **>(argPointers.data()), nullptr, nullptr, nullptr));
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The const_cast is being added here, but the underlying issue is that argPointers contains pointers to non-const data. Consider whether the function signature should be changed or if there's a better way to handle this without using const_cast, as it can be error-prone and may hide const-correctness issues.

Copilot uses AI. Check for mistakes.
argPointers.data(), nullptr, startEvent, stopEvent));
HIPCHECK(hipExtModuleLaunchKernel(func, gridSize * blockSize, 1, 1,
blockSize, 1, 1, 0, stream,
const_cast<void **>(argPointers.data()),
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same const_cast issue appears here as well. Consider refactoring to avoid the need for const_cast throughout the codebase.

Copilot uses AI. Check for mistakes.
HIPCHECK(hipExtModuleLaunchKernel(
func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream,
argPointers.data(), nullptr, startEvent, stopEvent));
const_cast<void **>(warmupArgPointers.data()), nullptr, startEvent,
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same const_cast pattern appears here. Consider whether the constness of warmupArgPointers is necessary or if the type should be adjusted to avoid the cast.

Copilot uses AI. Check for mistakes.
}

// Insert instruction cache flush assembly at the beginning of kernel
// functions This must be done after backend pipeline generates LLVM IR
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has a spacing issue. "functions This" should be "functions. This" (period instead of space before "This").

Suggested change
// functions This must be done after backend pipeline generates LLVM IR
// functions. This must be done after backend pipeline generates LLVM IR

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants