
Conversation

@umangyadav (Member) commented Jan 8, 2026

Motivation

  • Creates the dialect registry only once.
  • Uses the same thread pool across all compilation threads.
  • Renames MLIR_ENABLE_THREADS to LLVM_ENABLE_THREADS. There is no compile-time flag named MLIR_ENABLE_THREADS; rocm/llvm has it incorrectly, while llvm/llvm uses LLVM_ENABLE_THREADS.
  • Copies host buffers to the GPU only once and reuses the same buffers when benchmarking all perf configs.

Testing

rocmlir-gen -m 1500 -n 1152 -k 896 --operation gemm --arch gfx942 --num_cu 80

Tuning Mode    develop     tuningImprovements    Improvement
Quick          1.50s       1.53s                 -2% (slower)
Full           19.5s       19.5s                 ~0% (same)
Exhaustive     4m 2.6s     3m 59.5s              ~1.3% faster

// Create context with threading disabled internally, attach shared pool
ctx = std::make_unique<MLIRContext>(registry,
                                    MLIRContext::Threading::DISABLED);
ctx->setThreadPool(getSharedThreadPool());
Contributor

Does this mean that the compilation is parallelized internally as well? We could be oversubscribing threads because we are already parallelizing at a higher level. Can we control the number of threads in the pool?

Member Author

Yes, it looks like it would use threading internally as well, but I'm not sure exactly how it works. I copied the logic from MIGraphX's compilation:
https://github.com/ROCm/AMDMIGraphX/blob/4d968f79f02de4de5aa3c36f12a179183c12c04e/src/targets/gpu/mlir.cpp#L286

Member Author

Threading is enabled by default inside MLIRContext unless it is disabled explicitly:

MLIRContextImpl(bool threadingIsEnabled)

Passing the thread pool explicitly is meant to reduce the overhead of creating a separate thread pool in each parallel compilation thread. It does not appear to affect runtime noticeably, but it is good practice.
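
For reference, roughly what the shared-pool setup looks like end to end. This is a sketch, not the PR's exact code: it assumes a recent LLVM where llvm::DefaultThreadPool implements llvm::ThreadPoolInterface (which MLIRContext::setThreadPool accepts), and makeContext / getSharedThreadPool are illustrative names (the latter mirrors the helper referenced in the diff above).

#include "mlir/IR/DialectRegistry.h"
#include "mlir/IR/MLIRContext.h"
#include "llvm/Support/ThreadPool.h"

#include <memory>

// One process-wide pool, shared by every per-config MLIRContext.
static llvm::DefaultThreadPool &getSharedThreadPool() {
  static llvm::DefaultThreadPool pool;
  return pool;
}

static std::unique_ptr<mlir::MLIRContext>
makeContext(const mlir::DialectRegistry &registry) {
  // Disable the context's own threading so it does not spin up a private
  // pool, then attach the shared one.
  auto ctx = std::make_unique<mlir::MLIRContext>(
      registry, mlir::MLIRContext::Threading::DISABLED);
  ctx->setThreadPool(getSharedThreadPool());
  return ctx;
}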

Contributor @mirza-halilcevic commented Jan 8, 2026

This is a good optimization.

In that case we are for sure oversubscribing. I think we can optimize further by tuning the thread count: maybe leave 50% of the compile threads for the compilation workers and the other 50% for the thread pool. There is probably a way to tell the thread pool how many threads to use. The 50% split is arbitrary here; a different distribution may work better. That could also be why you don't see it affect runtime: the CPU is already oversaturated with threads.
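
If we do want an explicit split, the pool size can be capped when the shared pool is constructed. A sketch under the same assumptions as above (getSharedThreadPool is the hypothetical shared-pool helper, and the 50/50 split is just the suggestion from this comment):

#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

#include <algorithm>
#include <thread>

// Give roughly half of the hardware threads to the shared MLIR pool and
// leave the rest for the compilation workers.
static llvm::DefaultThreadPool &getSharedThreadPool() {
  static llvm::DefaultThreadPool pool(llvm::hardware_concurrency(
      std::max(1u, std::thread::hardware_concurrency() / 2)));
  return pool;
}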


 static bool isThreadingGloballyDisabled() {
-#if MLIR_ENABLE_THREADS != 0
+#if LLVM_ENABLE_THREADS != 0
Contributor

Can you create an upstream PR to fix this?

Member Author

The bug is in rocm/llvm. I'll post a PR there.


// Copy host buffers to GPU once (reused across all config benchmarks)
for (size_t i = 0; i < bufferLengths.size(); i++) {
  HIPCHECK(hipMemcpy(gpuBuffers[i], hostBuffers[i], bufferLengths[i],
                     hipMemcpyHostToDevice));
}
Contributor

Why change from hipMemcpyAsync to hipMemcpy?

Contributor

We should be careful with this. I believe that host-to-device copies are still async unless the host memory is allocated page-locked (allocated with hipHostMalloc). It just stages it for DMA transfer and does not wait for the copy to finish.

CUDA behaves like this, I would suppose that HIP does as well: https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html
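
If we want the completion point to be unambiguous regardless of how HIP handles the copy under the hood, an async copy on an explicit stream followed by a stream synchronize does it. A sketch only; copyAndWait is a hypothetical helper, not something in this PR:

#include <hip/hip_runtime.h>
#include <cstdlib>

// Issue the copy on an explicit stream and block until it has finished, so
// the benchmark can never overlap an in-flight transfer.
static void copyAndWait(void *dst, const void *src, size_t numBytes,
                        hipStream_t stream) {
  if (hipMemcpyAsync(dst, src, numBytes, hipMemcpyHostToDevice, stream) !=
      hipSuccess)
    std::abort();
  if (hipStreamSynchronize(stream) != hipSuccess)
    std::abort();
}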

Contributor

We probably should be using hipHostMalloc anyway to speed up the memory transfer.
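
Something like this sketch; stageInput is a hypothetical helper and HIPCHECK here stands in for whatever error-checking macro the tuning driver already uses:

#include <hip/hip_runtime.h>
#include <cstdlib>
#include <cstring>

// Stand-in for the driver's existing error-checking macro.
#define HIPCHECK(expr)                                                         \
  do {                                                                         \
    if ((expr) != hipSuccess)                                                  \
      std::abort();                                                            \
  } while (0)

// Page-locked (pinned) host memory lets the host-to-device copy go straight
// over DMA instead of through an intermediate pageable staging buffer.
static void stageInput(void *gpuBuffer, const void *hostData, size_t numBytes) {
  void *pinnedHost = nullptr;
  HIPCHECK(hipHostMalloc(&pinnedHost, numBytes, hipHostMallocDefault));
  std::memcpy(pinnedHost, hostData, numBytes); // fill the pinned staging buffer
  HIPCHECK(hipMemcpy(gpuBuffer, pinnedHost, numBytes, hipMemcpyHostToDevice));
  HIPCHECK(hipHostFree(pinnedHost));
}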

Member Author

It was changed from hipMemcpyAsync to hipMemcpy because it doesn't require a stream, i.e., it uses the default stream.

Contributor

Ah, I see. At this point you don't have a stream?

gpuBuffers.push_back(gpuBuffer);
}

// Copy host buffers to GPU once (reused across all config benchmarks)
Contributor

What if we are using atomics? The results would be different in every iteration because we don't initialize the output tensor with the same values. I think that shouldn't affect runtime, but asking just in case.

Member Author

We don't care about the results here, just the benchmarking.

Contributor

Can we add a comment, just in case, so the reader is aware of this?

Contributor

Can you put this change in a separate commit with "[external]..."?
