(Async)DirectToLDS support in WMMA #2182

pabloantoniom · 2025-12-22T14:53:33Z

Motivation

In #2072 we introduced support for async DirectToLDS for gfx1250. However, DirectToLDS is not supported for WMMA, so gfx1250 would still be unable to use async DirectToLDS.

This PR implements it, plus other required features, in order to give full support for async DirectToLDS in gfx1250.

Technical Details

This PR adds the following:

Support for DirectToLDS for WMMA (gfx1250 is the only GPU available that can exploit it).
Changed the way Async DirectToLDS loads from memory: Previously we assumed it behaved like a gather (i.e., like traditional DirectToLDS), but it actually works like a normal load, so we must take into account the thread ID when computing the destination indices for the op.
Generate WaitAsynccntOp when lowering rock::AsyncWaitOp. Similarly to how SWaitcntOp is needed for traditional DirectToLDS, WaitAsynccntOp is required to wait for async load ops.
Added support for out-of-bounds checks (emitOobChecks) in the GlobalLoadToLDS lowering in SugarToLoops. The idea is to follow the GlobalLoad lowering, although doing so is not straightforward.
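
To make the second bullet concrete, here is a minimal sketch of the indexing difference. It is an illustration only, not code from this PR: the helper names and the assumption that consecutive lanes write consecutive `bytesPerLane`-sized slots are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not rocMLIR code.
// Gather-like behaviour (what was previously assumed): every lane is given the
// same destination index, so the compiler only computes a wave-level base.
inline int64_t gatherDestOffset(int64_t waveBase) {
  assert(waveBase >= 0);
  return waveBase;
}

// Normal-load-like behaviour (async DirectToLDS as described above): each lane
// must add its own offset to the wave-level base. The linear
// laneId * bytesPerLane layout is an assumption made for this sketch.
inline int64_t asyncDestOffset(int64_t waveBase, int64_t laneId,
                               int64_t bytesPerLane) {
  assert(waveBase >= 0 && laneId >= 0 && bytesPerLane > 0);
  return waveBase + laneId * bytesPerLane;
}
```

Under these assumptions, lane 3 of a 16-byte-per-lane load with base 256 writes at byte 304, whereas the gather-style computation yields 256 for every lane.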

More details about the `emitOobChecks` support

We cannot reuse code from the GlobalLoad lowering in the GlobalLoadToLDS lowering because the ops have one radical difference: the former returns an SSA value with the result, whereas the latter does not. We might want to clean this up in the future.
In the else branch that we generate when out-of-bounds checks are needed, we have to store zeros into LDS. However, GlobalLoadToLDS has a transferType, meaning that we might need to write multiple elements (e.g., if transferType is f64 but the LDS buffer is f16). My approach is to use InBoundsStoreOp and extend that op so it can write multiple elements. Note that we might want to separate this into another PR.
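
As a rough sketch of the arithmetic behind the f64/f16 example (a hypothetical helper, not the code in SugarToLoops): the number of LDS elements to zero is the ratio of the two bit widths, and the combination is rejected when the source is narrower than the destination or not an integer multiple of it.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper, not rocMLIR code.
// Returns how many destination elements one transfer covers, or std::nullopt
// when the bit widths are incompatible.
inline std::optional<int64_t> zeroFillElementCount(int64_t srcBits,
                                                   int64_t destBits) {
  assert(srcBits > 0 && destBits > 0);
  if (srcBits < destBits || srcBits % destBits != 0)
    return std::nullopt; // narrower source or non-integer ratio
  return srcBits / destBits; // 1 when the widths match
}
```

For transferType f64 (64 bits) and an f16 buffer (16 bits) this gives 4 elements, matching the example above.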

Test Plan

Added E2E test (GemmAsyncDirectToLDS).
Adapted LIT tests due to change in rock::AsyncWaitOp lowering.
Added LIT test in lowering_global_load_to_lds to exercise the case of OOB checks.

Test Result

All new E2E tests pass on the emulator.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

pabloantoniom · 2025-12-22T14:54:47Z

mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp

  int64_t mPerWave = tuningParams.getMPerWave();
  int64_t nPerWave = tuningParams.getNPerWave();
  int64_t kPack = tuningParams.getKpack();
-  // TODO: gfx10 supports directToLDS. Implement it.


Note for reviewers: This comment is wrong. gfx10 supports DirectToLDS, but it does not support WMMA, so the comment does not make sense here.

pabloantoniom · 2025-12-22T15:15:53Z

mlir/lib/Dialect/Rock/Transforms/ThreadwiseGemmLowering.cpp

      // instruction
      ldsIndex = arith::AddIOp::create(b, loc, ldsIndex, ldsIndexWave);
+
+      if (isAsyncDirectToLDSSupported(maybeArch.value())) {


Note for reviewers: Is this the right place to implement this?

pabloantoniom · 2025-12-22T15:32:58Z

mlir/test/Dialect/Rock/async_wait_lowering.mlir

@@ -1,21 +1,27 @@
-// RUN: rocmlir-opt %s --rock-to-rocdl


pabloantoniom · 2025-12-23T10:35:49Z

mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp

    } else {
-      b.replaceOpWithNewOp<memref::StoreOp>(op, op.getData(), op.getDest(),
-                                            op.getCoords());
+      Location loc = op.getLoc();


Note for reviewers: This is needed because of the changes in GlobalLoadToLDSRewritePattern. It is extending the InBoundsStore capabilities, but it's not directly related to this PR, so maybe we want to move it into a separate PR to have a cleaner git history.

pabloantoniom · 2025-12-23T10:37:29Z

mlir/test/e2e/GemmAsyncDirectToLDS.toml

+config = "-g 1 -m 512 -k 1 -n 512"
+
+[[suite.test]]
+config = "-g 1 -m 512 -k 32 -n 512"


Note for reviewers: The DirectToLDS counterpart also has:

[[suite.test]]
config = "-g 3 -m 1024 -k 768 -n 1024"

but on emulator this takes too long (>30m) so I removed it.

pabloantoniom · 2025-12-23T11:19:13Z

mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp

+      b.setInsertionPointToEnd(elseBlock);
+      Type transferType = op.getTransferType();
+      Value zeroValue = createZeroConstantOp(b, loc, transferType);
+      InBoundsStoreOp::create(b, loc, zeroValue, dest, destCoords);


Note for reviewers: Here I use InBoundsStoreOp to store zeros. Assume a case like the following: transferType=f64 and the LDS buffer is f16. We would need to write 4 x f16 of zeros. But what if the first 32 bits are actually in bounds and the last 32 bits are out of bounds, meaning that we should read 32 bits from LDS and then set the remaining 32 bits to zero? This is not handled here. Can that happen? How should it be handled?
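
To make the partial case concrete, the following sketch counts how many leading elements of one wide transfer are still in bounds. It is only an illustration of the arithmetic: the helper and its parameters are hypothetical, and whether this situation can actually arise in the lowering is exactly the open question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not rocMLIR code.
// startElem:        index of the first element covered by the transfer
// elemsPerTransfer: elements covered by one transfer (4 for f64 over f16)
// validElems:       number of in-bounds elements along this dimension
inline int64_t inBoundsPrefix(int64_t startElem, int64_t elemsPerTransfer,
                              int64_t validElems) {
  assert(startElem >= 0 && elemsPerTransfer > 0 && validElems >= 0);
  return std::clamp<int64_t>(validElems - startElem, 0, elemsPerTransfer);
}
```

A result equal to `elemsPerTransfer` is fully in bounds and 0 is fully out of bounds. Any value in between is the mixed case described above, e.g. 2 when a 4 x f16 transfer starts at element 8 of a 10-element extent.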

dhernandez0 · 2026-01-07T13:00:48Z

mlir/lib/Dialect/Rock/Transforms/LowerRockOpsToROCDLOps.cpp

+                 << "AsyncWaitOpConversion: arch supports AsyncDirectToLDS\n");
+      unsigned asyncCnt = std::min(63u, 0u);
+      ROCDL::WaitAsynccntOp::create(rewriter, loc, asyncCnt);
+      ROCDL::SBarrierOp::create(rewriter, loc);


why do we need sbarrier here?

dhernandez0 · 2026-01-07T13:01:59Z

mlir/lib/Dialect/Rock/Transforms/LowerRockOpsToROCDLOps.cpp

I think we need to get rid of the barriers that RockPipeline will introduce for gfx1250 as well (I think WaitAsynccntOp is enough). Is there a ticket to do that?

dhernandez0 · 2026-01-07T13:02:48Z

mlir/lib/Dialect/Rock/Transforms/LowerRockOpsToROCDLOps.cpp

+    if (supportsAsyncDirectToLDS) {
+      LLVM_DEBUG(llvm::dbgs()
+                 << "AsyncWaitOpConversion: arch supports AsyncDirectToLDS\n");
+      unsigned asyncCnt = std::min(63u, 0u);


is the max 63 for WaitAsynccntOp as well?

dhernandez0 · 2026-01-07T13:03:41Z

mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp

      coords.push_back(b.createOrFold<ConstantIndexOp>(loc, 0));
    }
+
+    Type originalLoadedType = op.getTransferType();


I think this could be another PR? it's not related to gfx1250, right?

dhernandez0 · 2026-01-07T13:04:15Z

mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp

    } else {
-      b.replaceOpWithNewOp<memref::StoreOp>(op, op.getData(), op.getDest(),
-                                            op.getCoords());
+      Location loc = op.getLoc();


yes, I'd prefer this to be an independent PR.

dhernandez0 · 2026-01-07T13:06:45Z

mlir/lib/Dialect/Rock/Transforms/ThreadwiseGemmLowering.cpp

+          return emitError(loc)
+                 << "128 bits direct to LDS is not supported by the hardware";
        }
      } else {


I think this else (and the above if) should only run for direct to lds, not async.

dhernandez0 · 2026-01-07T13:09:12Z

mlir/lib/Dialect/Rock/Transforms/ThreadwiseGemmLowering.cpp

+        // the same output index), async DirectToLDS actually works like a
+        // traditional load, so we must take into account the thread-specific
+        // offset here.
+        if (loadTypeByteWidth == 16) {


instead of this, it'd be better to have a transform for the output tensor as well. So, instead of having linear indexing of the output we could have anything.

that would allow us to store in LDS with [kpackperblock, dperblock, kpack] layout.
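
For illustration, a minimal sketch of what indexing into a [kpackperblock, dperblock, kpack] layout could look like, assuming plain row-major linearization. The struct and function names are hypothetical and this is not how rocMLIR transform maps are expressed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shape descriptor for an LDS tile laid out as
// [kpackPerBlock, dPerBlock, kpack], innermost dimension last.
struct LdsTileShape {
  int64_t kpackPerBlock;
  int64_t dPerBlock;
  int64_t kpack;
};

// Row-major linear element index for (kOuter, d, kInner). Row-major order is
// an assumption of this sketch.
inline int64_t ldsLinearIndex(const LdsTileShape &shape, int64_t kOuter,
                              int64_t d, int64_t kInner) {
  assert(kOuter >= 0 && kOuter < shape.kpackPerBlock);
  assert(d >= 0 && d < shape.dPerBlock);
  assert(kInner >= 0 && kInner < shape.kpack);
  return (kOuter * shape.dPerBlock + d) * shape.kpack + kInner;
}
```

With a 4 x 64 x 8 tile, stepping `d` by one moves 8 elements and stepping `kOuter` by one moves 512 elements.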

dhernandez0 · 2026-01-07T13:17:01Z

mlir/test/e2e/GemmAsyncDirectToLDS.cfg

+if not config.arch.startswith("gfx1250"):
+  config.unsupported = True
+
+# This is useful when running on the emulator, to propagate the environment variables


we probably don't want to merge this

dhernandez0 · 2026-01-07T13:17:33Z

mlir/test/e2e/GemmAsyncDirectToLDS.toml

can we have attention tests as well?

dhernandez0 · 2026-01-07T13:17:53Z

mlir/test/e2e/GemmAsyncDirectToLDS.toml

how is this different than direct to lds tests?

Copilot

Pull request overview

This PR implements support for Async DirectToLDS in WMMA operations for gfx1250, completing the work started in #2072 which introduced basic async load support but lacked WMMA compatibility.

Key changes include:

Enabled DirectToLDS for WMMA operations with support for multiple load widths (8, 32, 64, and 128 bits) on gfx1250
Modified async DirectToLDS to use normal load semantics instead of gather semantics, accounting for thread-specific offsets
Added WaitAsynccntOp generation for async operations and out-of-bounds checking for GlobalLoadToLDS

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File	Description
mlir/test/e2e/GemmAsyncDirectToLDS.toml	E2E test configuration for async DirectToLDS with multiple GEMM configurations
mlir/test/e2e/GemmAsyncDirectToLDS.cfg	Test filter restricting tests to gfx1250 architecture
mlir/test/e2e/CMakeLists.txt	Added GemmAsyncDirectToLDS to test configuration list
mlir/test/Dialect/Rock/async_load_to_lds.mlir	LIT test verifying async_load_to_lds operation generation
mlir/test/Dialect/Rock/lowering_global_load_to_lds.mlir	Added test case for OOB checks in async DirectToLDS
mlir/test/Dialect/Rock/async_wait_lowering.mlir	Updated to test WaitAsynccntOp lowering for gfx1250
mlir/lib/Dialect/Rock/utility/loweringUtils.cpp	Extended DirectToLDS logic to support all load widths for async operations
mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp	Implemented WMMA LDS buffer wrapping for DirectToLDS with proper transform maps
mlir/lib/Dialect/Rock/Transforms/ThreadwiseGemmLowering.cpp	Added thread-specific offset calculations for async DirectToLDS
mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp	Implemented OOB checks and enhanced InBoundsStoreOp to handle type width mismatches
mlir/lib/Dialect/Rock/Transforms/LowerRockOpsToROCDLOps.cpp	Added WaitAsynccntOp lowering path for async DirectToLDS architectures
mlir/lib/Dialect/Rock/IR/RockDialect.cpp	Extended GlobalLoadToLDSOp verification to support 8, 32, 64, and 128-bit loads

Copilot · 2026-01-07T13:25:38Z

mlir/lib/Dialect/Rock/Transforms/LowerRockOpsToROCDLOps.cpp

+    if (supportsAsyncDirectToLDS) {
+      LLVM_DEBUG(llvm::dbgs()
+                 << "AsyncWaitOpConversion: arch supports AsyncDirectToLDS\n");
+      unsigned asyncCnt = std::min(63u, 0u);


The asyncCnt calculation uses std::min(63u, 0u) which always returns 0, ignoring the operation's numInst attribute. This should be std::min(63u, op.getNumInst()) to properly clamp the instruction count, similar to the vmCnt calculation in the else branch.

Suggested change

-      unsigned asyncCnt = std::min(63u, 0u);
+      unsigned asyncCnt = std::min(63u, op.getNumInst());
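
A standalone sketch of the clamping behaviour under discussion. The limit 63 is copied from the snippet above. Whether the same maximum applies to WaitAsynccntOp is asked elsewhere in this review and is not settled here, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Limit taken from the snippet under review; treat it as an assumption for
// the async counter.
constexpr unsigned kMaxWaitCount = 63u;

// Hypothetical helper: clamp the requested count to the encodable maximum.
inline unsigned clampWaitCount(unsigned numInst) {
  return std::min(kMaxWaitCount, numInst);
}

// What the original line computes: the minimum of two constants, which is
// always 0 regardless of the op's attribute.
inline unsigned originalAsyncCnt() { return std::min(63u, 0u); }
```

Here `clampWaitCount(5)` stays 5 and `clampWaitCount(100)` saturates at 63, while `originalAsyncCnt()` is 0 in every case, i.e. it always waits for all outstanding async loads.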

Copilot · 2026-01-07T13:25:39Z

mlir/lib/Dialect/Rock/Transforms/SugarToLoops.cpp

+            "Source element type is larger than destination, but not a "
+            "multiple of the destination element type");


The error message states "Source element type is larger than destination" but this error is triggered in the else clause that handles cases where either srcBits <= destBits OR srcBits is not a multiple of destBits. The message should be more accurate to cover all these cases, for example: "Source and destination element types have incompatible bit widths (source must equal destination or be a multiple of destination)".

Suggested change

-            "Source element type is larger than destination, but not a "
-            "multiple of the destination element type");
+            "Source and destination element types have incompatible bit "
+            "widths (source bit width must equal destination bit width or be "
+            "an integer multiple of it)");

Copilot · 2026-01-07T13:25:39Z

mlir/lib/Dialect/Rock/utility/AccelEmitter.cpp

+    // handle both cases.
+    kVec = kPack;
+    kPerBlock *= kPack;
+    assert(!rotateDWithK && "rotateDWithK must not be enabled for directToLds");


Inconsistent capitalization: "directToLds" should be "directToLDS" to match the casing used throughout the codebase (e.g., "DirectToLDS", "directToLDS", "asyncDirectToLDS").

Suggested change

-    assert(!rotateDWithK && "rotateDWithK must not be enabled for directToLds");
+    assert(!rotateDWithK && "rotateDWithK must not be enabled for directToLDS");

pabloantoniom added 5 commits December 22, 2025 08:48

AsyncDirectToLDS support in WMMA

f4516b1

Fix

0685512

WIP

1bb9ba7

Async DirectToLDS is a normal load, not a gather!

289931d

Cleanup. E2E test passing

d7ad023

pabloantoniom commented Dec 22, 2025

pabloantoniom added 3 commits December 22, 2025 09:26

Properly implement this

fe53bd6

nit

badcd39

Proper error messages

281e87d

pabloantoniom commented Dec 22, 2025

mlir/test/Dialect/Rock/async_wait_lowering.mlir

@@ -1,21 +1,27 @@

// RUN: rocmlir-opt %s --rock-to-rocdl

pabloantoniom Dec 22, 2025

😅

pabloantoniom marked this pull request as ready for review December 22, 2025 15:35

pabloantoniom requested a review from causten as a code owner December 22, 2025 15:35

pabloantoniom added 6 commits December 23, 2025 02:55

WIP

a3ac951

Implement support for oobchecks in async directolds

b33eba8

Extend inboundsStoreOp to support different source and dest bitwidths

3640e5d

Extend E2E test suite (all tests passing!)

840c225

LIT tests for global_load_to_lds lowering

7bc563e

clang-format

4ce7f1e

pabloantoniom commented Dec 23, 2025

pabloantoniom added 3 commits December 23, 2025 04:44

Fixes

4113841

Fixes

db19f89

Proper test

219085e

pabloantoniom commented Dec 23, 2025

pabloantoniom changed the title ~~[WIP] (Async)DirectToLDS support in WMMA~~ (Async)DirectToLDS support in WMMA Dec 23, 2025

pabloantoniom requested review from dhernandez0, justinrosner, stefankoncarevic and umangyadav December 23, 2025 13:22

dhernandez0 reviewed Jan 7, 2026

dhernandez0 requested a review from Copilot January 7, 2026 13:18

Copilot started reviewing on behalf of dhernandez0 January 7, 2026 13:20

Copilot AI reviewed Jan 7, 2026
