Grouped GEMM with ck_tile by matthiasdiener · Pull Request #434 · ROCm/TransformerEngine

matthiasdiener · 2026-01-28T15:49:27Z

Description

See https://github.com/ROCm/frameworks-internal/issues/13792 for context.

Primus-Turbo implementation: https://github.com/AMD-AGI/Primus-Turbo/blob/5bcd13785ef380fec0eec0911b7d6db5e606143e/csrc/kernels/grouped_gemm

TODOs:

Enable tests in test_numerics.py
Make kernels selectable & tunable
Handle gelu/bias (or make sure these are not passed in)
Performance analysis and improvements: https://github.com/ROCm/frameworks-internal/issues/15185#issuecomment-3863052452
More tests

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Implement a ck_tile-based grouped GEMM, similar to the CUTLASS grouped GEMM

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

tests/pytorch/test_numerics.py

transformer_engine/common/gemm/ck_grouped_gemm.cpp

transformer_engine/common/gemm/ck_grouped_gemm.cuh

transformer_engine/common/gemm/cublaslt_gemm.cu

transformer_engine/common/CMakeLists.txt

ipanfilo · 2026-02-21T17:05:01Z

tests/pytorch/test_numerics.py

 )
 if IS_HIP_EXTENSION:
-    from transformer_engine.pytorch.utils import is_mi200, is_mi308
+    from transformer_engine.pytorch.utils import is_mi200, is_mi308, is_mi300_class


The is_mi300_class method is not needed; it is just the gfx 9.4 family.

Removed in 7910038

ipanfilo · 2026-02-21T17:06:34Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

@@ -0,0 +1,276 @@
+/* Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. */


Add proper copyright header

Thanks, done in f680d6a

ipanfilo · 2026-02-21T17:09:31Z

transformer_engine/common/gemm/ck_grouped_gemm.h

@@ -0,0 +1,11 @@
+/* Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. */


Put proper copyright header

Thanks, done in f680d6a

ipanfilo · 2026-02-21T17:19:16Z

transformer_engine/common/gemm/cublaslt_gemm.cu

+#endif
+
  const int current_device = transformer_engine::cuda::current_device();
  const bool is_hopper = (transformer_engine::cuda::sm_arch(current_device) == 90);


These constants are not used on ROCm

I disabled them in e8ebb0e.

ipanfilo · 2026-02-21T17:26:07Z

transformer_engine/common/gemm/cublaslt_gemm.cu

    return true;
  };

  auto all_groups_uniform_k128 = [&](const NVTETensor *p, bool trans) -> bool {


Unused on ROCm

Right, would you like me to #ifdef this function out (it is coming from upstream)?

ipanfilo · 2026-02-21T17:26:59Z

transformer_engine/common/gemm/cublaslt_gemm.cu

+    auto A_dt = inputA->data.dtype;
+    auto B_dt = inputB->data.dtype;
+    auto D_dt = OutputD->data.dtype;
+    return (A_dt == B_dt) && (A_dt == D_dt) &&


Are CK tile constraints the same as CUTLASS?

In terms of supported data types (which this function handles), yes - only bf16/fp16 are supported.
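The data type constraint described in this reply can be sketched as a small predicate. This is only an illustrative Python model, not the code from the PR: the names `GemmOperands`, `SUPPORTED_DTYPES`, `dtypes_supported`, and `all_groups_supported` are hypothetical, and the dtype strings stand in for the real dtype enum.

```python
from dataclasses import dataclass

# Assumption: per the reply above, only bf16 and fp16 are supported.
SUPPORTED_DTYPES = {"bf16", "fp16"}


@dataclass
class GemmOperands:
    """Hypothetical stand-in for the dtypes of one (A, B, D) group."""
    a_dtype: str
    b_dtype: str
    d_dtype: str


def dtypes_supported(op: GemmOperands) -> bool:
    """A, B, and D must share one dtype, and that dtype must be supported."""
    same = op.a_dtype == op.b_dtype and op.a_dtype == op.d_dtype
    return same and op.a_dtype in SUPPORTED_DTYPES


def all_groups_supported(groups: list[GemmOperands]) -> bool:
    """The grouped path is eligible only if every group passes the check."""
    return all(dtypes_supported(g) for g in groups)
```

In this model a single mixed-dtype or unsupported group makes the whole batch ineligible, so a caller would fall back to another backend for that call.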

ipanfilo · 2026-02-21T17:38:17Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+  }
+
+  // Normalize similar to upstream
+  // See https://github.com/NVIDIA/TransformerEngine/blob/59f6f3876767d07045152bfae07b5dd4c54e1725/transformer_engine/common/gemm/cutlass_grouped_gemm.cu#L54-L68


There is no similar code in the referenced upstream file. Also, can you explain why transA_use = transB and vice versa?

In the referenced upstream code, the same swap is performed. For example, consider the following case in the upstream code:

    } else if (!transb && transa) {
      grouped_gemm::CutlassGroupedGemm<false, true, T>(B, A, D, workspace, alpha, beta,
                                                       num_gemms, stream, device, math_sm_count);

Here, transa==true and transb==false, but they get passed into the template as transa==false and transb==true, and A and B are swapped in the function call itself.

My best understanding regarding why the swap needs to be performed is that it matches the BLAS semantics regarding column-major storage (see e.g. https://rocm.docs.amd.com/projects/rocBLAS/en/latest/conceptual/rocblas-design-notes.html#column-major-storage-and-1-based-indexing).
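The column-major argument above can be checked with a small, self-contained Python sketch. It assumes the caller uses BLAS-style column-major semantics and the kernel is row-major; all function names here are hypothetical and none of this is taken from the PR. The same flat buffer read as row-major is the transpose of what it is read as column-major, so D^T = op(B)^T op(A)^T lets a row-major kernel produce the column-major result when the operands are swapped and each transpose flag travels with its operand.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def view_col_major(buf, rows, cols):
    # Element (r, c) lives at buf[c * rows + r].
    return [[buf[c * rows + r] for c in range(cols)] for r in range(rows)]


def view_row_major(buf, rows, cols):
    # Element (r, c) lives at buf[r * cols + c].
    return [buf[r * cols:(r + 1) * cols] for r in range(rows)]


def gemm_col_major(a_buf, a_shape, trans_a, b_buf, b_shape, trans_b):
    """Reference: D = op(A) @ op(B), all buffers column-major."""
    a = view_col_major(a_buf, *a_shape)
    b = view_col_major(b_buf, *b_shape)
    a = transpose(a) if trans_a else a
    b = transpose(b) if trans_b else b
    d = matmul(a, b)
    return [d[r][c] for c in range(len(d[0])) for r in range(len(d))]


def gemm_row_major(x_buf, x_shape, trans_x, y_buf, y_shape, trans_y):
    """Stand-in for a row-major kernel: D = op(X) @ op(Y)."""
    x = view_row_major(x_buf, *x_shape)
    y = view_row_major(y_buf, *y_shape)
    x = transpose(x) if trans_x else x
    y = transpose(y) if trans_y else y
    d = matmul(x, y)
    return [v for row in d for v in row]


def gemm_col_major_via_row_major(a_buf, a_shape, trans_a, b_buf, b_shape, trans_b):
    """Swap operands; each flag follows its operand; shapes are reversed."""
    return gemm_row_major(b_buf, b_shape[::-1], trans_b,
                          a_buf, a_shape[::-1], trans_a)


def swap_identity_holds(m, k, n):
    """Compare both paths for all four (trans_a, trans_b) combinations."""
    a_buf = list(range(1, m * k + 1))
    b_buf = list(range(101, 101 + k * n))
    for trans_a in (False, True):
        for trans_b in (False, True):
            a_shape = (k, m) if trans_a else (m, k)  # column-major shapes
            b_shape = (n, k) if trans_b else (k, n)
            ref = gemm_col_major(a_buf, a_shape, trans_a, b_buf, b_shape, trans_b)
            got = gemm_col_major_via_row_major(a_buf, a_shape, trans_a,
                                               b_buf, b_shape, trans_b)
            if ref != got:
                return False
    return True
```

For the case quoted above (transa == true, transb == false), the wrapper calls the row-major routine with B first and untransposed, then A second and transposed, which mirrors the `<false, true, T>(B, A, ...)` call.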

ipanfilo · 2026-02-21T20:05:14Z

transformer_engine/common/gemm/ck_grouped_gemm.cpp

+                                    size_t workspace_bytes,
+                                    hipStream_t stream) {
+
+// FIXME: This could be a templated lambda function in C++20.


As an alternative, dispatch_grouped can be folded into ck_tile_grouped_gemm by using nested TRANSFORMER_ENGINE_SWITCH_CONDITION.

What do you think of 6d85088?

matthiasdiener added 16 commits December 9, 2025 17:01

GEMM reference HIP implementation

ad748da

blockwise amax

11e090b

Merge branch 'dev' into compute-ref-offload

9006224

Change to use Tensor arguments, combine mxfp8/non-mxfp8 paths

3ecea7f

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

cafee59

skip on SwizzleScale limitation on gfx950

86fbbac

Revert "skip on SwizzleScale limitation on gfx950"

54de3db

This reverts commit 86fbbac.

MXFP8 fix

311ddfe

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

306e432

correct scale_inv packing and exp2(biased−127) conversion

445e64f

cleanups

462945f

Merge branch 'dev' into compute-ref-offload

e32fb3d

Merge remote-tracking branch 'origin/dev' into compute-ref-offload

7bf8adb

use Tensor class for more device objects

e11e400

Pass D Tensor into run_reference and move RefD allocation into Perfor…

325ece6

…mTest

[WIP] proof-of-concept: grouped GEMM with ck_tile

fc64b8c

matthiasdiener self-assigned this Jan 28, 2026

matthiasdiener added 3 commits January 28, 2026 09:51

Merge branch 'dev' into ck-grouped-gemm

134b350

restructure and enable tests

9091e6c

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

7435062

matthiasdiener changed the title ~~[WIP] proof-of-concept: grouped GEMM with ck_tile~~ [WIP] Grouped GEMM with ck_tile Jan 29, 2026

matthiasdiener added 2 commits January 30, 2026 14:09

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

a00a1c8

grid improvements

4e9ead9

wangye805 requested changes Feb 2, 2026

restructure

259645c

wenchenvincent requested a review from aris134 February 4, 2026 17:04

matthiasdiener added 4 commits February 4, 2026 15:41

reduce code duplication & simplify

9986bd4

make the code more similar to nv, check empty gelu/bias

355ec2f

Merge branch 'dev' into ck-grouped-gemm

df5e3ea

further simplify & make closer to nv

a42f7ca

matthiasdiener added 3 commits February 4, 2026 17:07

add ck_tile reference

fac7c11

rename in error messages

71b97e0

allow flattened higher-D tensors

dd3ed2f

aris134 approved these changes Feb 5, 2026

matthiasdiener added 2 commits February 5, 2026 12:49

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

7b0413e

relax tolerance on gfx942

ebc005f

matthiasdiener force-pushed the ck-grouped-gemm branch from 2095d3f to ebc005f Compare February 5, 2026 19:07

matthiasdiener added 2 commits February 5, 2026 14:53

enable more tests

c0bf502

return early when num_gemms<=0

0b16287

matthiasdiener force-pushed the ck-grouped-gemm branch from d1ab38e to 0b16287 Compare February 5, 2026 21:03

simplify normalization

58b34e7

matthiasdiener requested a review from wangye805 February 5, 2026 23:28

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

74f229a

matthiasdiener changed the title ~~[WIP] Grouped GEMM with ck_tile~~ Grouped GEMM with ck_tile Feb 11, 2026

matthiasdiener added 6 commits February 11, 2026 13:04

run hipblaslt for num_gemms==1

e28c801

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

6151b96

disable ck_tile when accumulate=true

5c57d47

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

29d6ab7

Merge remote-tracking branch 'origin/dev' into ck-grouped-gemm

6e9aae4

remove test file

2e844d9

matthiasdiener marked this pull request as ready for review February 17, 2026 22:58

matthiasdiener requested review from ipanfilo and wenchenvincent as code owners February 17, 2026 22:58

Merge branch 'dev' into ck-grouped-gemm

4aa8229

ipanfilo requested changes Feb 23, 2026

matthiasdiener added 4 commits February 23, 2026 12:43

fix copyright header

f680d6a

simplify calls in dispatch_grouped

6d85088

remove is_mi3*0_class

7910038

disable unused constants

e8ebb0e

matthiasdiener requested a review from ipanfilo February 23, 2026 22:53
