
Conversation

@umangyadav (Member) commented Jan 6, 2026

Motivation

Each thread writes a dPerThread x kPerThread (or kPerThread x dPerThread) slice of the thread tile.

copyKPerThread doesn't necessarily have to be the same as kPack.

The LDS layout is [kPackPerBlock, dPerBlock, kPack].

If copyKPerThread < kPack, then consecutive threads have to combine their writes to fill one kPack vector. In this case the D dimension is not written contiguously and therefore cannot be vectorized.

The current logic prefers "vectorization" over bank conflict resolution. But when copyKPerThread < kPack it neither vectorizes nor rotates to avoid bank conflicts.
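To make the layout concrete, here is a minimal sketch (illustrative only, not rocMLIR code; the function `ldsOffset` and the parameter values are assumptions) of where an element at (kOuter, d, kInner) lands in a flat [kPackPerBlock, dPerBlock, kPack] buffer. Stepping d moves the offset by kPack elements, so a thread's D-direction writes are only contiguous in LDS if that thread also owns the full kPack, i.e. copyKPerThread >= kPack.

#include <cstdint>
#include <cstdio>

// Illustrative sketch only (not rocMLIR code): flat element offset into an
// LDS buffer laid out as [kPackPerBlock, dPerBlock, kPack]. kOuter selects
// the kPack slice, d is the M/N coordinate, kInner indexes within one kPack.
static int64_t ldsOffset(int64_t kOuter, int64_t d, int64_t kInner,
                         int64_t dPerBlock, int64_t kPack) {
  return (kOuter * dPerBlock + d) * kPack + kInner;
}

int main() {
  const int64_t dPerBlock = 96, kPack = 4; // assumed values, matching the example below
  // Stepping d by 1 moves the offset by kPack elements: D is not contiguous.
  std::printf("d=0 -> %lld, d=1 -> %lld\n",
              (long long)ldsOffset(0, 0, 0, dPerBlock, kPack),
              (long long)ldsOffset(0, 1, 0, dPerBlock, kPack));
  // Stepping kInner by 1 moves the offset by 1 element: only the kPack
  // dimension is contiguous, and a single thread fills it only when
  // copyKPerThread >= kPack.
  std::printf("kInner=0 -> %lld, kInner=1 -> %lld\n",
              (long long)ldsOffset(0, 0, 0, dPerBlock, kPack),
              (long long)ldsOffset(0, 0, 1, dPerBlock, kPack));
  return 0;
}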

For example:
rocmlir-gen --arch gfx942 -t f16 -out_datatype f16 -g 1 -m 1500 -n 1152 -k 896 --perf_config="v4:96,256,8,96,32,32,4,1,2,0,0,4,1,1" --operation gemm --num_cu 80

For this case, kPack = 4, and each thread loads a d x k = 3 x 2 thread tile from global memory.

In LDS, the data therefore ends up as shown in the table below (see "Data Table"). The table shows only the first 32 threads; B0, B1, ... are the bank mappings. Each group of 4 consecutive columns forms one kPack. This is only for representation purposes; the actual LDS layout is [kPackPerBlock, mPerBlock, kPack].

Notice that thread 0 and thread 1 both write to the same kPack vector (for example, at m = 0 thread 0 writes k = 0, 1 and thread 1 writes k = 2, 3). The existing logic still sets the vectorization flag, but vectorization cannot actually happen in this case:

bool isKContiguousDim = vecDimInfo.vectorDim == GemmDimension::K;
// If kpack is less than the hardware max vector length, and we are
// writing more contiguous kpack elements, there is a possibility to
// vectorize that we want to preserve (i.e., we favour vectorization over
// bank conflicts resolution)
bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1);
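Plugging the example's values into that check (a sketch; maxVlen = 8 here is an assumed value for f16 with 128-bit accesses, the other values come from the perf config above) shows the flag being set even though the D writes cannot be vectorized, and how the extra copyKPerThread >= kpack condition from this PR turns it off:

#include <cstdio>

int main() {
  // Values from the example: kPack = 4, per-thread tile d x k = 3 x 2.
  // maxVlen = 8 is an assumed value (f16 with 128-bit accesses).
  const int kpack = 4, maxVlen = 8, copyDPerThread = 3, copyKPerThread = 2;

  // Existing check: claims the D writes can be vectorized.
  bool before = kpack < maxVlen && copyDPerThread > 1;   // true
  // With this PR: the thread must also own a full kPack vector.
  bool after = before && copyKPerThread >= kpack;        // false

  std::printf("before: %d, after: %d\n", before, after);
  return 0;
}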

Data Table

m k=0 k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 k=13 k=14 k=15 k=16 k=17 k=18 k=19 k=20 k=21 k=22 k=23 k=24 k=25 k=26 k=27 k=28 k=29 k=30 k=31
0 T0 B0 T0 B0 T1 B1 T1 B1 T2 B0 T2 B0 T3 B1 T3 B1 T4 B0 T4 B0 T5 B1 T5 B1 T6 B0 T6 B0 T7 B1 T7 B1 T8 B0 T8 B0 T9 B1 T9 B1 T10 B0 T10 B0 T11 B1 T11 B1 T12 B0 T12 B0 T13 B1 T13 B1 T14 B0 T14 B0 T15 B1 T15 B1
1 T0 B2 T0 B2 T1 B3 T1 B3 T2 B2 T2 B2 T3 B3 T3 B3 T4 B2 T4 B2 T5 B3 T5 B3 T6 B2 T6 B2 T7 B3 T7 B3 T8 B2 T8 B2 T9 B3 T9 B3 T10 B2 T10 B2 T11 B3 T11 B3 T12 B2 T12 B2 T13 B3 T13 B3 T14 B2 T14 B2 T15 B3 T15 B3
2 T0 B4 T0 B4 T1 B5 T1 B5 T2 B4 T2 B4 T3 B5 T3 B5 T4 B4 T4 B4 T5 B5 T5 B5 T6 B4 T6 B4 T7 B5 T7 B5 T8 B4 T8 B4 T9 B5 T9 B5 T10 B4 T10 B4 T11 B5 T11 B5 T12 B4 T12 B4 T13 B5 T13 B5 T14 B4 T14 B4 T15 B5 T15 B5
3 T16 B0 T16 B0 T17 B1 T17 B1 T18 B0 T18 B0 T19 B1 T19 B1 T20 B0 T20 B0 T21 B1 T21 B1 T22 B0 T22 B0 T23 B1 T23 B1 T24 B0 T24 B0 T25 B1 T25 B1 T26 B0 T26 B0 T27 B1 T27 B1 T28 B0 T28 B0 T29 B1 T29 B1 T30 B0 T30 B0 T31 B1 T31 B1
4 T16 B6 T16 B6 T17 B7 T17 B7 T18 B6 T18 B6 T19 B7 T19 B7 T20 B6 T20 B6 T21 B7 T21 B7 T22 B6 T22 B6 T23 B7 T23 B7 T24 B6 T24 B6 T25 B7 T25 B7 T26 B6 T26 B6 T27 B7 T27 B7 T28 B6 T28 B6 T29 B7 T29 B7 T30 B6 T30 B6 T31 B7 T31 B7
5 T16 B10 T16 B10 T17 B11 T17 B11 T18 B10 T18 B10 T19 B11 T19 B11 T20 B10 T20 B10 T21 B11 T21 B11 T22 B10 T22 B10 T23 B11 T23 B11 T24 B10 T24 B10 T25 B11 T25 B11 T26 B10 T26 B10 T27 B11 T27 B11 T28 B10 T28 B10 T29 B11 T29 B11 T30 B10 T30 B10 T31 B11 T31 B11

Because each thread writes 32 bits, the compiler generates ds_write2_b32. This ds_write instruction executes in two phases:
Phase 1: threads 0-31
Phase 2: threads 32-63

Looking at the bank mapping, you can see an 8-way bank conflict: T0, T2, T4, T6, T8, T10, T12, and T14 all map to B0.
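As a sanity check, here is a small sketch that reproduces the bank mapping of the table's first row (assumptions: the standard 32 banks x 4 bytes LDS model, 2-byte f16 elements, dPerBlock = 96 from the perf config, and thread t writing k = 2t, 2t + 1 at d = 0):

#include <cstdio>

int main() {
  // Assumed parameters (see lead-in); not taken from rocMLIR directly.
  const int dPerBlock = 96, kPack = 4, elemBytes = 2, numBanks = 32;

  // First row of the table: thread t writes k = 2t and 2t + 1 at d = 0,
  // i.e. kOuter = 2t / kPack and kInner = 2t % kPack.
  for (int t = 0; t < 16; ++t) {
    int kOuter = (2 * t) / kPack;
    int kInner = (2 * t) % kPack;
    int byteOffset = ((kOuter * dPerBlock + 0) * kPack + kInner) * elemBytes;
    int bank = (byteOffset / 4) % numBanks;
    std::printf("T%-2d -> B%d\n", t, bank);
  }
  // Output: every even thread lands in B0 and every odd thread in B1,
  // i.e. two 8-way conflicts within one 32-thread phase.
  return 0;
}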

Measuring bank conflicts with rocprof-compute, this kernel shows an LDS bank conflict rate of 0.49.

With this PR it drops to 0.07.

This doesn't always translate to better performance, though: I see the compiler selecting a single-access ds_write_b32 instead of ds_write2_b32. However, re-tuning selects another perfConfig with better final performance, going from 88 TFLOPS to 97 TFLOPS.

…efore it can't really be vectorized, in such cases we should rotate to avoid bank conflicts
@umangyadav umangyadav requested a review from causten as a code owner January 6, 2026 01:02
@umangyadav umangyadav self-assigned this Jan 6, 2026
@umangyadav umangyadav requested review from Copilot, justinrosner and pabloantoniom and removed request for causten January 6, 2026 01:05
Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes the LDS vectorization logic for GEMM operations when copyKPerThread < kPack. The change ensures that bank conflict resolution is properly applied when vectorization is not actually possible due to non-contiguous memory writes.

  • Adds a check to verify that copyKPerThread >= kpack before allowing vectorization
  • Prevents false vectorization flags when consecutive threads cannot fill a complete kPack vector
  • Improves LDS bank conflict rate from 0.49 to 0.07 in the example case


@umangyadav umangyadav requested a review from dhernandez0 January 6, 2026 01:05
@dhernandez0 (Contributor) commented Jan 7, 2026

Please, can you provide a comparison of develop vs this branch, even if it is with greedy tuning for the tier1 GEMM list? We want to make sure we have no perf regressions.

  // bank conflicts resolution)
- bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1);
+ bool isPossibleToVectorizeD =
+     (kpack < maxVlen && copyDPerThread > 1) && (copyKPerThread >= kpack);
Contributor

can you add a comment to explain this?

Contributor

nit: no need to have two parentheses.
