
Conversation

@umangyadav (Member) commented Jan 6, 2026

Motivation

Each thread writes a dPerThread x kPerThread (or kPerThread x dPerThread) slice of the thread tile.

copyKPerThread doesn't necessarily have to be the same as kPack.

The LDS layout is [kPackPerBlock, dPerBlock, kPack].

If copyKPerThread < kPack, then consecutive threads have to combine their writes to fill one kPack vector. In this case the D dimension is not written contiguously and therefore cannot be vectorized.

The current logic prefers "vectorization" over bank conflict resolution. But when copyKPerThread < kPack it neither vectorizes nor rotates to avoid bank conflicts.
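To make the layout concrete, here is a minimal sketch (illustrative only, not rocMLIR code; the function `ldsOffset` and the parameter values are assumptions) of where an element at (kOuter, d, kInner) lands in a flat [kPackPerBlock, dPerBlock, kPack] buffer. Stepping d moves the offset by kPack elements, so a thread's D-direction writes are only contiguous in LDS if that thread also owns the full kPack, i.e. copyKPerThread >= kPack.

#include <cstdint>
#include <cstdio>

// Illustrative sketch only (not rocMLIR code): flat element offset into an
// LDS buffer laid out as [kPackPerBlock, dPerBlock, kPack]. kOuter selects
// the kPack slice, d is the M/N coordinate, kInner indexes within one kPack.
static int64_t ldsOffset(int64_t kOuter, int64_t d, int64_t kInner,
                         int64_t dPerBlock, int64_t kPack) {
  return (kOuter * dPerBlock + d) * kPack + kInner;
}

int main() {
  const int64_t dPerBlock = 96, kPack = 4; // assumed values, matching the example below
  // Stepping d by 1 moves the offset by kPack elements: D is not contiguous.
  std::printf("d=0 -> %lld, d=1 -> %lld\n",
              (long long)ldsOffset(0, 0, 0, dPerBlock, kPack),
              (long long)ldsOffset(0, 1, 0, dPerBlock, kPack));
  // Stepping kInner by 1 moves the offset by 1 element: only the kPack
  // dimension is contiguous, and a single thread fills it only when
  // copyKPerThread >= kPack.
  std::printf("kInner=0 -> %lld, kInner=1 -> %lld\n",
              (long long)ldsOffset(0, 0, 0, dPerBlock, kPack),
              (long long)ldsOffset(0, 0, 1, dPerBlock, kPack));
  return 0;
}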

For example:
rocmlir-gen --arch gfx942 -t f16 -out_datatype f16 -g 1 -m 1500 -n 1152 -k 896 --perf_config="v4:96,256,8,96,32,32,4,1,2,0,0,4,1,1" --operation gemm --num_cu 80

For this case, kPack = 4, and each thread loads a d x k = 3 x 2 thread tile from global memory.

In LDS, the data therefore ends up as shown in the table below (see "Data Table"). The table shows only the first 32 threads; B0, B1, ... are the bank mappings. Each group of 4 consecutive columns forms one kPack. This is only for representation purposes; the actual LDS layout is [kPackPerBlock, mPerBlock, kPack].

Notice that thread 0 and thread 1 both write to the same kPack vector (for example, at m = 0 thread 0 writes k = 0, 1 and thread 1 writes k = 2, 3). The existing logic still sets the vectorization flag, but vectorization cannot actually happen in this case:

bool isKContiguousDim = vecDimInfo.vectorDim == GemmDimension::K;
// If kpack is less than the hardware max vector length, and we are
// writing more contiguous kpack elements, there is a possibility to
// vectorize that we want to preserve (i.e., we favour vectorization over
// bank conflicts resolution)
bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1);
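Plugging the example's values into that check (a sketch; maxVlen = 8 here is an assumed value for f16 with 128-bit accesses, the other values come from the perf config above) shows the flag being set even though the D writes cannot be vectorized, and how the extra copyKPerThread >= kpack condition from this PR turns it off:

#include <cstdio>

int main() {
  // Values from the example: kPack = 4, per-thread tile d x k = 3 x 2.
  // maxVlen = 8 is an assumed value (f16 with 128-bit accesses).
  const int kpack = 4, maxVlen = 8, copyDPerThread = 3, copyKPerThread = 2;

  // Existing check: claims the D writes can be vectorized.
  bool before = kpack < maxVlen && copyDPerThread > 1;   // true
  // With this PR: the thread must also own a full kPack vector.
  bool after = before && copyKPerThread >= kpack;        // false

  std::printf("before: %d, after: %d\n", before, after);
  return 0;
}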

Data Table

m k=0 k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 k=13 k=14 k=15 k=16 k=17 k=18 k=19 k=20 k=21 k=22 k=23 k=24 k=25 k=26 k=27 k=28 k=29 k=30 k=31
0 T0 B0 T0 B0 T1 B1 T1 B1 T2 B0 T2 B0 T3 B1 T3 B1 T4 B0 T4 B0 T5 B1 T5 B1 T6 B0 T6 B0 T7 B1 T7 B1 T8 B0 T8 B0 T9 B1 T9 B1 T10 B0 T10 B0 T11 B1 T11 B1 T12 B0 T12 B0 T13 B1 T13 B1 T14 B0 T14 B0 T15 B1 T15 B1
1 T0 B2 T0 B2 T1 B3 T1 B3 T2 B2 T2 B2 T3 B3 T3 B3 T4 B2 T4 B2 T5 B3 T5 B3 T6 B2 T6 B2 T7 B3 T7 B3 T8 B2 T8 B2 T9 B3 T9 B3 T10 B2 T10 B2 T11 B3 T11 B3 T12 B2 T12 B2 T13 B3 T13 B3 T14 B2 T14 B2 T15 B3 T15 B3
2 T0 B4 T0 B4 T1 B5 T1 B5 T2 B4 T2 B4 T3 B5 T3 B5 T4 B4 T4 B4 T5 B5 T5 B5 T6 B4 T6 B4 T7 B5 T7 B5 T8 B4 T8 B4 T9 B5 T9 B5 T10 B4 T10 B4 T11 B5 T11 B5 T12 B4 T12 B4 T13 B5 T13 B5 T14 B4 T14 B4 T15 B5 T15 B5
3 T16 B0 T16 B0 T17 B1 T17 B1 T18 B0 T18 B0 T19 B1 T19 B1 T20 B0 T20 B0 T21 B1 T21 B1 T22 B0 T22 B0 T23 B1 T23 B1 T24 B0 T24 B0 T25 B1 T25 B1 T26 B0 T26 B0 T27 B1 T27 B1 T28 B0 T28 B0 T29 B1 T29 B1 T30 B0 T30 B0 T31 B1 T31 B1
4 T16 B6 T16 B6 T17 B7 T17 B7 T18 B6 T18 B6 T19 B7 T19 B7 T20 B6 T20 B6 T21 B7 T21 B7 T22 B6 T22 B6 T23 B7 T23 B7 T24 B6 T24 B6 T25 B7 T25 B7 T26 B6 T26 B6 T27 B7 T27 B7 T28 B6 T28 B6 T29 B7 T29 B7 T30 B6 T30 B6 T31 B7 T31 B7
5 T16 B10 T16 B10 T17 B11 T17 B11 T18 B10 T18 B10 T19 B11 T19 B11 T20 B10 T20 B10 T21 B11 T21 B11 T22 B10 T22 B10 T23 B11 T23 B11 T24 B10 T24 B10 T25 B11 T25 B11 T26 B10 T26 B10 T27 B11 T27 B11 T28 B10 T28 B10 T29 B11 T29 B11 T30 B10 T30 B10 T31 B11 T31 B11

Because each thread writes 32 bits, the compiler generates ds_write2_b32. This ds_write instruction executes in two phases:
Phase 1: threads 0-31
Phase 2: threads 32-63

Looking at the bank mapping, you can see an 8-way bank conflict: T0, T2, T4, T6, T8, T10, T12, and T14 all map to B0.
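As a sanity check, here is a small sketch that reproduces the bank mapping of the table's first row (assumptions: the standard 32 banks x 4 bytes LDS model, 2-byte f16 elements, dPerBlock = 96 from the perf config, and thread t writing k = 2t, 2t + 1 at d = 0):

#include <cstdio>

int main() {
  // Assumed parameters (see lead-in); not taken from rocMLIR directly.
  const int dPerBlock = 96, kPack = 4, elemBytes = 2, numBanks = 32;

  // First row of the table: thread t writes k = 2t and 2t + 1 at d = 0,
  // i.e. kOuter = 2t / kPack and kInner = 2t % kPack.
  for (int t = 0; t < 16; ++t) {
    int kOuter = (2 * t) / kPack;
    int kInner = (2 * t) % kPack;
    int byteOffset = ((kOuter * dPerBlock + 0) * kPack + kInner) * elemBytes;
    int bank = (byteOffset / 4) % numBanks;
    std::printf("T%-2d -> B%d\n", t, bank);
  }
  // Output: every even thread lands in B0 and every odd thread in B1,
  // i.e. two 8-way conflicts within one 32-thread phase.
  return 0;
}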

Measuring bank conflicts with rocprof-compute, this kernel shows an LDS bank conflict rate of 0.49.

With this PR it drops to 0.07.

This doesn't always translate to better performance, though: I see the compiler selecting a single-access ds_write_b32 instead of ds_write2_b32. However, re-tuning selects another perfConfig with better final performance, going from 88 TFLOPS to 97 TFLOPS.

…efore it can't really be vectorized, in such cases we should rotate to avoid bank conflicts
@umangyadav umangyadav requested a review from causten as a code owner January 6, 2026 01:02
@umangyadav umangyadav self-assigned this Jan 6, 2026
@umangyadav umangyadav requested review from Copilot, justinrosner and pabloantoniom and removed request for causten January 6, 2026 01:05
Copilot AI (Contributor) left a comment


Pull request overview

This PR fixes the LDS vectorization logic for GEMM operations when copyKPerThread < kPack. The change ensures that bank conflict resolution is properly applied when vectorization is not actually possible due to non-contiguous memory writes.

  • Adds a check to verify that copyKPerThread >= kpack before allowing vectorization
  • Prevents false vectorization flags when consecutive threads cannot fill a complete kPack vector
  • Improves LDS bank conflict rate from 0.49 to 0.07 in the example case


@umangyadav umangyadav requested a review from dhernandez0 January 6, 2026 01:05
@dhernandez0 (Contributor) commented Jan 7, 2026

Please, can you provide a comparison of develop vs this branch, even if it is with greedy tuning for the tier1 GEMM list? We want to make sure we have no perf regressions.

  // bank conflicts resolution)
- bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1);
+ bool isPossibleToVectorizeD =
+     (kpack < maxVlen && copyDPerThread > 1) && (copyKPerThread >= kpack);
Contributor

can you add a comment to explain this?

Contributor

nit: no need to have two parentheses.
