-
Notifications
You must be signed in to change notification settings - Fork 52
Fix LDS vectorization logic to allow for bank conflict resolution when kPack > kPerThread #2196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…efore it can't really be vectorized, in such cases we should rotate to avoid bank conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes the LDS vectorization logic for GEMM operations when kPack > kPerThread. The change ensures that bank conflict resolution is properly applied when vectorization is not actually possible due to non-contiguous memory writes.
- Adds a check to verify that
copyKPerThread >= kpackbefore allowing vectorization - Prevents false vectorization flags when consecutive threads cannot fill a complete
kPackvector - Improves LDS bank conflict rate from 0.49 to 0.07 in the example case
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
please, can you provide a comparison of develop vs this branch? even if it is with |
| // bank conflicts resolution) | ||
| bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1); | ||
| bool isPossibleToVectorizeD = | ||
| (kpack < maxVlen && copyDPerThread > 1) && (copyKPerThread >= kpack); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment to explain this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need to have two parentheses.
Motivation
Each thread is writing
dPerThread x kPerThreadorkPerThread x dPerThreadslice of threadTile.copyKPerThreaddoesn't necessarily have to be same askPack.LDS layout is
[kPackPerBlock, dPerblock, kPack].If
copyKPerThread < kPackthen consecutive threads will have to write enough values to fillkPack. in this caseDdimension is not being contiguously written and therefore can not be vectorized.Current logic prefers "vectorization" over bank conflict resolution. But in case of
copyKPerThread < kpackit is neither doing vectorization nor rotating to avoid bank conflicts.For example,
rocmlir-gen --arch gfx942 -t f16 -out_datatype f16 -g 1 -m 1500 -n 1152 -k 896 --perf_config="v4:96,256,8,96,32,32,4,1,2,0,0,4,1,1" --operation gemm --num_cu 80For this case, kPack = 4.
Each thread is loading
dxk = 3x2threadTile from global memory.Therefore in LDS, It appears like this following. This table shows only first 32 threads.
B0, B1..are bank mapping.Each consecutive 4 columns forms
kPack. This is only for representation purpose, inside LDS layout is actuallykPackPerBlock, mPerBlock, kPack].Notice that both thread1 and thread0 are writing to same
kPackvector.Existing logic will set vectorization flag but it can not happen in this case.
Data Table
Because each thread is writing 32 bits, compiler generates
ds_write2_b32. Thisds_writeinstruction happens in two phases.Phase 1: 0-31 threads
Phase 2 : 32-63 thread
Looking at bank mapping you can find that it has 8 way bank conflict : T0, T2, T4, T6, T8, T10, T12, T14 all are mapped to B0.
Measuring bank conflict with this rocprof-compute, this kernel shows LDS Bank conflicts of rate
0.49With this PR it drops to
0.07with this PR.This doesn't translate to better performance always though. I see compiler selecting single access
ds_write_b32instead ofds_write2_b32. But re-tuning selects some other perfConfig with better final performance, it goes from 88 TFLops to 97TFlops.