
HIP: Enable MMA flash attention for RDNA3 with head size 576 #19063

Closed
linus-amg wants to merge 1 commit into ggml-org:master from linus-amg:hip-rdna3-mma-fattn-576

Conversation

@linus-amg

Summary

This PR enables MMA-based flash attention on RDNA3 GPUs (gfx1100/1101/1102) for models with head size 576, such as GLM-4.7-Flash and other MLA (Multi-head Latent Attention) models.

Previously, flash attention with head size 576 only worked on CUDA (via #18953) and RDNA4. RDNA3 users had to disable flash attention, resulting in ~3x slower inference.

Changes

  1. fattn.cu: Route RDNA3 + head size 576 to MMA kernel (was RDNA4-only)
  2. fattn-mma-f16.cuh:
    • Enable AMD WMMA guards for all RDNA3/RDNA4 (was RDNA4-only)
    • Allow DKQ == 576 in AMD path (was limited to ≤128)
  3. mma.cuh:
    • Add RDNA3 to make_identity_mat()
    • Add RDNA3 f16→f16 WMMA intrinsic with the correct 4-argument signature (see the sketch after this list)
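
For item 3, a minimal sketch of what the 4-argument RDNA3 builtin looks like in isolation; the typedef, wrapper name, and target guards below are illustrative assumptions, not the actual mma.cuh code:

```cpp
// Sketch only, assuming HIP/Clang in wave32 mode; not the actual ggml mma.cuh code.
typedef _Float16 half16 __attribute__((ext_vector_type(16)));

__device__ half16 wmma_f16_f16_rdna3(half16 a, half16 b, half16 c) {
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
    // The gfx11 (RDNA3) builtin takes 4 arguments: A, B, C and an opsel flag
    // that selects whether results are written to the low or high 16 bits of
    // each 32-bit VGPR slot. The gfx12 (RDNA4) counterpart takes 3 arguments
    // and returns a packed 8-element result instead.
    return __builtin_amdgcn_wmma_f16_16x16x16_f16_w32(a, b, c, /*opsel=*/false);
#else
    return c; // placeholder so the sketch compiles for non-RDNA3 targets
#endif
}
```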

Performance

Tested on AMD RX 7900 XTX (gfx1100) with GLM-4.7-Flash-REAP-23B-A3B:

| Configuration | Generation Speed |
| --- | --- |
| FA off (before) | ~77 t/s |
| FA on (before, broken) | ~27 t/s |
| FA on (after fix) | ~83 t/s |

Testing

  • Builds successfully with -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS="gfx1100"
  • GLM-4.7-Flash-REAP inference works with flash attention enabled
  • No regressions on standard head sizes (64, 128)

Related

  • #18953 (CUDA flash attention support for head size 576)
@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Jan 24, 2026
@linus-amg
Author

Closing this PR: the RDNA3 f16→f16 WMMA implementation produces incorrect output because its unpacked output format is incompatible with the tile structure the MMA kernel expects. RDNA3 continues to work correctly with the tile-based flash attention path instead of MMA. May revisit with a proper fix in the future.
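
To picture the mismatch: with opsel = false the RDNA3 f16 WMMA result lands in the low half of each 32-bit register slot, i.e. every other element of the 16-wide accumulator vector, while a packed layout expects consecutive elements. A minimal sketch under that assumption (the typedef and helper are illustrative, not the kernel code):

```cpp
// Sketch only; assumes opsel = false puts valid results in the even elements
// of the 16-wide accumulator, with the odd elements left untouched.
typedef _Float16 half16 __attribute__((ext_vector_type(16)));

__device__ void copy_acc_to_packed(const half16 acc, _Float16 *dst) {
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        dst[i] = acc[2*i];   // unpacked RDNA3 layout: take every other element
        // dst[i] = acc[i];  // a packed-layout assumption would read the
        //                   // untouched odd slots for half of the outputs
    }
}
```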

linus-amg closed this Jan 24, 2026
linus-amg deleted the hip-rdna3-mma-fattn-576 branch on January 24, 2026 at 14:23
