
CUDA: re-use MLA K data for V in MMA FA #19057

Merged
JohannesGaessler merged 1 commit into ggml-org:master from JohannesGaessler:cuda-fa-v-is-k
Jan 24, 2026

Conversation

@JohannesGaessler
Collaborator

Follow-up to #18986.

This PR re-enables a performance optimization in the CUDA MMA FlashAttention kernel that re-uses part of the K data for V.
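
For background (context, not part of the PR description): with MLA the V cache is a view into the K cache, so the FlashAttention kernel can reuse K tiles it has already loaded instead of fetching V separately. Below is a minimal host-side sketch of how such a condition might be detected from ggml tensor metadata; the helper name is hypothetical and the actual check in the PR may differ.

```cpp
#include "ggml.h"

// Hypothetical helper (illustration only): V can reuse K's data when both
// tensors alias the same underlying allocation and V's row size does not
// exceed K's, as with MLA where V is the leading part of each K row
// (e.g. 512 of 576 elements per row for DeepSeek2).
static bool fattn_v_is_k_view(const struct ggml_tensor * K, const struct ggml_tensor * V) {
    const struct ggml_tensor * K_src = K->view_src ? K->view_src : K;
    const struct ggml_tensor * V_src = V->view_src ? V->view_src : V;
    return K_src == V_src && V->ne[0] <= K->ne[0];
}
```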

Performance
| GPU | Model | Microbatch size | Test | t/s b7818 | t/s f5cfe16 | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 3090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 140.61 | 156.74 | 1.11 |
| RTX 3090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 158.14 | 163.31 | 1.03 |
| RTX 3090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 251.75 | 258.88 | 1.03 |
| RTX 3090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 383.29 | 399.11 | 1.04 |
| RTX 3090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 554.66 | 587.24 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 728.44 | 770.71 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 877.92 | 946.42 | 1.08 |
| RTX 3090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 971.13 | 1033.12 | 1.06 |
| RTX 3090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 1016.54 | 1060.38 | 1.04 |
| RTX 3090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 1110.34 | 1206.51 | 1.09 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 43.77 | 52.35 | 1.20 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 63.96 | 72.39 | 1.13 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 106.03 | 122.28 | 1.15 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 189.30 | 197.41 | 1.04 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 297.97 | 308.39 | 1.03 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 368.21 | 373.97 | 1.02 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 407.80 | 429.07 | 1.05 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 431.24 | 458.08 | 1.06 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 524.06 | 539.69 | 1.03 |
| RTX 3090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 598.11 | 644.65 | 1.08 |
| RTX 4090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 182.18 | 183.62 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 221.62 | 223.21 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 369.57 | 372.53 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 617.32 | 626.49 | 1.01 |
| RTX 4090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 965.72 | 988.45 | 1.02 |
| RTX 4090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 1397.13 | 1447.24 | 1.04 |
| RTX 4090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 1873.01 | 1958.05 | 1.05 |
| RTX 4090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 2158.83 | 2270.64 | 1.05 |
| RTX 4090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 2605.70 | 2790.26 | 1.07 |
| RTX 4090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 2850.17 | 3088.95 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 81.98 | 99.01 | 1.21 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 113.95 | 129.08 | 1.13 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 196.74 | 212.86 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 333.54 | 344.98 | 1.03 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 551.75 | 566.78 | 1.03 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 760.50 | 792.06 | 1.04 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 912.59 | 962.60 | 1.05 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 1010.51 | 1074.58 | 1.06 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 1235.87 | 1330.60 | 1.08 |
| RTX 4090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 1358.25 | 1465.94 | 1.08 |
| RTX 5090 | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 180.70 | 183.59 | 1.02 |
| RTX 5090 | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 210.08 | 211.61 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 376.96 | 379.17 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 692.17 | 700.43 | 1.01 |
| RTX 5090 | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 1041.68 | 1063.63 | 1.02 |
| RTX 5090 | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 1601.00 | 1649.08 | 1.03 |
| RTX 5090 | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 2152.07 | 2230.35 | 1.04 |
| RTX 5090 | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 2576.08 | 2698.83 | 1.05 |
| RTX 5090 | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 3196.82 | 3386.39 | 1.06 |
| RTX 5090 | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 3603.42 | 3851.28 | 1.07 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 103.95 | 126.15 | 1.21 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 116.74 | 129.73 | 1.11 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 202.69 | 219.00 | 1.08 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 354.23 | 363.73 | 1.03 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 566.62 | 578.24 | 1.02 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 799.68 | 825.30 | 1.03 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 1007.89 | 1048.39 | 1.04 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 1144.56 | 1195.92 | 1.04 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 1473.16 | 1558.72 | 1.06 |
| RTX 5090 | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 1695.63 | 1810.36 | 1.07 |
| RX 9060 XT | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 50.39 | 53.05 | 1.05 |
| RX 9060 XT | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 48.93 | 58.73 | 1.20 |
| RX 9060 XT | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 70.12 | 75.67 | 1.08 |
| RX 9060 XT | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 85.39 | 93.94 | 1.10 |
| RX 9060 XT | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 99.75 | 110.83 | 1.11 |
| RX 9060 XT | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 107.35 | 120.25 | 1.12 |
| RX 9060 XT | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 161.86 | 167.05 | 1.03 |
| RX 9060 XT | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 169.54 | 177.07 | 1.04 |
| RX 9060 XT | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 180.34 | 188.13 | 1.04 |
| RX 9060 XT | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 187.45 | 193.32 | 1.03 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 1 | pp512@d32768 | 90.18 | 89.34 | 0.99 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 2 | pp512@d32768 | 84.90 | 81.85 | 0.96 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 4 | pp512@d32768 | 132.86 | 130.78 | 0.98 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 8 | pp512@d32768 | 198.58 | 195.39 | 0.98 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 16 | pp512@d32768 | 260.22 | 260.29 | 1.00 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 32 | pp512@d32768 | 316.55 | 321.37 | 1.02 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 64 | pp512@d32768 | 304.07 | 309.34 | 1.02 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 128 | pp512@d32768 | 356.90 | 370.65 | 1.04 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 256 | pp512@d32768 | 436.12 | 435.77 | 1.00 |
| V100-PCIE-32GB | deepseek2 16B Q4_0 | 512 | pp512@d32768 | 476.38 | 458.96 | 0.96 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 1 | pp512@d32768 | 42.75 | 42.54 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 2 | pp512@d32768 | 44.32 | 44.19 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 4 | pp512@d32768 | 59.97 | 59.74 | 1.00 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 8 | pp512@d32768 | 87.44 | 85.55 | 0.98 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 16 | pp512@d32768 | 129.76 | 130.59 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 32 | pp512@d32768 | 150.99 | 152.57 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 64 | pp512@d32768 | 156.93 | 159.18 | 1.01 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 128 | pp512@d32768 | 191.49 | 198.49 | 1.04 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 256 | pp512@d32768 | 222.48 | 237.37 | 1.07 |
| V100-PCIE-32GB | deepseek2 ?B Q2_K_M | 512 | pp512@d32768 | 240.26 | 258.10 | 1.07 |

The github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 23, 2026.
@ggerganov
Member

Did some tests on DGX Spark and it works correctly.

Comment on lines +274 to +276
```cpp
if (!V_is_K_view) {
    return BEST_FATTN_KERNEL_NONE;
}
```
@ggerganov
Member

For my understanding, what prevents this path from running correctly? We likely won't ever need it - I'm just asking, from an implementation perspective, what the limitation is.

@JohannesGaessler
Collaborator Author

There is no limitation; I'm just trying to cut out unused template specializations, since MMQ and FA are the biggest contributors to compilation time and binary size in the CUDA backend.
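
To make that tradeoff concrete, here is a sketch (illustrative names and head sizes, except for BEST_FATTN_KERNEL_NONE, which appears in the snippet above): shapes that the selection logic rejects up front never need a compiled template specialization, which is where the compile-time and binary-size savings come from.

```cpp
// Sketch only: the real selection logic in the CUDA backend is more involved.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE,
    BEST_FATTN_KERNEL_MMA,
};

// One compiled specialization per (K head size, V head size) pair that is reachable.
template <int DKQ, int DV>
void launch_fattn_mma();

static best_fattn_kernel select_fattn_kernel(int dkq, int dv, bool v_is_k_view) {
    if (dkq == 576 && dv == 512) {
        // MLA shapes: only supported when V re-uses the K data, so the
        // non-view specialization never has to be instantiated at all.
        return v_is_k_view ? BEST_FATTN_KERNEL_MMA : BEST_FATTN_KERNEL_NONE;
    }
    if (dkq == dv && (dkq == 64 || dkq == 128)) { // illustrative subset of head sizes
        return BEST_FATTN_KERNEL_MMA;
    }
    return BEST_FATTN_KERNEL_NONE;
}
```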

@JohannesGaessler merged commit 8f91ca5 into ggml-org:master on Jan 24, 2026 (75 of 78 checks passed).
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jan 24, 2026
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Jan 24, 2026