CUDA: add gqa_ratio 4 for GLM 4.7 flash #18953
Conversation
|
Looks like there is a bug there, taking a look |
|
I would have thought the correct patch is this, but the results are wrong.

diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
index 9e98da95f..6cca5b2ec 100644
--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
@@ -510,7 +510,6 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
}
} else {
- static_assert(cols_per_warp != 8, "cols_per_warp == 8 not implemented");
#pragma unroll
for (int k_KQ_0 = k0_start; k_KQ_0 < k0_stop; k_KQ_0 += T_A_KQ::J) {
load_ldmatrix(Q_B[0], tile_Q + (threadIdx.y / np)*(T_B_KQ::I*stride_tile_Q) + k_KQ_0, stride_tile_Q);
@@ -522,14 +521,18 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
T_A_KQ K_A;
load_ldmatrix(K_A, tile_K + i_KQ_0*stride_tile_K + (k_KQ_0 - k0_start), stride_tile_K);
- // Wide version of KQ_C is column-major
+ if constexpr (cols_per_warp == 8) {
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
+ } else {
+ // Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
- // RDNA matrix C is column-major.
- mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
+ // RDNA matrix C is column-major.
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
#else
- // swap A and B for CUDA.
- mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
+ // swap A and B for CUDA.
+ mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
+ }
}
}
} |
|
I forgot that the tests in |
|
Addressed the rest of your comments. In general this model spews out nonsense with and without FA; I think that should be fixed with #18980
We can add the tests, but unless there is a technical blocker, it would be useful to support separate K and V data in the CUDA implementation as well. |
|
This patch is working well for me, combined with --override-kv deepseek2.expert_gating_func=int:2 to resolve most of the issues with GLM 4.7 Flash: 2100 tokens/sec prompt processing and 90 tokens/sec on an RTX 6000 Blackwell. Output looks overall pretty decent when using the Unsloth FP16 GGUF
|
@JohannesGaessler @am17an This PR should be OK to merge. I have outlined a plan for improving the implementation and fixing the tests in #18986. Let's continue there after we merge this.
|
In the CI, the following tests are failing; locally, I am able to get the tests to pass. So I'm thinking there's probably a bug in the tile kernel that we haven't fixed
|
I didn't touch any of the tests so it's expected that they are still failing. |
@am17an The tests should fail locally too. For me, they fail on DGX Spark. On which setup are they passing for you?
|
Btw, the Windows workflow fails to compile: https://github.com/ggml-org/llama.cpp/actions/runs/21195664178/job/60970976275?pr=18953 Any ideas?
|
The tests now fail for me too. Is the fix going to go in #18986? |
Yes, it needs some more work and will be merged after this PR. Regarding the Windows build - from the logs, I think this is the problematic part: llama.cpp/ggml/src/ggml-cuda/fattn-mma-f16.cuh, lines 447 to 456 in fe1703a. If |
|
Forgive me if this is covered in previous discussions and I'm not understanding, but I'm attempting to build this PR myself. I'm building my own Docker containers (with a bunch of other stuff bundled, hence why I'm not using the official ones) and finally building with
And running into a failure during the build after applying it:
|
The compilation problems have to do with Volta; I'll push a fix.
|
For Volta the minimum number of KQ columns is 32, so for the kernel templates with 8 and 16 columns the number of parallel warps per CUDA block was being calculated incorrectly. This tripped the static asserts even though that code should never actually be executed. I fixed the calculation of parallel warps to handle this edge case correctly.
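A minimal sketch of the idea described above, assuming illustrative names (parallel_warps, cols_per_block, volta_min_cols) rather than the actual llama.cpp template parameters: the warp count has to be derived from the column count that Volta actually uses, not from the raw template value.

```cpp
#include <algorithm>
#include <cstdio>

// Hedged sketch, not the actual llama.cpp code: derive the number of parallel
// warps per CUDA block from the *effective* column count. On Volta the kernel
// always processes at least volta_min_cols KQ columns, so the 8- and 16-column
// template configurations must be clamped before dividing, otherwise the warp
// count comes out wrong for code paths that are never actually executed.
constexpr int parallel_warps(int cols_per_block, int cols_per_warp, bool is_volta,
                             int volta_min_cols = 32) {
    const int effective_cols = is_volta ? std::max(cols_per_block, volta_min_cols)
                                        : cols_per_block;
    return effective_cols / cols_per_warp;
}

int main() {
    // With the clamp, an 8-column template instantiation on Volta still yields
    // a consistent warp count instead of tripping static asserts.
    std::printf("Volta,  8 cols per block: %d warps\n", parallel_warps(8, 8, true));
    std::printf("Ampere, 8 cols per block: %d warps\n", parallel_warps(8, 8, false));
}
```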
Awesome, glad I helped find something :) I'll give the updated branch a test shortly. |
|
Is -ctk q8_0 -ctv q8_0 broken too? For me it again uses only the CPU. Without those flags, everything is OK.
|
I forgot: I reverted the change to exclude kernels with 8 KQ columns. I meant in my review comment that this variant should be excluded if and only if there are problems with it (but those are now fixed). |
|
Perplexity over Wikitext is fundamentally the wrong metric for judging the quality of an instruct-tuned model. What you should look at when it comes to the impact of quantization is KL divergence vs. the full-precision model. For judging the quality in an absolute sense there currently just isn't good tooling in the llama.cpp ecosystem. |
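For reference, a minimal statement of that metric, with P the full-precision model's next-token distribution, Q the quantized model's, and V the vocabulary (symbols chosen here for illustration; in practice the value is averaged over all evaluated token positions):

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{V}} P(x)\,\log \frac{P(x)}{Q(x)}
$$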
|
@JohannesGaessler Good to merge? |
|
I don't know if it's worth mentioning, but with my P40 + 3090 setup, FA on for PP is half the speed of FA off. Freshly built from this PR:
P40 + 3090:
build: a10d87b (7786)
build: a10d87b (7786)

3090 Only:
build: a10d87b (7786)
build: a10d87b (7786)

P40 Only:
build: a10d87b (7786)
build: a10d87b (7786) |
JohannesGaessler left a comment
My general preference would be that we fix CI failures prior to a merge, but it's fine if we take care of it soon after.
|
The changes do not improve performance with a quantized KV cache, which still goes to the CPU. Is that to be expected?
build: b70d251 (7803) |
…tn-mma-f16 This reverts commit b70d251.
This enables MMA-based flash attention on RDNA3 GPUs (gfx1100/1101/1102) for models with head size 576, such as GLM-4.7-Flash and other MLA (Multi-head Latent Attention) models. Previously, flash attention with head size 576 only worked on CUDA (via PR ggml-org#18953) and RDNA4; RDNA3 users had to disable flash attention, resulting in ~3x slower inference.

Changes:
- fattn.cu: Route RDNA3 + head size 576 to MMA kernel (was RDNA4-only)
- fattn-mma-f16.cuh: Enable AMD WMMA for all RDNA3/RDNA4, allow DKQ==576
- mma.cuh: Add RDNA3 to make_identity_mat(), add f16->f16 WMMA intrinsic

Tested on AMD RX 7900 XTX (gfx1100) with GLM-4.7-Flash-REAP-23B:
- FA off: ~77 t/s
- FA on (before, broken): ~27 t/s
- FA on (after fix): ~83 t/s
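A minimal sketch of the dispatch change described in that commit message, with illustrative names (head_576_uses_mma, is_rdna3, is_rdna4) that are not the actual fattn.cu symbols:

```cpp
// Hedged sketch of the routing idea from the commit message above; the names
// here are illustrative, not llama.cpp symbols.
static bool head_576_uses_mma(bool is_rdna3, bool is_rdna4, int head_size_kq) {
    if (head_size_kq != 576) {
        return false;               // only the MLA head size is affected
    }
    // Before the change, only RDNA4 took the AMD MMA path for head size 576;
    // RDNA3 (gfx1100/1101/1102) now routes to the same kernel instead of
    // falling back to a much slower non-FA path.
    return is_rdna3 || is_rdna4;
}
```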
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash with K=576, V=512), we cannot simply reuse K's data pointer for V. For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)], so V data is the first DV elements of each K row.

This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)

Fixes GPU memory faults and incorrect results in backend tests for hsk=576,hsv=512 configurations. Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067 that implement V-less KV cache for MLA models like DeepSeek and GLM-4.7-Flash.

Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
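A small sketch of the layout described in that commit message, assuming a row-major, contiguous K buffer; the function name and parameters are illustrative, not the actual SYCL code:

```cpp
#include <cstddef>
#include <vector>

// Hedged sketch of the idea from the commit message above, not the actual SYCL
// implementation. For MLA models each K row is laid out as
// [kv_lora_scaled (DV elements), pe (DQK - DV elements)], so the V values for a
// row are the first DV elements of that K row, but at K's row stride (DQK)
// rather than a packed stride of DV. Treating K's buffer directly as a packed
// [n_rows x DV] V tensor is therefore wrong once DQK != DV.
static std::vector<float> extract_v_from_k(const float * k_data, // K buffer
                                            size_t n_rows,
                                            size_t dqk,           // e.g. 576
                                            size_t dv) {          // e.g. 512
    std::vector<float> v(n_rows * dv);
    for (size_t row = 0; row < n_rows; ++row) {
        const float * k_row = k_data + row * dqk;  // row stride is DQK, not DV
        for (size_t i = 0; i < dv; ++i) {
            v[row * dv + i] = k_row[i];            // first DV elements are V
        }
    }
    return v;
}
```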

Enable FA for GLM 4.7. I'm not sure it's optimal, but at least it does not go to the CPU. Fixes #18944