mla : make the V tensor a view of K #18986
Conversation
```cpp
// TODO: make this more generic by removing the notion of "MLA".
//       for example "is V a view of K?" so we can skip loading it.
//       V strides should be driven by V itself and avoid assumption of the data layout
const bool is_mla = V->op == GGML_OP_VIEW && V->src[0] == K;
```
@JohannesGaessler Let me know if this is clear. The proposal is that when V = ggml_view(K), the implementation can use this information - for example, to avoid extra loads of data. But technically, this is completely optional: the implementation can also just read from V directly.
This way, if the user code insists on passing different V data, that will also work. In that case V won't be a view of K, so it will be treated as a regular V tensor.
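To spell that out as a minimal sketch (this mirrors the check quoted above; the branches are only descriptive, not a specific backend's implementation):

```cpp
// The view relationship is a hint, not a requirement: reading V through its own
// pointer and strides must keep working in all cases.
const bool v_is_view_of_k = V->op == GGML_OP_VIEW && V->src[0] == K;

if (v_is_view_of_k) {
    // optional fast path: each V row aliases the first DV elements of the
    // corresponding K row, so data already loaded for K can be re-used for V
} else {
    // generic path: the user passed an independent V tensor - load it as usual
}
```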
You should adjust the kernel selection logic in fattn.cu to only use the MMA kernel if this boolean is true. The MMA kernel has the MLA-specific optimization of re-using the K data that was previously loaded for calculation of KQ as V data when calculating VKQ. But the tile kernel simply loads the data from the K and V pointers with no re-use. So for now I would suggest that we condition the use of the MMA kernel on this boolean and use the tile kernel as a fallback.
We could in principle compile multiple template specializations but currently they would be unused for real models.
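A rough sketch of the suggested dispatch (the launcher and helper names below are placeholders, not the actual functions in fattn.cu):

```cpp
// Use the MMA kernel only when V is a view of K - the case its MLA re-use path is
// written for - and fall back to the tile kernel otherwise, since the tile kernel
// loads K and V independently and works for an arbitrary V tensor.
const bool v_is_view_of_k = V->op == GGML_OP_VIEW && V->src[0] == K;

if (v_is_view_of_k && mma_kernel_supported) { // mma_kernel_supported: placeholder for the existing checks
    launch_fattn_mma(ctx, dst);               // placeholder: MLA-aware MMA kernel
} else {
    launch_fattn_tile(ctx, dst);              // placeholder: generic tile kernel fallback
}
```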
ggml/src/ggml-cuda/fattn-mma-f16.cuh (outdated)
```diff
 // Therefore, iterate over V in reverse and re-use the data if possible.
 static_assert(!mla || nstages <= 1, "combination of MLA and multi-stage loading not implemented");
-constexpr int reusable_cutoff = mla ? (DKQ - 1) - (DKQ - 1) % (2*nbatch_K2) - (DKQ - DV) : DV;
+constexpr int reusable_cutoff = mla ? (DV - 1) - (DV - 1) % (2*nbatch_K2) : DV;
```
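For reference, plugging in illustrative values shows how the two expressions can diverge (DKQ = 576, DV = 512, nbatch_K2 = 64 are assumptions, not taken from the kernel):

```cpp
constexpr int DKQ = 576, DV = 512, nbatch_K2 = 64;                              // illustrative only
constexpr int cutoff_old = (DKQ - 1) - (DKQ - 1) % (2*nbatch_K2) - (DKQ - DV);  // 575 - 63 - 64 = 448
constexpr int cutoff_new = (DV - 1) - (DV - 1) % (2*nbatch_K2);                 // 511 - 127 = 384
```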
I am not sure I completely understand the logic for re-using the K data, but to the best of my understanding this is the correct version now. An extra look is needed to confirm.
To clarify, on master the K/V data is laid out like this in memory:
```
# K - 16 cells of K-only data
# S - 16 cells of shared data between both K and V
# a single latent head of 576 cells
kkKKssssssssssssssssSSSSSSSSSSSSSSSS
```

In this PR, we change the layout so the K data is at the end of the row:
```
# K - 16 cells of K-only data
# S - 16 cells of shared data between both K and V
# a single latent head of 576 cells
ssssssssssssssssSSSSSSSSSSSSSSSSkkKK
```
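For a concrete sense of the change, using the DeepSeek-style dimensions mentioned elsewhere in this thread (512 shared + 64 K-only = 576; illustrative only):

```
old (master): [ pe: 64 K-only values | kv: 512 values shared by K and V ]
this PR:      [ kv: 512 values shared by K and V | pe: 64 K-only values ]
```

With the new layout, V is simply the first 512 values of each K row, i.e. a view of K with zero offset.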
If we change the memory layout on master the loop structure in the MMA kernel needs to be done differently. On master the loop for K is going forward, the one for V is going backward. But if the reusable data is at the beginning the loop for K should be backward and the loop for V should be forward.
Hm, I am not sure I understand. Is this just for performance reasons? The current implementation works correctly and passes the tests.
Ah, I think I know what you are thinking. I updated the comment to better reflect the layout. So to clarify: the order of the shared data is the same - it's just shifted forward. And the K-only data is also ordered the same - it's just moved to the end of the buffer.
The same transform is also applied to the Q data. So at the end, the dot products of QK and QKV are the same. Does that clarify?
The MMA kernel uses only the K pointer which was meant as an assist for a more general implementation of deduplicating the KV cache for DeepSeek that never materialized. But it is also re-using the K data that is already in SRAM to avoid having to load it again. For that the loop over V is done in reverse because on consumer GPUs there simply isn't enough SRAM to hold the entire tile of K at once. So yes, that part is only for better performance.
```diff
 // {n_embd_head_qk_rope + kv_lora_rank, n_head, n_tokens}
 // note: rope must go first for in-place context shifting in build_rope_shift()
-ggml_tensor * Qcur = ggml_concat(ctx0, q_pe, q_nope_absorbed, 0);
+ggml_tensor * Qcur = ggml_concat(ctx0, q_nope_absorbed, q_pe, 0);
```
A side-effect of this change is that build_rope_shift will no longer work correctly (it may need to use a view when shifting).
This should fix it:
```diff
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index fd9f97d52..2cc9efed8 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1614,10 +1614,10 @@ ggml_cgraph * llama_kv_cache::build_graph_shift(llm_graph_result * res, llama_co
         ggml_tensor * k =
             ggml_view_3d(ctx, layer.k,
-                    n_embd_head_k, n_head_kv, get_size()*n_stream,
+                    n_embd_head_k - hparams.n_lora_kv, n_head_kv, get_size()*n_stream,
                     ggml_row_size(layer.k->type, n_embd_head_k),
                     ggml_row_size(layer.k->type, n_embd_k_gqa),
-                    0);
+                    ggml_row_size(layer.k->type, hparams.n_lora_kv));

         ggml_tensor * cur = build_rope_shift(cparams, ctx, k, inp->k_shift, rope_factors, freq_base_l, freq_scale_l);
```
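To illustrate the intent of the view (the concrete numbers are DeepSeek-style assumptions, not part of the diff):

```cpp
// With the new [kv, pe] layout only the trailing RoPE ("pe") part of each K row is
// position-dependent, so only that slice needs to be rotated when shifting.
// Assuming n_embd_head_k = 576 and hparams.n_lora_kv = 512:
//   ne0    = n_embd_head_k - n_lora_kv = 576 - 512 = 64 elements per head
//   offset = ggml_row_size(layer.k->type, n_lora_kv)   // skips the shared "kv" part
```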
The rope shift logic for PLM and minicpm3 has been broken because it didn't take into account the "nope" portion of the K embeddings. This is now fixed with 69d4fd7.
Force-pushed from f07c65b to 3c51513, and then from 3c51513 to 69d4fd7.
This should be the minimum set of changes needed to stabilize the CI and allow further work on the MLA code paths.
tests/test-backend-ops.cpp (outdated)

```cpp
v = ggml_view_4d(ctx, k, hsv_padded, kv, nh, nr23[1], k->nb[1], k->nb[2], k->nb[3], 0);

// note: this is the currently assumed layout by the CUDA FA implementation
// however, this layout is problematic, because the offset can become very inconvenient for quantized KV types (i.e. not a multiple of 16)
```
I'm not sure what you mean here. Moving the K-only data to the front does not fix any issues with memory alignment. On master the check for memory alignment causes quantized KV cache to fall back to CPU, but that is a bug with the check, see #19023.
The comment is no longer relevant - I removed it.
For reference, the problem is that the Vulkan backend does not support views with offset % 16 != 0. At least when I tried, it asserted here:
llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp, lines 6084 to 6110 in 3c51513:
```cpp
static vk_subbuffer ggml_vk_tensor_subbuffer(
        const ggml_backend_vk_context * ctx, const ggml_tensor * tensor, bool allow_misalign = false) {

    vk_buffer buffer = nullptr;
    size_t offset = 0;
    if (ctx->device->uma) {
        ggml_vk_host_get(ctx->device, tensor->data, buffer, offset);
    }
    if (!buffer) {
        auto buf_ctx = (ggml_backend_vk_buffer_context *)tensor->buffer->context;
        buffer = buf_ctx->dev_buffer;
        offset = vk_tensor_offset(tensor) + tensor->view_offs;
    }
    GGML_ASSERT(buffer != nullptr);

    size_t size = ggml_nbytes(tensor);

    size_t misalign_bytes = offset & (ctx->device->properties.limits.minStorageBufferOffsetAlignment - 1);
    // The shader must support misaligned offsets when indexing into the buffer
    GGML_ASSERT(allow_misalign || misalign_bytes == 0);
    offset &= ~misalign_bytes;
    size += misalign_bytes;

    return vk_subbuffer{buffer, offset, size};
}
```
But this is no longer relevant, because we change the memory layout in this PR and we no longer need to offset the view.
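As a concrete example of the old problem (assuming DeepSeek-style dimensions and a 16-byte storage buffer alignment - both illustrative):

```cpp
// Old [pe, kv] layout: the V view started n_embd_head_qk_rope = 64 elements into each K row.
//   F16 cache : offset = 64 * 2 bytes                      = 128 bytes -> multiple of 16, OK
//   Q8_0 cache: offset = ggml_row_size(GGML_TYPE_Q8_0, 64) =  68 bytes -> not a multiple of 16,
//               which would trip the alignment assert shown above
// New [kv, pe] layout: the V view starts at byte offset 0, so no misalignment can occur.
```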
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash with K=576, V=512), we cannot simply reuse K's data pointer for V. For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)], so V data is the first DV elements of each K row.

This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)

Fixes GPU memory faults and incorrect results in backend tests for hsk=576, hsv=512 configurations. Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067 that implement a V-less KV cache for MLA models like DeepSeek and GLM-4.7-Flash.

Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
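A rough illustration of the relationship that commit message describes, in terms of the ggml tensor fields (not the actual SYCL code):

```cpp
// When V is a view of K with DV < DKQ (e.g. 512 vs 576), each V row is simply the
// first DV elements of the corresponding K row:
//   V->data     == K->data        (the view has zero byte offset)
//   V->nb[1..3] == K->nb[1..3]    (same strides; only fewer elements per row are used)
//   V->ne[0]    == DV,  K->ne[0] == DKQ
// A backend that keeps only K's pointer must therefore still respect DV when reading V.
```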
I have noticed a coherency regression with GLM-4-32B-0414 when using context shifting.
Could you provide logs from loading the model, with the model parameters (head size, n_rot, etc.)? It's not obvious why this model would break from the changes here.
Sure, here's a log of loading the model with Vulkan. To clarify: the inference itself is fine. The regression only happens after a KV shift is applied. If no shifting is performed, the inference both before and after this PR is normal. If shifting is performed, there is a large degradation observed after this specific PR.
Also, I suspect it's not just this old model that is affected but also GLM-4.7-Flash and possibly others. However, since GLM-4.7-Flash was going through many other unrelated fixes, I didn't want to use it as the test model.
Does it work correctly with a CPU-only run?
Surprisingly, pure CPU takes forever, but it works correctly! Shifting produces the expected results both before and after this PR with pure CPU, which feels even stranger. I did a few more tests:
- Pure CPU: shifting works fine. Super, super slow PP, but fully coherent.
- Vulkan with FA and 50/62 layers offloaded: very incoherent.
Trying without FA gives the same result - the degree of coherence is related to how many layers are offloaded.
Seems like an issue in the Vulkan backend - track #19296 for more info.
cont #18953

This is based on top of the changes in #18953.

Currently, the CUDA FA implementation has certain hardcoded assumptions about the layout of the K and V tensors when MLA is involved (see #13435). The goal of this change is to make things more generic and avoid these assumptions:

- Change the MLA K layout: the old layout is `[pe, kv]`, the new one is `[kv, pe]`. This has certain implications for backends such as Vulkan and overall it's a better layout in terms of memory alignment.
- Pass the `V` tensor as a view of `K`. This can be used as a signal for the CUDA (and later other) backends to avoid loading extra `V` data during compute. (The elimination of the `V` component from the `llama_kv_cache` will be done in a follow-up PR - for now it's just redundant data.)
- Currently these tests will still fail (for CUDA only) because we changed the layout (see the point above). This needs to be taken into account in the CUDA implementation. For more info, see the comments in `test-backend-ops.cpp`. (CUDA impl fixed in f07c65b)

Next PRs:

- Update `test-backend-ops` and avoid hardcoded logic
- Expand CUDA kernel coverage for quantized MLA cache and general improvement of the MLA-related logic (resolved in CUDA: fix alignment check for FA #19023)
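For completeness, a sketch of how a caller can construct such a `V` after this change (tensor and dimension names are placeholders; the updated test in `test-backend-ops.cpp` builds the view the same way with `ggml_view_4d`):

```cpp
// With the new [kv, pe] layout, the V data is the first DV elements of each K row,
// so the view re-uses K's buffer and strides and needs no byte offset.
ggml_tensor * v = ggml_view_4d(ctx, k,
        DV, k->ne[1], k->ne[2], k->ne[3],   // DV < DKQ: keep only the shared "kv" part
        k->nb[1], k->nb[2], k->nb[3], 0);   // same strides as K, zero offset

ggml_tensor * out = ggml_flash_attn_ext(ctx, q, k, v, mask, scale, 0.0f, 0.0f);
```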