Conversation
The commit computes ...
GLM-4.7-Flash config.json does not include the deepseek2.attention.key_length_mla. Was hoping to squeeze my GLM model a bit more. 😿 Nevermind.
@John-Dekka These are ...
JohannesGaessler left a comment:
The CUDA changes are correct. The changes in the llama.cpp user code also seem correct to me, though I am not as familiar with that part of the codebase.
Will the ...
very cool

master: ...
PR: ...
Force-pushed from c843f3a to 6d7ce2e
    if (wo) {
        cur = build_lora_mm(wo, cur);
        if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE) {
            // GLM4 and GLM4_MOE seem to have numerical issues with half-precision accumulators
            ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
        }
    }
We might need to add LLM_ARCH_DEEPSEEK2 here in case GLM 4.7 Flash runs into similar numerical issues - something to keep in mind. cc @jeffbolznv @JohannesGaessler
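For concreteness, the change being floated would amount to something like the sketch below (illustrative only - whether the DEEPSEEK2 graph actually needs F32 accumulators here is exactly the open question):

    if (wo) {
        cur = build_lora_mm(wo, cur);
        // sketch: also force F32 accumulation for DEEPSEEK2-style graphs,
        // which GLM 4.7 Flash maps to; not confirmed to be necessary
        if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_DEEPSEEK2) {
            ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
        }
    }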
IIRC @ngxson tried this during our PR and it made no difference in his testing
At that time, the wrong gating function was used, so we can't draw conclusions from that test. Plus, this is somewhat backend-specific - e.g. it's not a problem for Metal, since we always accumulate in F32 there.
When V is a view of K but with different head dimensions (e.g., GLM-4.7-Flash with K=576, V=512), we cannot simply reuse K's data pointer for V. For MLA models, the K tensor layout is [kv_lora_scaled (DV), pe (DQK-DV)], so the V data is the first DV elements of each K row.

This fix extracts the correct V data from K when DQK != DV in:
- ggml_sycl_op_flash_attn_1 (basic FA path)
- ggml_sycl_op_flash_attn_coopmat (XMX path)
- ggml_sycl_op_flash_attn_mkl (oneMKL path)

Fixes GPU memory faults and incorrect results in backend tests for hsk=576, hsv=512 configurations. Aligns with upstream PRs ggml-org#18953, ggml-org#18986, ggml-org#19067 that implement the V-less KV cache for MLA models like DeepSeek and GLM-4.7-Flash.

Amp-Thread-ID: https://ampcode.com/threads/T-019bf97a-9105-718e-84fb-320913c5f0c6
Co-authored-by: Amp <amp@ampcode.com>
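For reference, the "first DV elements of each K row" layout can also be expressed with ggml views, roughly as follows (a sketch only, with illustrative variable names; the SYCL fix itself adjusts pointers and strides inside the FA kernels rather than building graph views):

    // k has shape [DQK, n_kv, n_head_kv], with DQK = 576 for GLM-4.7-Flash.
    const int64_t DV = 512;  // V head size
    // View the first DV elements of every K row as the V tensor, reusing
    // K's row and head strides; the V data starts at offset 0 of each row.
    struct ggml_tensor * v = ggml_view_3d(ctx, k,
            DV, k->ne[1], k->ne[2],
            k->nb[1], k->nb[2],
            0);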
cont #18986
Support a V-less KV cache. This is useful for MLA models such as DeepSeek and GLM 4.7 Flash, where the combined latent data is stored in the K cache. Results in almost 2x less memory for the KV cache.
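A rough back-of-the-envelope check, assuming the GLM-4.7-Flash head sizes quoted elsewhere in this thread (K = 576, V = 512) and ignoring metadata overhead:

    per token, per KV head:  K + V cache = 576 + 512 = 1088 values
    V-less cache                         = 576 values
    ratio                                = 1088 / 576 ≈ 1.89x

which is consistent with the "almost 2x" figure.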
Also:
- llama_hparams::is_mla()
- llama_hparams::n_embd_head_k_mla()
- llama_hparams::n_embd_head_v_mla()
- llama_hparams::get_n_embd_out() -> llama_hparams::n_embd_out()
- class llm_graph_input_attn_k - similar to class llm_graph_input_attn_kv, but only K data
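A hedged sketch of how graph-building code might consume the new hparams helpers listed above (illustrative only; the actual call sites in the PR may differ, and the helpers are assumed to behave as their names suggest):

    // Pick the per-head K/V sizes: MLA models use the latent dimensions.
    const int64_t n_embd_head_k = hparams.is_mla() ? hparams.n_embd_head_k_mla() : hparams.n_embd_head_k;
    const int64_t n_embd_head_v = hparams.is_mla() ? hparams.n_embd_head_v_mla() : hparams.n_embd_head_v;
    // With the V-less cache, only K rows (n_embd_head_k wide) are stored;
    // the V values are recovered as the first n_embd_head_v elements of each K row.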