
Conversation

@jeffbolznv (Collaborator)

This is stacked on #19281.

This allows the compiler to do a bit better at overlapping loads and math (e.g. the V load can start while the Q*K^T multiply is still happening). It's worth a couple of percent for coopmat2, less for coopmat1/scalar.
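
For illustration, here is a minimal C++ sketch of the restructuring, assuming a tiled attention loop. It is not the actual GLSL shader code; `Tile`, `load_tile`, `matmul`, and `fa_block` are hypothetical stand-ins for the shader's shared-memory loads and coopmat multiplies.

```cpp
#include <cstddef>

struct Tile {
    static constexpr int N = 16;
    float data[N * N] = {};
};

// Stand-in for a shared-memory tile load in the shader.
static Tile load_tile(const float *src, std::size_t off) {
    Tile t;
    for (int i = 0; i < Tile::N * Tile::N; ++i) {
        t.data[i] = src[off + i];
    }
    return t;
}

// Stand-in for a coopmat multiply.
static Tile matmul(const Tile &a, const Tile &b) {
    Tile c;
    for (int i = 0; i < Tile::N; ++i) {
        for (int j = 0; j < Tile::N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < Tile::N; ++k) {
                acc += a.data[i * Tile::N + k] * b.data[k * Tile::N + j];
            }
            c.data[i * Tile::N + j] = acc;
        }
    }
    return c;
}

// One KV block of the attention loop, restructured so the V load is issued
// before the Q*K^T multiply. On a CPU this ordering changes nothing, but in
// the shader it removes a dependency that forced the V load to wait.
static Tile fa_block(const float *K, const float *V, const Tile &Q,
                     std::size_t k_off, std::size_t v_off) {
    Tile k = load_tile(K, k_off);
    Tile v = load_tile(V, v_off);   // V load can now overlap the multiply below
    Tile s = matmul(Q, k);          // Q*K^T (softmax/scaling omitted)
    return matmul(s, v);
}
```

The point is purely about ordering: once the V load no longer sits after the Q*K^T multiply in program order, the shader compiler is free to issue it early and hide its latency behind the math.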

before

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8396.02 ± 81.63 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3221.30 ± 20.40 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       1989.70 ± 3.67 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1426.74 ± 1.67 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1102.89 ± 1.15 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |    10954.70 ± 173.33 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9356.51 ± 104.58 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8217.04 ± 75.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7279.99 ± 60.94 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6558.87 ± 43.19 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10510.00 ± 90.81 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7342.35 ± 87.09 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5577.50 ± 42.00 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4494.34 ± 18.09 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3753.69 ± 20.90 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4582.17 ± 17.28 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4057.73 ± 150.24 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3684.80 ± 115.32 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3375.87 ± 90.14 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3115.29 ± 82.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      12876.33 ± 7.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9110.98 ± 340.42 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7155.14 ± 252.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5736.91 ± 216.05 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4896.37 ± 190.76 |
```

after

```
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8383.60 ± 84.99 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3296.14 ± 17.41 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       2049.17 ± 2.63 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1477.91 ± 2.57 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1147.37 ± 2.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |    11072.72 ± 108.27 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9449.75 ± 55.99 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8265.75 ± 80.46 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7300.51 ± 42.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6568.90 ± 36.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10492.97 ± 97.52 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7406.26 ± 85.21 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5654.28 ± 49.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4561.42 ± 45.52 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3835.78 ± 22.02 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4587.46 ± 17.63 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4098.24 ± 139.56 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3749.66 ± 121.45 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     3472.26 ± 103.67 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3232.42 ± 92.36 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |    12715.10 ± 144.59 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9253.01 ± 138.89 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7211.87 ± 238.40 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5783.65 ± 236.97 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     5019.27 ± 187.56 |
```

Write out a 2-bit code per block and avoid loading the mask when a block matches one of the two common cases.

Apply this optimization when the mask is relatively large (i.e. during prompt processing).
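
The commit text doesn't spell out the two common cases here, but for a flash-attention mask the natural candidates are a block that is all zeroes (nothing masked) and one that is all -inf (fully masked). Below is a rough C++ sketch of a pre-pass under that assumption; `MaskCode`, `build_mask_codes`, and the packing layout are hypothetical, not the PR's actual code.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-bit codes; the real shader-side encoding may differ.
enum MaskCode : std::uint32_t {
    MASK_MIXED       = 0,  // block must be loaded and applied as usual
    MASK_ALL_ZERO    = 1,  // nothing masked: the mask load can be skipped
    MASK_ALL_NEG_INF = 2,  // fully masked: the block contributes nothing
};

// Pre-pass: classify each mask block and pack 2 bits per block.
std::vector<std::uint32_t> build_mask_codes(const float *mask,
                                            std::size_t n_blocks,
                                            std::size_t block_elems) {
    std::vector<std::uint32_t> codes((n_blocks * 2 + 31) / 32, 0);
    for (std::size_t b = 0; b < n_blocks; ++b) {
        bool all_zero = true;
        bool all_ninf = true;
        for (std::size_t i = 0; i < block_elems; ++i) {
            const float m = mask[b * block_elems + i];
            all_zero = all_zero && (m == 0.0f);
            all_ninf = all_ninf && std::isinf(m) && (m < 0.0f);
        }
        const std::uint32_t code = all_zero ? MASK_ALL_ZERO
                                 : all_ninf ? MASK_ALL_NEG_INF
                                            : MASK_MIXED;
        codes[(b * 2) / 32] |= code << ((b * 2) % 32);
    }
    return codes;
}
```

The attention kernel would then read two bits per block: on MASK_ALL_ZERO it skips the mask load (and the add), and on MASK_ALL_NEG_INF it could skip the block's contribution altogether. Checking the code is one small read per block, which is why this only pays off when the mask itself is relatively large, as in prompt processing.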
github-actions bot added the labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Feb 4, 2026.