Conversation
|
Since we are doing this from scratch, wouldn't it be better to remove the custom attention mask entirely and pass a list of KV cells used in each sequence? Considering our implementation of batching, I think we should be looking at implementing something closer to paged attention rather than flash attention. I suppose it is possible to convert the mask to a list of sequences in the kernels, but it would be less efficient. |
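To make the contrast concrete, here is a hypothetical sketch of the two ways a kernel could learn which KV cells each sequence attends to. The type names are invented for illustration and are not part of ggml or llama.cpp:

```cpp
#include <cstdint>
#include <vector>

// Dense KQ mask, as in the current proposal: one entry per (token, KV cell)
// pair, with -INF marking cells a token must not attend to. Uniform shape,
// GPU-friendly, but mostly redundant when sequences are contiguous.
struct kq_mask_dense {
    int64_t            n_tokens;
    int64_t            n_kv;
    std::vector<float> data; // n_tokens * n_kv entries
};

// Paged-attention-style alternative: each sequence carries an explicit list of
// the KV cells (or fixed-size blocks of cells) it uses, and the kernel walks
// that list instead of scanning a full mask row. Lists of different lengths
// are the part that may hurt GPU efficiency, as noted in the next comment.
struct seq_kv_cells {
    std::vector<int32_t> cells; // indices into the KV cache used by one sequence
};

struct batch_kv_cells {
    std::vector<seq_kv_cells> seqs; // one entry per sequence in the batch
};
```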
|
Yes, we can pass a list instead of a mask. I am not sure of the format though - if each list has a different length, I feel it will hinder GPU performance. Edit: I just got an idea - we can pass both the … |
|
We could use a vector with dimension … |
|
It seems that vLLM has added a new version of paged attention since I last looked into the implementation (vllm-project/vllm#1348). I am not sure what the changes are, but I think it is worth looking into what they are doing. The kernel is in https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu |
|
Alibi could also be done in this kernel. |
|
Regarding the Alibi, I feel reinterpreting it as a … It remains to be seen though if the … I will take a look at the vLLM code, and I've updated the description with some of the things from this discussion. |
|
@ggerganov @slaren Together with @JohannesGaessler and @FSSRepo, we are working on the same thing over at Pints-AI#1, which we intend to submit as a pull request to llama.cpp once the work is done. However, I think we will converge into this one. Given the amount of work here, @ggerganov @slaren, how do you want to organise this? The three of us are actually in a temporary Discord group to work this out - perhaps we can use that? What are your thoughts? |
|
Discord is not an option for me - I prefer to communicate over GitHub issues / discussions / e-mail. Happy to see you have started work on the CUDA implementation. Please take into account the API proposed here - note that it is still a WIP and can change. I can review the implementation that you have when you think it is in a good state. I would prefer PRs that are compatible with this branch so we can verify correctness using `test-backend-ops`. |
|
@ggerganov Got it. Let us work on a plan to converge with this PR. |
|
Any performance numbers? |
|
@slaren ah ok, thanks for the explanation! I'm not seeing any effect of … |
It's currently disabled, yes. |
|
Sadly, I'm not seeing any benefit from this. No reduction in VRAM usage, no speedup, even when fully offloading. In fact, I'm only seeing slower speeds when using partial offloading. |
|
For me (Windows, CUDA, 24GB VRAM) the difference is definitely there, but it depends on the model, and I get the best results with a large amount of context data. The most pronounced case for me is Mixtral-8x7B-Instruct-v0.1-requant-imat-IQ3_XS, which I can fully offload.

Edit: I saw the "old timings" below across at least 4x runs each last night, but today w/o FA is hitting close to 39-40 t/s, so it must have been an edge case there - still, FA seemed to help with it.

With FA: …
Without FA: …
(updated) old w/o timings: …

Other models are less remarkable, but I'm able to store a lot more context.

New tests: llama-bench with -p 512,1024 is less dramatic but measurable, TG ~46 -> ~50. The differences are more obvious at -p 8096, 16192, 32384: PP goes from 819 -> 1005 @ 16K, and from OOM -> 879 @ 32K. |
|
Performance on MacBook Air M2, 24GB, using latest llama.cpp, before and after using the -fa flag:

Without Flash Attention: …
With Flash Attention: …

TL;DR: Generation speed increases from 8.70 t/s to 9.69 t/s, memory usage decreases slightly; prompt processing was not tested in this case. |
|
Hi, does the server have flash attention yet? Or does it use flash attention automatically? Edit: just add -fa in the server too, got it. |
|
Hi, I am having issues building this on CUDA 11.4 now, after this PR. Notably, I am getting … This is not the first time this has happened; previously we added … |
|
@LostRuins can you check whether this fix #7019 works? |
It seems this only applies to a low context like 4K. Testing a very small LLM on my system with a context size of 13,000 tokens and no GQA, the difference is massive: VRAM usage drops from 2.8 to 1.2 GB, text generation goes from 37 to 71 tokens/s, and prompt processing from 1300 tokens/s to 2300 tokens/s. Great work! |
|
From the dialogue above, I think I understand that support for -fa needs to be coded per backend. Can someone confirm that? I'm not having much luck using -fa with the Vulkan backend. I do not expect said support to materialize either; I just want to clarify. |
|
It does need to be implemented per backend. |
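For context, "per backend" means each ggml backend has to both provide a kernel for the new op and report that it supports it; otherwise the op cannot run there. Below is a simplified, illustrative sketch of the kind of per-op check involved - the real callback in ggml-backend has a different signature, and the function here is hypothetical:

```cpp
#include "ggml.h"

// Illustrative only: a backend's "supports_op"-style callback decides, per
// operator, whether it has a kernel for it. A backend without a flash
// attention kernel simply does not claim GGML_OP_FLASH_ATTN_EXT, so -fa
// cannot take effect there until such a kernel is written.
static bool example_backend_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
        case GGML_OP_SOFT_MAX:
            return true;                 // kernels this hypothetical backend has
        case GGML_OP_FLASH_ATTN_EXT:
            return false;                // no FA kernel implemented yet
        default:
            return false;
    }
}
```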
|
Why does the Metal test fail? …
@LukeLIN-web Because you're compiling with LLAMA_CUBLAS (which is deprecated, by the way; use LLAMA_CUDA). You can't use CUDA on a MacBook.
|
Any updates on context shift compatibility?
Implemented ggml_sycl_op_soft_max() F16 src1 (mask) support, for which a pragma deprecation warning was added during #5021. To do this, it had to be decoupled from ggml_sycl_op_flatten, which always considered src1 to be of FP32 type (many OP functions are dependent on it).

* SYCL: SOFTMAX F16 mask support and other fixes
* test-backend-ops: Add F16 mask test cases
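For readers not following the SYCL code, the substance of the change is that the soft-max op must now accept the mask (src1) in either F16 or F32. A backend-agnostic sketch of that per-row handling - not the actual SYCL kernel, just an illustration of the case it now covers - might look like:

```cpp
#include "ggml.h"

// Illustrative only: applies the scale and an optional F16 or F32 mask row to
// one row of KQ scores before the soft-max. This is the branching the SYCL op
// has to support now that src1 (the mask) may be F16.
static void apply_mask_row(float * scores, const void * mask_row,
                           enum ggml_type mask_type, int n, float scale) {
    for (int i = 0; i < n; ++i) {
        float m = 0.0f;
        if (mask_row != nullptr) {
            m = (mask_type == GGML_TYPE_F16)
                ? ggml_fp16_to_fp32(((const ggml_fp16_t *) mask_row)[i])
                : ((const float *) mask_row)[i];
        }
        scores[i] = scores[i]*scale + m;
    }
}
```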
ref #3365
Setting up what's needed for Flash Attention support in `ggml` and `llama.cpp`.

The proposed operator performs:
Suggestions and comments for the API are welcome.
Looking for help in implementing efficient GPU kernels - please open a PR to this branch if you have proposals.
- `ggml` API: `ggml_flash_attn_ext()`
- `llama.cpp`: use in `llm_build_kqv()`
- `test-backend-ops` test
- `GGML_PREC_F32` support (CUDA) (CUDA: faster FlashAttention for batch sizes > 1 #6646)
- `GGML_PREC_F32` support (Metal)

Changes to `ggml`/`llama`:

- Add `GGML_OP_FLASH_ATTN_EXT` and the `ggml_flash_attn_ext()` call (before merging we can consider reusing the old `GGML_OP_FLASH_ATTN` and removing the legacy code)
- Change the `mask` type to F16 for `ggml_soft_max_ext()` and require that it is padded to `GGML_KQ_MASK_PAD` (32) (see the sketch after this list)
- `n_kv`, denoting the number of computed tokens from the KV cache, is now padded to 128 (from 32) to support larger FA blocks without making out-of-bounds accesses
- The minimum `llama_context_params.n_batch` that can be used is `GGML_KQ_MASK_PAD` (32), to avoid out-of-bounds access in the FA kernels for small batch sizes
- The `V` tensor is no longer transposed when storing it in the KV cache
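As a small illustration of the padding rules above: the wrapper function and variable names below are hypothetical, `GGML_PAD` and `ggml_new_tensor_2d` are real ggml helpers, and the exact dimension order is an assumption.

```cpp
#include "ggml.h"

// GGML_KQ_MASK_PAD is the padding constant introduced by this PR on the
// llama.cpp side; redefined here only so the sketch is self-contained.
#ifndef GGML_KQ_MASK_PAD
#define GGML_KQ_MASK_PAD 32
#endif

// Hypothetical helper: size the F16 KQ mask so the FA kernels never read out
// of bounds. n_kv is rounded up to 128 (larger FA blocks), the token
// dimension is rounded up to GGML_KQ_MASK_PAD.
static struct ggml_tensor * build_kq_mask(struct ggml_context * ctx,
                                          int64_t n_kv, int64_t n_tokens) {
    const int64_t n_kv_pad     = GGML_PAD(n_kv,     128);
    const int64_t n_tokens_pad = GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);

    // the mask is now F16 (it used to be F32)
    return ggml_new_tensor_2d(ctx, GGML_TYPE_F16, n_kv_pad, n_tokens_pad);
}
```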
Things to consider

- … `ggml_add()`? (low-prio)

Testing
- `main`, `server`: add `-fa`
- `llama-bench`: add `-fa 1`

Benchmark
Baseline:
FA kernel:
Text-generation after long prompt:
References