Great work on this project! 🎉
I benchmarked the FP8 decode attention on H200 and the results are impressive:
```
============================================================
GPU: NVIDIA H200
Dtype: FP8
Batch: 128, Q_len: 1, KV_len: 2048
Heads: 64 (Q) / 8 (KV), Dim: 128
Warmup: 10, Iters: 100
============================================================
[hpc-ops] Allocating buffers...
[hpc-ops] Block size: 64
[FA3] Allocating buffers...
[FA3] Block size: 448
============================================================
Results:
============================================================
hpc-ops:
  Avg time:   0.147 ms
  Throughput: 6800.0 iter/s
  Bandwidth:  3672.1 GB/s
  TFLOPS:     58.41
FA3:
  Avg time:   0.263 ms
  Throughput: 3804.7 iter/s
  Bandwidth:  2054.6 GB/s
  TFLOPS:     32.68
------------------------------------------------------------
hpc-ops is 1.79x faster than FA3
------------------------------------------------------------
```
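For context, here is a minimal sketch of how the reported bandwidth and TFLOPS figures can be reproduced from the run configuration. The counting conventions (K/V cache reads plus Q read and an FP16 output write for bytes; two GEMMs per head for FLOPs) are my assumptions, not taken from the benchmark script, but they land on the same numbers:

```python
# Sketch of the metric arithmetic for the decode config above.
# Assumptions: FP8 inputs (1 byte/elem), FP16 output (2 bytes/elem),
# bytes = K + V cache reads + Q read + O write, FLOPs = Q@K^T and P@V.
batch, q_len, kv_len = 128, 1, 2048
q_heads, kv_heads, dim = 64, 8, 128

kv_bytes = 2 * batch * kv_len * kv_heads * dim * 1   # K and V cache, FP8
q_bytes  = batch * q_len * q_heads * dim * 1         # Q, FP8
o_bytes  = batch * q_len * q_heads * dim * 2         # output, FP16 (assumed)
total_bytes = kv_bytes + q_bytes + o_bytes

# Q @ K^T and P @ V: 2 * B * q_len * kv_len * q_heads * dim FLOPs each
flops = 2 * 2 * batch * q_len * kv_len * q_heads * dim

for name, avg_ms in [("hpc-ops", 0.147), ("FA3", 0.263)]:
    t = avg_ms * 1e-3
    print(f"{name}: {1 / t:,.1f} iter/s, "
          f"{total_bytes / t / 1e9:.1f} GB/s, "
          f"{flops / t / 1e12:.2f} TFLOPS")
# hpc-ops: ~6803 iter/s, ~3674 GB/s, ~58.4 TFLOPS  (matches the table above)
# FA3:     ~3802 iter/s, ~2053 GB/s, ~32.7 TFLOPS
```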
Feature Request:
Would it be possible to support q_len > 1 (e.g., q_len = 4) in attention_decode_fp8?
This would enable support for Multi-Token Prediction (MTP) models such as DeepSeek-V3, which predict multiple tokens per forward pass.
Currently, calling it with q_len > 1 raises: RuntimeError: num_seq_q must be 1
Ref: src/attention/entry.cc:310
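For reference, a minimal repro sketch of the MTP-style call. The module name and the exact attention_decode_fp8 signature below are assumptions on my part; only the q_len > 1 restriction and the error message come from the check in entry.cc:

```python
# Hypothetical repro -- the hpc_ops module name and the attention_decode_fp8
# argument layout are assumptions; only the error message is from the project.
import torch
import hpc_ops  # assumed Python binding for hpc-ops

batch, q_len, kv_len = 128, 4, 2048      # q_len = 4, e.g. one MTP draft step
q_heads, kv_heads, dim = 64, 8, 128

q = torch.randn(batch, q_len, q_heads, dim, device="cuda").to(torch.float8_e4m3fn)
k = torch.randn(batch, kv_len, kv_heads, dim, device="cuda").to(torch.float8_e4m3fn)
v = torch.randn(batch, kv_len, kv_heads, dim, device="cuda").to(torch.float8_e4m3fn)

# Works with q_len == 1; with q_len > 1 it currently raises:
#   RuntimeError: num_seq_q must be 1
out = hpc_ops.attention_decode_fp8(q, k, v)
```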
Thanks!