
[Feature Request] Support q_len > 1 for FP8 decode (Multi-Token Prediction) #4

@ZZBoom


Great work on this project! 🎉

I benchmarked the FP8 decode attention kernel on an H200, and the results are impressive:

```
============================================================
GPU: NVIDIA H200
Dtype: FP8
Batch: 128, Q_len: 1, KV_len: 2048
Heads: 64 (Q) / 8 (KV), Dim: 128
Warmup: 10, Iters: 100
============================================================

[hpc-ops] Allocating buffers...
[hpc-ops] Block size: 64

[FA3] Allocating buffers...
[FA3] Block size: 448

============================================================
Results:
============================================================

hpc-ops:
  Avg time:    0.147 ms
  Throughput:  6800.0 iter/s
  Bandwidth:   3672.1 GB/s
  TFLOPS:      58.41

FA3:
  Avg time:    0.263 ms
  Throughput:  3804.7 iter/s
  Bandwidth:   2054.6 GB/s
  TFLOPS:      32.68

------------------------------------------------------------
hpc-ops is 1.79x faster than FA3
------------------------------------------------------------
```
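
As a rough sanity check (my own arithmetic, not part of the benchmark output), the reported TFLOPS and bandwidth line up with the problem size above, assuming the usual 2-FLOPs-per-MAC count for the QK^T and PV matmuls and roughly one full pass over the FP8 KV cache per decode step:

```python
# Back-of-the-envelope check of the reported hpc-ops numbers (my assumptions,
# not the benchmark script): 2 matmuls (QK^T + PV), FP8 = 1 byte per element.
batch, q_len, kv_len = 128, 1, 2048
heads_q, heads_kv, dim = 64, 8, 128

flops = 2 * 2 * batch * heads_q * q_len * kv_len * dim   # QK^T + PV
kv_bytes = batch * kv_len * heads_kv * dim * 2           # K and V, 1 byte/elem

t = 0.147e-3  # hpc-ops avg time in seconds
print(f"TFLOPS ~ {flops / t / 1e12:.2f}")    # ~58.4  (reported: 58.41)
print(f"KV GB/s ~ {kv_bytes / t / 1e9:.0f}") # ~3652  (reported 3672.1 presumably
                                             #         also counts Q/output traffic)
```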

Feature Request:

Would it be possible to support q_len > 1 (e.g., q_len = 4) in attention_decode_fp8?

This would enable Multi-Token Prediction (MTP) models such as DeepSeek-V3, which predict multiple tokens per forward pass.

Calling it with q_len > 1 currently fails with: RuntimeError: num_seq_q must be 1

Ref: src/attention/entry.cc:310
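
For illustration, this is roughly the call shape being requested. The function and argument names below are my guesses based on the benchmark, not the actual hpc-ops API (the real entry point may also take paged-KV block tables and FP8 scale factors):

```python
# Hypothetical sketch of the requested usage (q_len = 4, MTP-style decode).
# Binding and argument names are assumptions, not the real hpc-ops API.
import torch

batch, q_len, kv_len = 128, 4, 2048   # q_len > 1 is the ask
heads_q, heads_kv, dim = 64, 8, 128   # same GQA config as the benchmark

q = torch.randn(batch, q_len, heads_q, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
k = torch.randn(batch, kv_len, heads_kv, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
v = torch.randn(batch, kv_len, heads_kv, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

# Today this path is rejected with "RuntimeError: num_seq_q must be 1"
# (src/attention/entry.cc:310); the request is to accept small q_len (e.g. 4).
# out = hpc_ops.attention_decode_fp8(q, k, v, ...)   # hypothetical call
```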

Thanks!
