
[Feature Request] Support q_len > 1 for FP8 decode (Multi-Token Prediction) #4

@ZZBoom


Great work on this project! 🎉

I benchmarked the FP8 decode attention kernel on an H200, and the results are impressive:

```
============================================================
GPU: NVIDIA H200
Dtype: FP8
Batch: 128, Q_len: 1, KV_len: 2048
Heads: 64 (Q) / 8 (KV), Dim: 128
Warmup: 10, Iters: 100
============================================================

[hpc-ops] Allocating buffers...
[hpc-ops] Block size: 64

[FA3] Allocating buffers...
[FA3] Block size: 448

============================================================
Results:
============================================================

hpc-ops:
  Avg time:    0.147 ms
  Throughput:  6800.0 iter/s
  Bandwidth:   3672.1 GB/s
  TFLOPS:      58.41

FA3:
  Avg time:    0.263 ms
  Throughput:  3804.7 iter/s
  Bandwidth:   2054.6 GB/s
  TFLOPS:      32.68

------------------------------------------------------------
hpc-ops is 1.79x faster than FA3
------------------------------------------------------------
```
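
As a rough sanity check (my own arithmetic, not part of the benchmark output), the reported TFLOPS and bandwidth line up with the problem size above, assuming the usual 2-FLOPs-per-MAC count for the QK^T and PV matmuls and roughly one full pass over the FP8 KV cache per decode step:

```python
# Back-of-the-envelope check of the reported hpc-ops numbers (my assumptions,
# not the benchmark script): 2 matmuls (QK^T + PV), FP8 = 1 byte per element.
batch, q_len, kv_len = 128, 1, 2048
heads_q, heads_kv, dim = 64, 8, 128

flops = 2 * 2 * batch * heads_q * q_len * kv_len * dim   # QK^T + PV
kv_bytes = batch * kv_len * heads_kv * dim * 2           # K and V, 1 byte/elem

t = 0.147e-3  # hpc-ops avg time in seconds
print(f"TFLOPS ~ {flops / t / 1e12:.2f}")    # ~58.4  (reported: 58.41)
print(f"KV GB/s ~ {kv_bytes / t / 1e9:.0f}") # ~3652  (reported 3672.1 presumably
                                             #         also counts Q/output traffic)
```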

Feature Request:

Would it be possible to support q_len > 1 (e.g., q_len = 4) in attention_decode_fp8?

This would enable Multi-Token Prediction (MTP) models such as DeepSeek-V3, which predict multiple tokens per forward pass.

Calling it with q_len > 1 currently fails with: RuntimeError: num_seq_q must be 1

Ref: src/attention/entry.cc:310
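
For illustration, this is roughly the call shape being requested. The function and argument names below are my guesses based on the benchmark, not the actual hpc-ops API (the real entry point may also take paged-KV block tables and FP8 scale factors):

```python
# Hypothetical sketch of the requested usage (q_len = 4, MTP-style decode).
# Binding and argument names are assumptions, not the real hpc-ops API.
import torch

batch, q_len, kv_len = 128, 4, 2048   # q_len > 1 is the ask
heads_q, heads_kv, dim = 64, 8, 128   # same GQA config as the benchmark

q = torch.randn(batch, q_len, heads_q, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
k = torch.randn(batch, kv_len, heads_kv, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
v = torch.randn(batch, kv_len, heads_kv, dim, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

# Today this path is rejected with "RuntimeError: num_seq_q must be 1"
# (src/attention/entry.cc:310); the request is to accept small q_len (e.g. 4).
# out = hpc_ops.attention_decode_fp8(q, k, v, ...)   # hypothetical call
```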

Thanks!
