
Blog: DeepSeek-V3.2 on GB300: Performance Breakthrough #164

Draft
nicole-lihui wants to merge 4 commits into vllm-project:main from nicole-lihui:blog/20260206-gb300-deepseek

Conversation

@nicole-lihui

add blog post on running DeepSeek on GB300


Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: Peter Pan <Peter.Pan@daocloud.io>
Co-authored-by: Kebe <mail@kebe7jun.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
Contributor

@xinli-sw left a comment

Thank you! Left some suggestions for you to consider :)

In general, I think we could also ask Cursor to fix some small grammar issues or naming inconsistencies.


# Summary

**DeepSeek-V3.2** runs smoothly on **GB300** (SM103, Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of **7360 TGS** in a prefill-only scenario. In a mixed (P+D) scenario _(ISL=2k, OSL=1K)_, the output throughput per GPU is **2816 TGS**.
Contributor

Could we define TGS here at its first appearance?

A single GB300/B300 GPU has 288 GB of memory, so two GPUs are sufficient to hold the FP4-format weights of the DeepSeek series models.
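As a rough back-of-the-envelope check (the figures below are estimates, not numbers from the blog): DeepSeek-V3.2 has on the order of 671B parameters, and NVFP4 stores roughly 0.5 bytes per parameter plus per-block scaling factors, so the weights come out to roughly 340–400 GB:

$$
671 \times 10^{9}\ \text{params} \times 0.5\ \tfrac{\text{bytes}}{\text{param}} \approx 336\ \text{GB} \;<\; 2 \times 288\ \text{GB} = 576\ \text{GB},
$$

leaving the remaining headroom for activations and KV cache.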

```bash
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2
```
Contributor

qq: is EP worse in this case?

Author

Not necessarily — EP isn’t worse by default. The TP=2 recommendation is mainly for common community workloads with large ISL and small OSL where prefill dominates; for output-heavy cases, EP=2 is preferred due to its TPOT advantage (as discussed later in the article).
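For readers who want to try the EP variant mentioned above, a minimal sketch (untested on GB300; vLLM enables expert parallelism for MoE layers via `--enable-expert-parallel`, and the exact parallel layout the authors benchmarked is not shown in this excerpt):

```bash
# Shard the MoE experts across the same two GPUs instead of only tensor-parallel
# splitting them; behavior can differ across vLLM versions, so check `vllm serve --help`.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 --enable-expert-parallel
```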

### 4. Optimize Batch Configuration

Below are reference values for the maximum batch boundary that achieves better prefill throughput with TP=2, set via the additional parameter `--max-num-batched-tokens`:
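(The reference table itself is not reproduced in this excerpt; the command below is purely illustrative, with a hypothetical token budget, just to show where the flag goes.)

```bash
# Illustrative only -- 16384 is a placeholder, not the blog's measured optimum for GB300.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 \
  --max-num-batched-tokens 16384
```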

Contributor

if you are up for it, these flags / env vars could also help increase perf (but I haven't tried on GB300)

    --kv_cache_dtype: fp8
    --stream-interval 20
    --compilation_config.pass_config.fuse_allreduce_rms true
    --compilation_config.pass_config.fuse_attn_quant true
    --compilation_config.custom_ops+=+quant_fp8,+rms_norm

If you believe the decode batch size can exceed 1024, this also helps:

    --max-cudagraph-capture-size: 2048

Contributor

Sorry, I'm not entirely sure if fp8 KV cache is supported on DSV3.2; if not, please ignore :)


Compared to FP8, FP4 weights yield a **14%** improvement in prefill-only scenarios and **2×** the output throughput in mixed scenarios.

During prefill, most of the time goes to attention, the Indexer, KV-cache writes, and so on. FP4 primarily optimizes weight-read bandwidth and the GEMMs for MoE/MLP, so the overall improvement is limited (14%), as the rough estimate below illustrates.
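A back-of-the-envelope Amdahl-style estimate (the fraction $f$ is an assumption for illustration, not a profiled number): if the FP4-accelerated weight reads and MoE/MLP GEMMs take roughly a quarter of prefill time and FP4 roughly doubles their speed, the end-to-end gain is about 14%:

$$
\text{speedup} = \frac{1}{(1-f) + f/s} = \frac{1}{0.75 + 0.25/2} \approx 1.14\times,
\qquad f \approx 0.25,\; s \approx 2.
$$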
Contributor

NVFP4 should also have 1.5x extra flops vs FP8


Author

In theory, NVFP4 should also deliver roughly 1.5× the FLOPs of FP8. In our experiments, however, the prefill-only speedup at tp=4 was not that high. We attribute this to the finer partitioning of work per card reducing per-GPU compute granularity, which keeps the TensorCores from fully utilizing FP4 compute and lowers the marginal speedup. With FP4 + TP2, the compute advantage is better realized, ultimately yielding a prefill speedup close to, and even beyond, the theoretical expectation (~1.8×, 7360 TGS).

Co-authored-by: Xin Li <119016172+xinli-sw@users.noreply.github.com>
Signed-off-by: Nicole LiHui 🥜 <nicolelihui@outlook.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>

FP4 allows the model to run on only 2 GPUs with `-tp 2`, achieving up to **14720 TGS (7360 per GPU)** total prefill throughput _(ISL=2k, OSL=1, batch=64)_ and **5632 TGS (2816 per GPU)** total throughput with P+D mixing _(ISL=2k, OSL=1K, batch=512)_. Compared to FP8 with `-tp 4`, this configuration achieves ~**1.8×** higher prefill-only throughput and ~**8×** higher total throughput with P+D mixing.

**FP4 with TP2 clearly emerges as our recommended configuration**, because lower precision alone isn't enough. FP4 significantly reduces the model and KV-cache footprint, which lowers memory-bandwidth pressure and allows larger cache capacity and higher compute utilization per GPU. In practice, TP2 strikes a better balance between parallelization and per-card workload size, enabling the TensorCores to more fully exploit FP4's higher FLOPs and bandwidth efficiency, whereas TP4's finer partitioning dilutes that advantage.
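For anyone who wants to reproduce the mixed (P+D) numbers, a rough sketch using vLLM's serving benchmark (assuming a recent vLLM that ships the `vllm bench serve` subcommand; the flag values mirror ISL=2k / OSL=1K / batch=512 above, while the authors' exact dataset and request-rate settings are not shown in this excerpt):

```bash
# Run against a server started with: vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2
# Illustrative settings only -- adjust to match the blog's actual benchmark setup.
vllm bench serve \
  --model nvidia/DeepSeek-V3.2-NVFP4 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --num-prompts 512
```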
Author

Thanks to @xinli-sw for pointing out that FP4 offers a ~1.5× FLOPs improvement over FP8 — that insight inspired me. We should emphasize more clearly that the performance benefits of DS V3.2’s FP4 mode are best realized with the FP4 + TP2 configuration, otherwise it could be easily misunderstood or underestimated.
