Blog: DeepSeek-V3.2 on GB300: Performance Breakthrough #164
nicole-lihui wants to merge 4 commits into vllm-project:main from
Conversation
Signed-off-by: nicole-lihui <nicole.li@daocloud.io> Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com> Co-authored-by: Peter Pan <Peter.Pan@daocloud.io> Co-authored-by: Kebe <mail@kebe7jun.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
xinli-sw left a comment
Thank you! Left some suggestions for you to consider :)
In general, I think we could also ask Cursor to fix some small grammar issues and naming inconsistencies.
_posts/2026-02-06-gb300-deepseek.md
Outdated
# Summary

**DeepSeek-V3.2** has been successfully run on **GB300** (SM103, Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of <span style="color: #2ecc71; font-weight: 600;">7360 TGS</span> in a prefill-only scenario. In a mixed prefill+decode (P+D) scenario _(ISL=2K, OSL=1K)_, the output throughput per GPU is **2816 TGS**.
Could we define TGS here at its first appearance?
GB300/B300 single-GPU memory is 288 GB. Two GPUs are sufficient to hold the FP4-format weights of the DeepSeek-series models.

```bash
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2
```
qq: is EP worse in this case?
Not necessarily — EP isn’t worse by default. The TP=2 recommendation is mainly for common community workloads with large ISL and small OSL where prefill dominates; for output-heavy cases, EP=2 is preferred due to its TPOT advantage (as discussed later in the article).
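For the output-heavy path, here is a minimal launch sketch for the EP=2 variant (it assumes vLLM's `--enable-expert-parallel` flag and was not benchmarked in the post):

```bash
# EP variant: keep 2 GPUs, but shard the MoE experts instead of relying on
# pure tensor parallelism. Not benchmarked on GB300 in this post.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 --enable-expert-parallel
```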
### 4. Optimize Batch Configuration

Below are reference values for the maximum batch boundary that yields better prefill throughput with TP=2, using the additional parameter `--max-num-batched-tokens`:
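As a quick illustration of how that parameter is applied (the token budget below is a placeholder, not one of the post's reference values):

```bash
# Prefill-oriented launch: raise the per-step token budget so long prompts can be
# batched together. 32768 is an illustrative value, not a tuned recommendation.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 --max-num-batched-tokens 32768
```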
If you are up for it, these flags / env vars could also help increase perf (but I haven't tried them on GB300):

--kv_cache_dtype fp8
--stream-interval 20
--compilation_config.pass_config.fuse_allreduce_rms true
--compilation_config.pass_config.fuse_attn_quant true
--compilation_config.custom_ops+=+quant_fp8,+rms_norm

If you believe the decode batch size can exceed 1024, this also helps:

--max-cudagraph-capture-size 2048
Sorry, I'm not entirely sure whether fp8 KV cache is supported on DSV3.2; if not, please ignore :)
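For concreteness, a sketch of how a few of these might be combined into a single launch; the flag spellings are taken from the suggestion above and have not been verified on GB300 or against a particular vLLM release:

```bash
# Untested sketch combining the reviewer-suggested flags with the TP=2 launch.
# Drop --kv_cache_dtype fp8 if fp8 KV cache turns out not to be supported for DSV3.2;
# the compilation-config fusion options listed above can be appended in the same way.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 \
  --kv_cache_dtype fp8 \
  --stream-interval 20 \
  --max-cudagraph-capture-size 2048
```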
_posts/2026-02-06-gb300-deepseek.md
Outdated
Compared to FP8, FP4 weights yield a **14%** improvement in prefill-only scenarios and **2×** output throughput in mixed scenarios.

In prefill, most of the time goes to attention, the indexer, KV-cache writes, and so on. FP4 primarily optimizes weight-read bandwidth and the MoE/MLP GEMMs, so the overall improvement is limited (14%).
In theory NVFP4 should also deliver roughly 1.5× the FLOPs of FP8, but in our experiments the prefill-only speedup at TP=4 was not high. We attribute this to the finer partitioning of work per card reducing per-GPU compute granularity, which keeps the Tensor Cores from fully utilizing FP4 compute and lowers the marginal speedup. In an FP4 + TP2 configuration the compute advantage was better realized, ultimately yielding a prefill speedup close to, and even beyond, the theoretical expectation (~1.8×, 7360 TGS).
Co-authored-by: Xin Li <119016172+xinli-sw@users.noreply.github.com> Signed-off-by: Nicole LiHui 🥜 <nicolelihui@outlook.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
FP4 allows the model to run on only 2 GPUs with `-tp 2`, achieving up to **14720 (7360 per GPU)** total prefill throughput _(ISL=2K, OSL=1, batch=64)_ and **5632 (2816 per GPU)** total throughput for P+D mixing _(ISL=2K, OSL=1K, batch=512)_. Compared to FP8 with `-tp 4`, this configuration achieves ~**1.8×** higher prefill-only throughput and ~**8×** higher total throughput in P+D mixing.

**FP4 with TP2 clearly emerges as our recommended configuration**, because lower precision alone isn't enough. FP4 significantly reduces the model and KV-cache footprint, which lowers memory-bandwidth pressure and allows larger cache capacity and higher compute utilization per GPU. In practice, TP2 strikes a better balance between parallelization and per-card workload size, enabling the Tensor Cores to more fully exploit FP4's higher FLOPs and bandwidth efficiency, whereas TP4's finer partitioning dilutes that advantage.
Thanks to @xinli-sw for pointing out that FP4 offers a ~1.5× FLOPs improvement over FP8; that insight inspired me. We should emphasize more clearly that the performance benefits of DS V3.2's FP4 mode are best realized with the FP4 + TP2 configuration, otherwise they could easily be misunderstood or underestimated.

add blog post on running DeepSeek on GB300