Blog: DeepSeek-V3.2 on GB300: Performance Breakthrough #164
nicole-lihui wants to merge 4 commits into vllm-project:main from
Conversation
Signed-off-by: nicole-lihui <nicole.li@daocloud.io> Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com> Co-authored-by: Peter Pan <Peter.Pan@daocloud.io> Co-authored-by: Kebe <mail@kebe7jun.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
xinli-sw left a comment
Thank you! Left some suggestions for you to consider :)
In general, I think we could also ask Cursor to fix some small grammar issues and naming inconsistencies.
_posts/2026-02-06-gb300-deepseek.md
Outdated
# Summary

**DeepSeek-V3.2** has been successfully run on **GB300** (SM103, Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of <span style="color: #2ecc71; font-weight: 600;">7360 TGS</span> in a prefill-only scenario. In a mixed prefill+decode (P+D) scenario _(ISL=2K, OSL=1K)_, the output throughput per GPU is **2816 TGS**.
Could we define TGS here at its first appearance?
GB300/B300 single-GPU memory is 288 GB. Two GPUs are sufficient to hold the FP4-format weights of the DeepSeek-series models.

```bash
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2
```
qq: is EP worse in this case?
Not necessarily — EP isn’t worse by default. The TP=2 recommendation is mainly for common community workloads with large ISL and small OSL where prefill dominates; for output-heavy cases, EP=2 is preferred due to its TPOT advantage (as discussed later in the article).
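For the output-heavy path, here is a minimal launch sketch for the EP=2 variant (it assumes vLLM's `--enable-expert-parallel` flag and was not benchmarked in the post):

```bash
# EP variant: keep 2 GPUs, but shard the MoE experts instead of relying on
# pure tensor parallelism. Not benchmarked on GB300 in this post.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 --enable-expert-parallel
```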
### 4. Optimize Batch Configuration

Below are reference values for the maximum batch boundary that yields better prefill throughput with TP=2, using the additional parameter `--max-num-batched-tokens`:
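As a quick illustration of how that parameter is applied (the token budget below is a placeholder, not one of the post's reference values):

```bash
# Prefill-oriented launch: raise the per-step token budget so long prompts can be
# batched together. 32768 is an illustrative value, not a tuned recommendation.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 --max-num-batched-tokens 32768
```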
If you are up for it, these flags / env vars could also help increase perf (but I haven't tried them on GB300):

--kv_cache_dtype fp8
--stream-interval 20
--compilation_config.pass_config.fuse_allreduce_rms true
--compilation_config.pass_config.fuse_attn_quant true
--compilation_config.custom_ops+=+quant_fp8,+rms_norm

If you believe the decode batch size can exceed 1024, this also helps:

--max-cudagraph-capture-size 2048
Sorry, I'm not entirely sure whether fp8 KV cache is supported on DSV3.2; if not, please ignore :)
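For concreteness, a sketch of how a few of these might be combined into a single launch; the flag spellings are taken from the suggestion above and have not been verified on GB300 or against a particular vLLM release:

```bash
# Untested sketch combining the reviewer-suggested flags with the TP=2 launch.
# Drop --kv_cache_dtype fp8 if fp8 KV cache turns out not to be supported for DSV3.2;
# the compilation-config fusion options listed above can be appended in the same way.
vllm serve nvidia/DeepSeek-V3.2-NVFP4 -tp 2 \
  --kv_cache_dtype fp8 \
  --stream-interval 20 \
  --max-cudagraph-capture-size 2048
```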
_posts/2026-02-06-gb300-deepseek.md
Outdated
Compared to FP8, FP4 weights yield a **14%** improvement in prefill-only scenarios and **2×** output throughput in mixed scenarios.

In prefill, most of the time goes to attention, the indexer, KV-cache writes, and so on. FP4 primarily optimizes weight-read bandwidth and the MoE/MLP GEMMs, so the overall improvement is limited (14%).
In theory NVFP4 should also deliver roughly 1.5× the FLOPs of FP8, but in our experiments the prefill-only speedup at TP=4 was not high. We attribute this to the finer partitioning of work per card reducing per-GPU compute granularity, which keeps the Tensor Cores from fully utilizing FP4 compute and lowers the marginal speedup. In an FP4 + TP2 configuration the compute advantage was better realized, ultimately yielding a prefill speedup close to, and even beyond, the theoretical expectation (~1.8×, 7360 TGS).
Co-authored-by: Xin Li <119016172+xinli-sw@users.noreply.github.com> Signed-off-by: Nicole LiHui 🥜 <nicolelihui@outlook.com>
Signed-off-by: nicole-lihui <nicole.li@daocloud.io>
FP4 allows the model to run on only 2 GPUs with `-tp 2`, achieving up to **14720 (7360 per GPU)** total prefill throughput _(ISL=2K, OSL=1, batch=64)_ and **5632 (2816 per GPU)** total throughput for P+D mixing _(ISL=2K, OSL=1K, batch=512)_. Compared to FP8 with `-tp 4`, this configuration achieves ~**1.8×** higher prefill-only throughput and ~**8×** higher total throughput in P+D mixing.

**FP4 with TP2 clearly emerges as our recommended configuration**, because lower precision alone isn't enough. FP4 significantly reduces the model and KV-cache footprint, which lowers memory-bandwidth pressure and allows larger cache capacity and higher compute utilization per GPU. In practice, TP2 strikes a better balance between parallelization and per-card workload size, enabling the Tensor Cores to more fully exploit FP4's higher FLOPs and bandwidth efficiency, whereas TP4's finer partitioning dilutes that advantage.
Thanks to @xinli-sw for pointing out that FP4 offers a ~1.5× FLOPs improvement over FP8; that insight inspired me. We should emphasize more clearly that the performance benefits of DS V3.2's FP4 mode are best realized with the FP4 + TP2 configuration, otherwise they could easily be misunderstood or underestimated.

add blog post on running DeepSeek on GB300