forked from NVIDIA/TensorRT-LLM
Feat/support torch lora fp8 base model #1
Closed: shaharmor98 wants to merge 35 commits into feat/support-multi-adapter-with-target-modules from feat/support-torch-lora-fp8-base-model.
Conversation
Force-pushed: 1a8a48f to b2781c0, b1c990d to 8737923, b2781c0 to 5efa91d, ddcc823 to 921edc5, 1df64e0 to 4f89ff7, 921edc5 to 24896f5, 037e8b3 to 4fda88d.
infra: WAR for "Argument list too long" of globalVars[CACHED_CHANGED_FILE_LIST] (NVIDIA#4131): fix init value; update jenkins/L0_MergeRequest.groovy. Signed-off-by: ZhanruiSunCh; Co-authored-by: Yanchao Lu.
Add slurm support with RTXPro6000 PostMerge Tests (NVIDIA#4019); remove H100 post-merge test from testing. Signed-off-by: Yuanjing Xue.
Waive L0 test Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
update waive list Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
… (NVIDIA#4053): add MMLU, GPQADiamond check for llama-4 models; add nemotron cases; add online quant test cases; remove trt flow cases; update thresholds; adjust parallelism strategy; update sanity list; fix failures; skip nemotron-h test case. Signed-off-by: Ivy Zhang; Co-authored-by: Larry.
…lon (NVIDIA#3992) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
…lm-eval (NVIDIA#3946): fix formula; update doc; 1st version; polish; assorted fixes. Signed-off-by: Enwei Zhu.
feat: align logprob definition of PyTorch flow Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
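For context on the logprob definition this commit aligns (the exact change is not shown on this page): the per-token log-probability is the log-softmax of the logits gathered at the chosen token id. A generic sketch, not the PR's code:

```python
import torch
import torch.nn.functional as F


def token_logprobs(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Log-probability of each chosen token: log_softmax(logits) indexed at token_ids.

    logits:    [seq_len, vocab_size]
    token_ids: [seq_len]
    """
    logp = F.log_softmax(logits.float(), dim=-1)
    return logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)


if __name__ == "__main__":
    logits = torch.randn(5, 32000)
    ids = torch.randint(0, 32000, (5,))
    print(token_logprobs(logits, ids).shape)  # torch.Size([5])
```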
support multi lora, tp. Signed-off-by: Shahar Mor.
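The commit title mentions multi-LoRA and tensor parallelism, but the diff is not visible here. A minimal sketch of serving several LoRA adapters over one frozen base linear, assuming standard LoRA math (y = Wx + B(Ax)); in a TP setting the B factor would be sharded along its output dim exactly like the base weight. All names (`MultiLoRALinear`, `add_adapter`) are hypothetical:

```python
import torch
import torch.nn as nn


class MultiLoRALinear(nn.Module):
    """Hypothetical sketch: one frozen base layer, many named LoRA adapters."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.rank = rank
        self.adapters = nn.ModuleDict()

    def add_adapter(self, name: str) -> None:
        a = nn.Linear(self.base.in_features, self.rank, bias=False)
        b = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(b.weight)  # standard LoRA init: adapter starts as a no-op
        self.adapters[name] = nn.Sequential(a, b)

    def forward(self, x: torch.Tensor, adapter=None) -> torch.Tensor:
        y = self.base(x)
        if adapter is not None:
            y = y + self.adapters[adapter](x)  # pick the adapter per request
        return y


if __name__ == "__main__":
    layer = MultiLoRALinear(nn.Linear(64, 64), rank=4)
    layer.add_adapter("math")
    layer.add_adapter("code")
    x = torch.randn(2, 64)
    print(layer(x, adapter="math").shape)  # torch.Size([2, 64])
```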
… (NVIDIA#4080): fallback to NCCL for various patterns when input size is large; move the previous implementation to the cpp side. Signed-off-by: Yukun He.
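The actual threshold and custom all-reduce live on the C++ side and are not shown here. Purely as an illustration of the dispatch pattern the title describes, with a made-up cutoff and a placeholder custom path:

```python
import torch
import torch.distributed as dist

# Hypothetical size cutoff; the real heuristic lives in the C++ allreduce code.
_FALLBACK_NUMEL = 1 << 20


def allreduce_with_fallback(t: torch.Tensor) -> torch.Tensor:
    """Use NCCL (via torch.distributed) for large inputs, a custom path otherwise."""
    if t.numel() >= _FALLBACK_NUMEL or not dist.is_initialized():
        if dist.is_initialized():
            dist.all_reduce(t)  # large input: defer to NCCL
        return t
    return _custom_one_shot_allreduce(t)


def _custom_one_shot_allreduce(t: torch.Tensor) -> torch.Tensor:
    # Stand-in: the real code would call a hand-written fused CUDA kernel here.
    dist.all_reduce(t)
    return t
```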
feat: TRT-LLM Gen FP8 MoE Llama4; TRT-LLM Gen Llama4 MoE Top1 routing; add per-tensor FP8 TRT-LLM Gen GEMMs; add license for cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/gemmCubins; guard routingIndicesClusterKernel and routing kernels for sm90+. Signed-off-by: Nikita Korobov, Jiqun Tu, Chenfei Zhang.
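The routing itself runs in TRT-LLM Gen cubins, so only the concept can be sketched here: top-1 routing picks a single expert per token from the router logits. A plain PyTorch stand-in (softmax gating shown; the gating used by the actual kernels may differ):

```python
import torch


def top1_route(router_logits: torch.Tensor):
    """Return (expert_id, gate_weight) for each token under top-1 routing."""
    probs = torch.softmax(router_logits.float(), dim=-1)
    gate, expert = probs.max(dim=-1)  # one expert per token, its gate weight
    return expert, gate


if __name__ == "__main__":
    logits = torch.randn(8, 16)      # 8 tokens, 16 experts
    expert, gate = top1_route(logits)
    print(expert.shape, gate.shape)  # torch.Size([8]) torch.Size([8])
```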
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
…VIDIA#3661) support ring attn for bert_attention plugin and dit model Signed-off-by: ChunhuanLin <lch_xdu@163.com>
fix alltoall padding for chunked MoE Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
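The padding bug itself is not visible on this page. As background only: expert-parallel all-to-all usually needs each rank's send chunks padded to a common length so a fixed-size exchange can be used. A hedged sketch of that padding step, with hypothetical names:

```python
import torch


def pad_chunks(chunks, multiple: int = 1) -> torch.Tensor:
    """Pad per-destination token chunks to one common length; returns
    a [num_ranks, max_len, hidden] tensor ready for a fixed-size all_to_all."""
    max_len = max(c.shape[0] for c in chunks)
    # Optionally round up to an alignment multiple (hypothetical requirement).
    max_len = -(-max_len // multiple) * multiple
    hidden = chunks[0].shape[-1]
    out = chunks[0].new_zeros(len(chunks), max_len, hidden)
    for i, c in enumerate(chunks):
        out[i, : c.shape[0]] = c  # real tokens first, zero padding after
    return out


if __name__ == "__main__":
    chunks = [torch.randn(n, 32) for n in (3, 7, 5, 6)]
    print(pad_chunks(chunks, multiple=4).shape)  # torch.Size([4, 8, 32])
```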
…IA#4164) feat: Allow overriding cli args with yaml file in trtllm-serve Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
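The exact trtllm-serve option is not shown on this page. As a generic illustration of letting a YAML file override CLI defaults (the flag name, file keys, and arguments below are made up, not trtllm-serve's real interface):

```python
import argparse

import yaml  # PyYAML


def parse_args_with_yaml_overrides(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_batch_size", type=int, default=8)
    parser.add_argument("--kv_cache_free_gpu_memory_fraction", type=float, default=0.9)
    parser.add_argument("--extra_options", type=str, default=None,
                        help="YAML file whose keys override the CLI defaults")
    args = parser.parse_args(argv)
    if args.extra_options:
        with open(args.extra_options) as f:
            overrides = yaml.safe_load(f) or {}
        for key, value in overrides.items():
            if hasattr(args, key):
                setattr(args, key, value)  # YAML value wins over the CLI default
    return args


if __name__ == "__main__":
    print(parse_args_with_yaml_overrides([]))
```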
… (NVIDIA#4141): fix bug of attention dp on qwen3; fix pre-commit changes; fix bug of attention dp 8. Signed-off-by: bhsueh.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
…ist (NVIDIA#4083) Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Add Piecewise CUDA Graph Support Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
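The piecewise mechanics in this commit (capturing only the static segments of the model and running the dynamic parts eagerly) are not shown here. As background, a minimal sketch of what capturing one segment as a CUDA graph looks like in plain PyTorch:

```python
import torch


def capture_graph(fn, example_input: torch.Tensor):
    """Capture fn(example_input) into a CUDA graph; return a replay callable.

    Requires a CUDA device; new inputs are copied into the captured buffer.
    """
    static_in = example_input.clone()
    # Warm up on a side stream so lazy initialization is not captured.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(static_in)

    def replay(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)  # graphs replay on fixed buffers
        graph.replay()
        return static_out

    return replay


if __name__ == "__main__":
    if torch.cuda.is_available():
        lin = torch.nn.Linear(16, 16, device="cuda")
        run = capture_graph(lin, torch.randn(4, 16, device="cuda"))
        print(run(torch.randn(4, 16, device="cuda")).shape)
```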
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
…input sequence length. (NVIDIA#4089) Fix apply_per_channel_scale for extremely large input seq length. Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com> Co-authored-by: crazy-JiangDongHua <759421566@qq.com>
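The real fix is inside the apply_per_channel_scale kernel and is not shown here. As a rough illustration of the operation and of one way to keep very long sequences from forming a single huge temporary (a hypothetical workaround, not the PR's approach):

```python
import torch


def apply_per_channel_scale_chunked(x: torch.Tensor, scale: torch.Tensor,
                                    chunk: int = 65536) -> torch.Tensor:
    """Multiply x[seq, hidden] by a per-channel scale[hidden], one block of rows
    at a time so extremely long sequences stay within bounded intermediates."""
    out = torch.empty_like(x)
    for start in range(0, x.shape[0], chunk):
        end = min(start + chunk, x.shape[0])
        out[start:end] = x[start:end] * scale
    return out


if __name__ == "__main__":
    x = torch.randn(200_000, 128)
    scale = torch.rand(128)
    print(torch.allclose(apply_per_channel_scale_chunked(x, scale), x * scale))  # True
```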
[fix] Fix trtllm-bench for llama 4 Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Co-authored-by: Zhihan Jiang <68881590+nvzhihanj@users.noreply.github.com>
…+ pipeline reuse (NVIDIA#4169) Fix import break caused by rebase. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
… case for pre-ada (NVIDIA#4095) skip pre ada Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
…accuracy test suite (NVIDIA#3440): add mistral-7b-v0.1 torch flow test case; rearrange mistral and mixtral cases; remove api function test; move mistral nemo cases; remove awq llmapi test; remove duplicate test case; update thresholds, test lists, and paths; assorted CI fixes. Signed-off-by: Ivy Zhang.
Add fp8 kv cache tests to DSV3-Lite integration tests: make fp8kv parallel to attention_dp, overlap_scheduler and cuda_graph; update gsm8k; pass quant_config besides pytorch_config; update CI list, test names, and waive list (bug 5239087). Signed-off-by: Bo Li; Co-authored-by: Enwei Zhu.
… (NVIDIA#4126): fix relaxed acceptance to support enabling this feature in the context phase; fix sample_and_accept_draft_tokens unit test. Signed-off-by: Fanrong Li.
skip tests on b200; skip phi-3-128k. Signed-off-by: xinhe-nv.
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
Force-pushed: a34d2a4 to 1b65bf1.
No description provided.
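Since the PR carries no description, the following is only an inferred illustration of the feature named in the branch (feat/support-torch-lora-fp8-base-model): running a LoRA adapter in higher precision on top of an FP8-quantized base linear. All class and tensor names are hypothetical, and the dequantize-then-matmul path is a simplification; this is not the PR's actual implementation:

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn


class Fp8LoraLinear(torch.nn.Module):
    """Hypothetical sketch: frozen FP8 per-tensor base weight plus a float LoRA delta."""

    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        # Per-tensor FP8 quantization of the frozen base weight.
        self.scale = weight.abs().max() / FP8_MAX
        self.w_fp8 = (weight / self.scale).to(torch.float8_e4m3fn)
        out_f, in_f = weight.shape
        # LoRA factors stay in higher precision; B starts at zero so the delta is a no-op.
        self.lora_a = torch.nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize for the matmul; a real kernel would run an FP8 GEMM directly.
        w = self.w_fp8.to(x.dtype) * self.scale
        base = x @ w.t()
        lora = (x @ self.lora_a.t()) @ self.lora_b.t()
        return base + lora


if __name__ == "__main__":
    layer = Fp8LoraLinear(torch.randn(64, 64))
    print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```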