[ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine #4665
Code Review
This pull request introduces TensorRT-LLM as a new rollout engine, which is a significant feature addition. The implementation is comprehensive, including a new Dockerfile, example scripts, documentation, and the core logic for the TRTLLM-based rollout in a hybrid colocated mode. The design leverages Ray for orchestration and IPC for efficient weight updates.
My review has identified a critical bug in one of the new example scripts (`recipe/dapo/test_dapo_7b_math_trtllm.sh`), where a variable is used before it is defined, which would lead to incorrect checkpoint paths. I have also found a potential `IndexError` in the GPU device ID mapping logic in `verl/workers/rollout/trtllm_rollout/trtllm_rollout.py` that could cause a crash in certain environments.
Overall, the changes are well-structured and the addition of TensorRT-LLM is a valuable enhancement. Addressing the identified issues will improve the robustness of this new feature.
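For context, the kind of guard that avoids such an `IndexError` could look like the sketch below. This is a hypothetical illustration, not the code in `trtllm_rollout.py`; the function name and the fallback behavior are assumptions.

```python
# Hypothetical sketch (not the PR's actual code) of a defensive mapping from a
# local rank to a physical GPU id. If the list of visible device ids is shorter
# than expected, indexing it blindly can raise IndexError.
import os


def resolve_device_id(local_rank: int) -> int:
    """Map a local rank to a physical GPU id, falling back safely."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [int(x) for x in visible.split(",") if x.strip()]
    if local_rank < len(ids):
        return ids[local_rank]
    # Fall back to the local rank itself when the mapping is incomplete.
    return local_rank
```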
What does this PR do?
TensorRT-LLM has recently added a Ray orchestrator and the essential features required for the RL workflow. This PR introduces TensorRT-LLM as a new rollout engine for VeRL.
VeRL currently supports several rollout modes. In the hybrid engine mode, a `WorkerDict` class manages multiple workers within a single process group; communication between training and rollout workers takes place within the same process, allowing them to share the Torch GPU memory pool.

Unlike other rollout engines, TensorRT-LLM primarily targets the colocated mode. However, instead of relying purely on the standard colocated mode, we introduce a mixed design that combines aspects of the hybrid engine and colocated modes: training and rollout workers stay colocated so weights can be updated efficiently over IPC, while generation requests are served by a dedicated `TRTLLMHttpServer`.

This PR aims to make the integration as minimally intrusive as possible to VeRL's infrastructure. Currently, it only invokes `RolloutReplica.init_hybrid_colocated()` when both the hybrid engine is enabled and the rollout engine is set to TensorRT-LLM; a sketch of this dispatch is shown below.
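This is a hedged sketch of the dispatch described above, not the PR's actual code; the config field names (`hybrid_engine`, `rollout.name`) are assumptions, and only `RolloutReplica.init_hybrid_colocated()` is named by the PR itself.

```python
def maybe_init_trtllm_hybrid_colocated(replica, config) -> bool:
    """Return True if the new TensorRT-LLM hybrid-colocated path was taken.

    Config field names here are illustrative assumptions.
    """
    if config.hybrid_engine and config.rollout.name == "trtllm":
        # The only path added by this PR; all other engines are untouched.
        replica.init_hybrid_colocated()
        return True
    return False
```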
High Level Design

Please refer to `workers/rollout/trtllm_rollout/trtllm_async_rollout.md` for more details.
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'18px', 'edgeLabelBackground':'#eeeeee'}}}%%
flowchart TB
    space1[" "]
    style space1 fill:none,stroke:none
    subgraph VERL["<b>VERL Training Pipeline</b>"]
        subgraph Workers["<b>Training Workers</b>"]
            Actor["<b>Actor Worker</b>"]
            Critic["<b>Critic Worker</b>"]
            RefModel["<b>Ref Model Worker</b>"]
        end
        Actor -->|<b>Weight Updates<br/>IPC</b>| Rollout["<b>TensorRT-LLM Rollout</b>"]
        subgraph RayCluster["<b>Rollout Workers<br/>(Ray Cluster)</b>"]
            space2[" "]
            style space2 fill:none,stroke:none
            subgraph AsyncRollout["<b>TRTLLMAsyncRollout<br/>(per DP rank)</b>"]
                DPLeader["<b>• DP Leader coordination</b>"]
                IPCMgmt["<b>• IPC handle management</b>"]
                HTTPAdapter["<b>• HTTP adapter for server communication</b>"]
            end
            AsyncRollout -->|<b>HTTP/REST API</b>| HTTPServer
            subgraph HTTPServer["<b>TRTLLMHttpServer<br/>(Ray Actor per Replica)</b>"]
                OpenAI["<b>• OpenAI Server wrapper</b>"]
                EngMgmt["<b>• AsyncLLM engine management</b>"]
                MemMgmt["<b>• Memory management (resume/release)</b>"]
            end
            HTTPServer --> AsyncLLM
            subgraph AsyncLLM["<b>TensorRT-LLM<br/>AsyncLLM Engine</b>"]
                GPUWorkers["<b>• GPU workers (Tensor Parallel)</b>"]
                KVCache["<b>• KV Cache management</b>"]
                CUDAGraph["<b>• CUDA Graph optimization</b>"]
            end
        end
    end
    space1 ~~~ VERL
    style VERL fill:#e1f5ff
    style RayCluster fill:#fff4e6
    style AsyncRollout fill:#f3e5f5
    style HTTPServer fill:#e8f5e9
    style AsyncLLM fill:#fce4ec
```
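For illustration, here is a minimal sketch of how a rollout worker might issue a generation request to the per-replica `TRTLLMHttpServer` through its OpenAI-compatible endpoint. The host, port, model id, and payload fields are assumptions, not values taken from this PR.

```python
# Hypothetical client call against the per-replica HTTP server's
# OpenAI-compatible completions endpoint (URL and model id are placeholders).
import requests


def generate(prompt: str, server_url: str = "http://127.0.0.1:8000") -> str:
    resp = requests.post(
        f"{server_url}/v1/completions",
        json={
            "model": "qwen2-7b",   # placeholder model id
            "prompt": prompt,
            "max_tokens": 256,
            "temperature": 1.0,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```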
Experiment results:

Setup: a single node with 8× H100 GPUs, in a Slurm environment.
FSDP/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 1`)

Convergence:

Validation:

FSDP/GRPO: Qwen2-7B (TP4 × 2 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 4`)

Convergence:

Validation:

Megatron/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 1`)

Convergence:

Validation:

Megatron/GRPO: Qwen2-7B (TP2 × 2 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 4`)

Convergence:

Validation:

Special notes for using VeRL with TensorRT-LLM:
- Install the TensorRT-LLM dependencies with `pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/`.
- Export `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` and a few other environment settings before launching the Ray cluster (see the sketch after this list). While these are included in the example scripts and tests added here, we will work toward removing such dependencies to improve the user experience in the near future.
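As a hedged illustration of the second note, the required variable can also be set programmatically before starting Ray; the example scripts in this PR use `export` in bash instead, and the `runtime_env` propagation shown here is just one way to reach every Ray worker.

```python
# Hypothetical sketch: set the env var in the driver process and propagate it
# to Ray workers via the runtime environment before the cluster starts work.
import os
import ray

os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"

ray.init(
    runtime_env={
        "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"},
    }
)
```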
Outstanding issues for this MR:

Upcoming work (in separate MRs)
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)