-
Notifications
You must be signed in to change notification settings - Fork 3k
[rollout][DONT-MERGE] Test TRTLLM PPO CI workflow #4868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for TensorRT-LLM as a rollout engine. The changes are extensive, touching Dockerfiles, documentation, configuration, and core Python code for workers and trainers. The implementation adds a new trtllm rollout worker, server, and replica, integrating with the existing Ray-based infrastructure. The use of IPC for weight updates and careful management of placement groups for co-location of training and inference workers are notable. Overall, the changes seem well-designed to integrate TRTLLM, but there are a couple of points in the Dockerfile that could be improved for better reliability and maintainability.
| RUN git clone -b v2.3.1 https://github.com/NVIDIA/gdrcopy.git && \ | ||
| git clone https://github.com/deepseek-ai/DeepEP.git && cd DeepEP && git checkout a84a248 && \ | ||
| cd /home/dpsk_a2a && \ | ||
| wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wget command on line 17 downloads nvshmem from a URL that contains /secure/. This often indicates that the resource is behind a login or requires some form of authentication, which can cause the Docker build to fail in automated CI environments. It's better to either host this dependency in a more accessible location or to use a base image that already includes it if possible. If this is a public but temporary URL, it's still a risk for long-term build reproducibility.
| RUN pip install git+https://github.com/volcengine/verl.git@v0.6.0 | ||
| RUN pip uninstall -y verl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Installing verl from git and then immediately uninstalling it is a confusing pattern for managing dependencies. While this might be intended to cache the dependencies in a Docker layer, it makes the Dockerfile harder to understand and maintain. A more explicit approach would be better. For example, you could clone the repository, generate a requirements.txt from setup.py, and then install dependencies from that file. This would make the intent clear and separate dependency installation from the package installation.
578968c to
9db950d
Compare
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: Jonas Yang <joyang@nvidia.com>
b04560a to
aafd494
Compare
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.