[vllm, rollout] fix: Add TP-aware weight loading for async rollout with different parallelism configs #4876
+149
−41
What does this PR do?
Adds automatic tensor parallel (TP) resharding support when training and inference use different TP configurations. This fixes weight loading failures in fully async workflows where Megatron exports full HuggingFace weights but vLLM expects TP-sharded weights.
Related to #400, #1063, #4497
Problem
When using fully async training with:
- Training (Megatron): `tensor_model_parallel_size=8`, `use_mbridge=True`
- Rollout (vLLM): `tensor_model_parallel_size=32`

Weight synchronization fails because:
- mbridge exports full (unsharded) HuggingFace weights, while vLLM expects TP-sharded weights
- Parameters without a `weight_loader` attribute cause assertion errors

This affects large MoE models (DeepSeek-V3/R1), where training requires pipeline parallelism but inference benefits from higher TP across nodes.
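For intuition, a toy example of the failure mode (shapes are illustrative, not taken from DeepSeek-V3): the full HuggingFace export is `tp_size` times larger along the sharded dimension than the parameter vLLM allocates per rank, so a naive copy trips a shape assertion.

```python
import torch

tp_size = 4
full_weight = torch.randn(8192, 4096)            # unsharded HuggingFace export
vllm_param = torch.empty(8192 // tp_size, 4096)  # per-rank column-parallel shard

# A plain copy of the full weight into the per-rank parameter fails:
# vllm_param.copy_(full_weight)  -> shape mismatch (2048 vs 8192 on dim 0)
assert full_weight.shape[0] == vllm_param.shape[0] * tp_size
```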
Solution
- Add a `VERL_ENABLE_TP_RESHARD` environment variable (opt-in, to avoid breaking existing setups)
- Patch each parameter's `weight_loader` to use a TP-aware loader that:
  - Reads vLLM's `tp_size` and `tp_rank` (see the sketch below)
  - Slices the full weight down to the shard owned by the local rank
- Extend `patch_vllm_moe_model_weight_loader` to handle both MoE expert patching and general TP resharding
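A minimal sketch of how the loader can obtain vLLM's TP layout, assuming vLLM's `parallel_state` helpers (the actual `_get_tp_rank_and_size` in this PR may be implemented differently):

```python
from vllm.distributed.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
)

def _get_tp_rank_and_size() -> tuple[int, int]:
    # vLLM initializes its TP process group during engine start-up;
    # these calls report this worker's rank and the TP world size.
    return get_tensor_model_parallel_rank(), get_tensor_model_parallel_world_size()
```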
Test
Validated on DeepSeek-V3-Base with:
- `use_mbridge=True`

Before fix: Weight loading fails with shape mismatch assertions
After fix: Weights load correctly, training proceeds normally
Config used
Launch command
API and Usage Example
```bash
# Enable TP resharding for async rollout
VERL_ENABLE_TP_RESHARD=1 python -m verl.trainer.main ...
```

No config changes required. The fix automatically detects and handles TP mismatches.
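Internally, the opt-in gate can be a plain environment-variable check; a minimal sketch (the variable name comes from this PR, the helper name `_tp_reshard_enabled` is illustrative):

```python
import os

def _tp_reshard_enabled() -> bool:
    # TP resharding stays off unless the user explicitly sets VERL_ENABLE_TP_RESHARD=1.
    return os.environ.get("VERL_ENABLE_TP_RESHARD", "0") == "1"
```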
Design & Code Changes
File: `verl/utils/vllm/patch.py`
- `_get_tp_rank_and_size()`: Helper to get vLLM's tensor parallel configuration
- `_create_tp_aware_weight_loader()`: Creates weight loaders that:
  - Detect when the loaded weight is `tp_size` times larger than the parameter in one dimension
  - Slice out the shard belonging to the local TP rank (see the sketch below)
- `patch_vllm_moe_model_weight_loader()`: Extended to patch all parameters without a `weight_loader` when `VERL_ENABLE_TP_RESHARD=1`
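A minimal sketch of the TP-aware loader under the assumptions above (the real `_create_tp_aware_weight_loader` may differ; the dimension-detection heuristic and `torch.narrow`-based slicing are illustrative):

```python
import torch

def _create_tp_aware_weight_loader(tp_rank: int, tp_size: int):
    def weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
        if tuple(param.shape) == tuple(loaded_weight.shape):
            # Weight is already sharded (or replicated): plain copy.
            param.data.copy_(loaded_weight)
            return
        # Find the dimension where the full weight is tp_size times larger
        # than the local parameter, then copy only this rank's slice.
        for dim, (p_len, w_len) in enumerate(zip(param.shape, loaded_weight.shape)):
            if w_len == p_len * tp_size:
                shard = loaded_weight.narrow(dim, tp_rank * p_len, p_len)
                param.data.copy_(shard)
                return
        raise ValueError(
            f"Cannot reshard weight {tuple(loaded_weight.shape)} into "
            f"parameter {tuple(param.shape)} with tp_size={tp_size}"
        )
    return weight_loader
```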
Checklist Before Submitting
Related Issues