NPU Qwen3-235B-megatron-grpo-64card run script #4855
base: main
Conversation
xiejiahao2333 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Code Review
This PR adds a GRPO training script for running the Qwen3-235B model on 64 NPU cards. The review found some critical issues: the script uses undefined variables to specify the training and validation datasets, which will cause execution to fail, and the parallelism configured for the actor and ref models (pipeline, tensor, expert) does not match the total number of available GPUs, which will also cause an error. I also suggest one small change to make the script more robust. Please see the individual review comments for details.
data.train_files=$TRAIN_DATA_PATH \
data.val_files=$TEST_DATA_PATH \
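The review summary above flags these dataset variables as undefined: the script only defines TRAIN_FILE and TEST_FILE (quoted in a later hunk), not TRAIN_DATA_PATH or TEST_DATA_PATH. A minimal sketch of the likely fix, assuming the defined variables are the intended dataset paths and that the launch goes through verl's main_ppo entry point as in the repo's other Megatron GRPO examples:

```bash
# Sketch only: reference the variables the script actually defines,
# so the data overrides never expand to an empty string.
TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

python3 -m verl.trainer.main_ppo \
    data.train_files="$TRAIN_FILE" \
    data.val_files="$TEST_FILE" \
    "$@"  # remaining overrides from the original script go here
```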
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=16 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \
The training parallelism configuration looks incorrect. The script sets trainer.nnodes=8 and trainer.n_gpus_per_node=8, for a total of 64 GPUs, but the parallel settings for the actor model (pipeline_model_parallel_size=16, tensor_model_parallel_size=4, expert_model_parallel_size=4) multiply out to 16 * 4 * 4 = 256, which does not match the 64 available GPUs. The same applies to the ref model configuration. This will cause Megatron initialization to fail. Please check and correct these parallelism parameters so that their product equals the total GPU count of 64.
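A small guard the script could run before launching, assuming the constraint stated in the comment above (pipeline x tensor x expert parallel size must equal the 64-device world size); the 4 x 4 x 4 split below is only one layout that satisfies it, not a tuned recommendation for Qwen3-235B:

```bash
# Sanity-check sketch: abort early if the parallel layout does not
# cover exactly nnodes * n_gpus_per_node devices.
NNODES=8
GPUS_PER_NODE=8
PP=4   # pipeline_model_parallel_size
TP=4   # tensor_model_parallel_size
EP=4   # expert_model_parallel_size

WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
if [ $((PP * TP * EP)) -ne "$WORLD_SIZE" ]; then
    echo "parallel layout ${PP}x${TP}x${EP} != world size ${WORLD_SIZE}" >&2
    exit 1
fi
```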
TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

mkdir logs
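The comment body for this hunk is not preserved here, but the remaining point in the review summary is a small robustness suggestion, and mkdir logs is the obvious candidate: it fails when the directory already exists, e.g. on a second run. Assuming that is the reviewer's intent, the usual hardening is:

```bash
# Assumed fix: -p makes the call idempotent, so re-running the script
# does not error out when logs/ is already present.
mkdir -p logs
```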
What does this PR do?
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (this will be checked by the CI).
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this is a breaking change, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI via the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR modifies the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.