NPU Qwen3-235B-megatron-grpo-64card run script #4855
base: main
Conversation
xiejiahao2333 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Code Review
This PR adds a GRPO training script for running the Qwen3-235B model on 64 NPU cards. The review found some critical issues: the script uses undefined variables to specify the training and validation datasets, which will cause execution to fail, and the parallelism configured for the actor and ref models (pipeline, tensor, expert) does not match the total number of available GPUs, which will also cause an error. I also suggest one small change to make the script more robust. Please see the individual review comments for details.
data.train_files=$TRAIN_DATA_PATH \
data.val_files=$TEST_DATA_PATH \
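The review summary above flags these dataset variables as undefined: the script only defines TRAIN_FILE and TEST_FILE (quoted in a later hunk), not TRAIN_DATA_PATH or TEST_DATA_PATH. A minimal sketch of the likely fix, assuming the defined variables are the intended dataset paths and that the launch goes through verl's main_ppo entry point as in the repo's other Megatron GRPO examples:

```bash
# Sketch only: reference the variables the script actually defines,
# so the data overrides never expand to an empty string.
TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

python3 -m verl.trainer.main_ppo \
    data.train_files="$TRAIN_FILE" \
    data.val_files="$TEST_FILE" \
    "$@"  # remaining overrides from the original script go here
```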
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=16 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \
The training parallelism configuration looks incorrect. The script sets trainer.nnodes=8 and trainer.n_gpus_per_node=8, for a total of 64 GPUs, but the parallel settings for the actor model (pipeline_model_parallel_size=16, tensor_model_parallel_size=4, expert_model_parallel_size=4) multiply out to 16 * 4 * 4 = 256, which does not match the 64 available GPUs. The same applies to the ref model configuration. This will cause Megatron initialization to fail. Please check and correct these parallelism parameters so that their product equals the total GPU count of 64.
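A small guard the script could run before launching, assuming the constraint stated in the comment above (pipeline x tensor x expert parallel size must equal the 64-device world size); the 4 x 4 x 4 split below is only one layout that satisfies it, not a tuned recommendation for Qwen3-235B:

```bash
# Sanity-check sketch: abort early if the parallel layout does not
# cover exactly nnodes * n_gpus_per_node devices.
NNODES=8
GPUS_PER_NODE=8
PP=4   # pipeline_model_parallel_size
TP=4   # tensor_model_parallel_size
EP=4   # expert_model_parallel_size

WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
if [ $((PP * TP * EP)) -ne "$WORLD_SIZE" ]; then
    echo "parallel layout ${PP}x${TP}x${EP} != world size ${WORLD_SIZE}" >&2
    exit 1
fi
```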
TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

mkdir logs
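The comment body for this hunk is not preserved here, but the remaining point in the review summary is a small robustness suggestion, and mkdir logs is the obvious candidate: it fails when the directory already exists, e.g. on a second run. Assuming that is the reviewer's intent, the usual hardening is:

```bash
# Assumed fix: -p makes the call idempotent, so re-running the script
# does not error out when logs/ is already present.
mkdir -p logs
```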
What does this PR do?
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (this will be checked by the CI).
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this is a breaking change, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI via the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR modifies the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.