
Conversation

@Xiejiahao233

What does this PR do?

Add executable scripts for running Qwen3-235B on NPU A2 with 64 cards.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate with experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


xiejiahao2333 does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds a GRPO training script for running the Qwen3-235B model on 64 NPU cards. The code review found several critical issues: the script uses undefined variables to specify the training and validation datasets, which will cause execution to fail. In addition, the parallelism settings (pipeline, tensor, expert) configured for the actor and ref models do not match the total number of available GPUs, which will also cause an error. I have also suggested a small change to make the script more robust. Please see the individual review comments for details.

Comment on lines +29 to +30
data.train_files=$TRAIN_DATA_PATH \
data.val_files=$TEST_DATA_PATH \


critical

The script uses the undefined variables $TRAIN_DATA_PATH and $TEST_DATA_PATH. Based on the definitions at the top of the script, these should be $TRAIN_FILE and $TEST_FILE. As written, the script will fail because it cannot find the data files.

Suggested change
data.train_files=$TRAIN_DATA_PATH \
data.val_files=$TEST_DATA_PATH \
data.train_files=$TRAIN_FILE \
data.val_files=$TEST_FILE \
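A defensive sketch related to this fix (not part of the PR itself): running the script under `set -u` would make bash abort on the first unset variable, surfacing the `$TRAIN_DATA_PATH` typo immediately instead of passing an empty path downstream. The file paths below are the ones defined in the PR's script.

```shell
#!/usr/bin/env bash
# 'set -u' (nounset) turns any reference to an unset variable into a hard error.
set -u

TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

# These lines work because the variables above are defined:
echo "train files: $TRAIN_FILE"
echo "val files:   $TEST_FILE"

# The buggy lines from the PR would abort here with
# "TRAIN_DATA_PATH: unbound variable" instead of silently
# passing an empty path to the trainer:
# echo "train files: $TRAIN_DATA_PATH"
```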

Comment on lines +54 to +56
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=16 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \


critical

The training parallelism configuration appears to be incorrect. The script sets trainer.nnodes=8 and trainer.n_gpus_per_node=8, for a total of 64 GPUs. However, the parallel settings for the actor model (pipeline_model_parallel_size=16, tensor_model_parallel_size=4, expert_model_parallel_size=4) multiply to 16 * 4 * 4 = 256, which does not match the 64 available GPUs. The same applies to the ref model's configuration. This will cause Megatron initialization to fail. Please review and correct these parallelism parameters so that their product matches the total GPU count of 64.

TRAIN_FILE=/dataset/dapo-math-17k.parquet
TEST_FILE=/dataset/aime-2024.parquet

mkdir logs


high

The mkdir logs command fails if the logs directory already exists. To make the script safely re-runnable, use mkdir -p logs instead: it creates the directory if it does not exist and succeeds without error if it does.

Suggested change
mkdir logs
mkdir -p logs
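The difference is easy to demonstrate in isolation (using a throwaway temp directory here rather than the PR's actual logs path):

```shell
#!/usr/bin/env bash
set -e
tmp=$(mktemp -d)

mkdir -p "$tmp/logs"   # first run: creates the directory
mkdir -p "$tmp/logs"   # second run: no error, exit status still 0

# Plain 'mkdir "$tmp/logs"' would have failed the second time
# with "File exists", aborting a re-run of the training script.
echo "logs directory ready"
rm -rf "$tmp"
```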
