Changes from all commits (51 commits)
aae5443 update (Jintao-Huang, Jan 30, 2026)
4d21203 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 2, 2026)
4939ccb remove megatron.training (Jintao-Huang, Feb 2, 2026)
bcfe512 update (Jintao-Huang, Feb 2, 2026)
332e184 update (Jintao-Huang, Feb 2, 2026)
e58dde8 update (Jintao-Huang, Feb 3, 2026)
7084f77 update (Jintao-Huang, Feb 3, 2026)
f383eb2 update (Jintao-Huang, Feb 3, 2026)
92ab184 update (Jintao-Huang, Feb 3, 2026)
a5fea81 update (Jintao-Huang, Feb 3, 2026)
12c39bc update (Jintao-Huang, Feb 3, 2026)
899f172 update (Jintao-Huang, Feb 3, 2026)
6d918d8 update (Jintao-Huang, Feb 3, 2026)
62f82b8 update (Jintao-Huang, Feb 3, 2026)
1a40342 update (Jintao-Huang, Feb 3, 2026)
349f6e5 fix (Jintao-Huang, Feb 3, 2026)
81e6427 update (Jintao-Huang, Feb 4, 2026)
7233e9a update (Jintao-Huang, Feb 4, 2026)
7b8e28c update (Jintao-Huang, Feb 4, 2026)
29f1440 update (Jintao-Huang, Feb 4, 2026)
5014e4c fix (Jintao-Huang, Feb 4, 2026)
0d96b1a Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 4, 2026)
46a23cc update (Jintao-Huang, Feb 4, 2026)
2cf83b9 update (Jintao-Huang, Feb 5, 2026)
c2b385a update (Jintao-Huang, Feb 5, 2026)
b60b233 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 5, 2026)
5503a1b update (Jintao-Huang, Feb 5, 2026)
a7a1118 update (Jintao-Huang, Feb 5, 2026)
69b8c25 update (Jintao-Huang, Feb 5, 2026)
0a2d2d6 update (Jintao-Huang, Feb 5, 2026)
6bacd38 update (Jintao-Huang, Feb 5, 2026)
12adf2a fix (Jintao-Huang, Feb 5, 2026)
198f549 update (Jintao-Huang, Feb 5, 2026)
6e51567 update (Jintao-Huang, Feb 5, 2026)
cb0b219 update (Jintao-Huang, Feb 5, 2026)
c9a7e16 update (Jintao-Huang, Feb 6, 2026)
96b4169 update (Jintao-Huang, Feb 6, 2026)
39ac20f update (Jintao-Huang, Feb 6, 2026)
071bfb0 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 6, 2026)
2eee590 update (Jintao-Huang, Feb 6, 2026)
c6238ea update (Jintao-Huang, Feb 6, 2026)
ba23ee1 update (Jintao-Huang, Feb 6, 2026)
c17b863 update (Jintao-Huang, Feb 6, 2026)
08d3cc0 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 6, 2026)
f3bb068 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 7, 2026)
8e418e9 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 7, 2026)
17a2df6 update (Jintao-Huang, Feb 7, 2026)
f314806 update (Jintao-Huang, Feb 7, 2026)
2088a44 update (Jintao-Huang, Feb 7, 2026)
0bb0d47 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 7, 2026)
002e6d9 update (Jintao-Huang, Feb 7, 2026)
docs/source/Instruction/Frequently-asked-questions.md (2 changes: 1 addition & 1 deletion)
@@ -592,7 +592,7 @@ megatron sft \
```text
RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.
```
Set `--no_gradient_accumulation_fusion true`.
Set `--gradient_accumulation_fusion false`.
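
To make the fix concrete, the sketch below shows where the corrected flag would sit in a launch command; the checkpoint and dataset names are placeholders and are not taken from this PR:

```shell
# Minimal sketch: only --gradient_accumulation_fusion false is the point here.
# The checkpoint and dataset are placeholders.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --gradient_accumulation_fusion false
```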

### Q163: For MoE LoRA training, if target_modules is set to all-linear, does this include the router modules?
It depends on whether the gate is implemented as nn.Linear; if it is an nn.Parameter, it is not trained. For details, see the command-line parameter [target_parameters](https://swift.readthedocs.io/zh-cn/latest/Instruction/Command-line-parameters.html#tuner).
docs/source/Megatron-SWIFT/Ascend.md (8 changes: 4 additions & 4 deletions)
@@ -182,7 +182,7 @@ def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_par

### Enable

Additionally, since msprobe does not support fusion computation, you need to add `--no_bias_dropout_fusion True`, `--no_bias_swiglu_fusion True`, `--cross_entropy_loss_fusion False` to the launch script.
Additionally, since msprobe does not support fusion computation, you need to add `--bias_dropout_fusion false`, `--bias_swiglu_fusion false`, `--cross_entropy_loss_fusion false` to the launch script.

#### Example
```shell
@@ -196,7 +196,7 @@
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
...
--no_bias_dropout_fusion True \
--no_bias_swiglu_fusion True \
--cross_entropy_loss_fusion False
--bias_dropout_fusion false \
--bias_swiglu_fusion false \
--cross_entropy_loss_fusion false
```
docs/source/Megatron-SWIFT/Command-line-parameters.md (24 changes: 12 additions & 12 deletions)
@@ -23,12 +23,12 @@
- 🔥max_epochs: specifies the number of training epochs. With a non-streaming dataset, this parameter automatically computes train_iters for you, so `train_iters` does not need to be passed manually. With a streaming dataset, training is forced to stop once `max_epochs` is reached, and the weights are then validated and saved. Defaults to None.
- 🔥log_interval: logging interval (unit: iters). Defaults to 5.
- tensorboard_dir: directory where tensorboard logs are written. Defaults to None, i.e. stored under the `f'{save}/runs'` directory.
- no_masked_softmax_fusion: defaults to False. Disables the fusion of query_key_value scaling, masking, and softmax.
- no_bias_dropout_fusion: defaults to False. Disables the fusion of bias and dropout.
- no_bias_swiglu_fusion: defaults to False. Specify `--no_bias_dropout_fusion true` to disable the fusion of bias and swiglu.
- no_rope_fusion: defaults to False. Specify `--no_rope_fusion true` to disable rope fusion.
- **When using positional encodings that do not support rope_fusion, such as mrope, this parameter is automatically set to True**.
- no_gradient_accumulation_fusion: defaults to False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- masked_softmax_fusion: defaults to True. Enables the fusion of query_key_value scaling, masking, and softmax.
- bias_dropout_fusion: defaults to True. Enables the fusion of bias and dropout.
- bias_swiglu_fusion: defaults to True. Enables the fusion of bias and swiglu.
- apply_rope_fusion: defaults to True. Enables rope fusion.
- **When using positional encodings that do not support rope_fusion, such as mrope, this parameter is automatically set to False**.
- gradient_accumulation_fusion: defaults to True. Enables gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: enables cross-entropy loss fusion. Defaults to False.
- cross_entropy_fusion_impl: implementation of the cross-entropy loss fusion. Options are 'native' and 'te'. Defaults to 'native'.
- calculate_per_token_loss: scales the cross-entropy loss by the number of non-padding tokens in the global batch. Defaults to True.
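
As a hedged illustration of how the renamed fusion switches and the loss-fusion options above fit into a launch command (every value below is a placeholder, not a recommendation from this PR):

```shell
# Illustrative sketch only; all values are placeholders.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --max_epochs 3 \
    --log_interval 5 \
    --bias_dropout_fusion true \
    --cross_entropy_loss_fusion true \
    --cross_entropy_fusion_impl te
```
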
@@ -53,7 +53,7 @@
- seed: random seed for python, numpy, pytorch and cuda. Defaults to 42.
- 🔥num_workers: number of dataloader workers. Defaults to 4.
- Note: if `--streaming true` is set, this is set to 1.
- no_data_sharding: takes effect on the train_dataloader when `--train_dataloader_shuffle true`. Defaults to False. Controls the scope of dataset shuffling: if set to False, the dataset is sharded first and each shard is then shuffled (slightly lower memory use); if set to True, the dataset is shuffled first and then sharded (better shuffling). Using this parameter requires "ms-swift>=3.12".
- data_sharding: takes effect on the train_dataloader when `--train_dataloader_shuffle true`. Defaults to False. Controls the scope of dataset shuffling: if set to True, the dataset is sharded first and each shard is then shuffled (slightly lower memory use); if set to False, the dataset is shuffled first and then sharded (better shuffling).
- seq_length: defaults to None, i.e. set to `max_length`. To limit dataset length, it is recommended to use `--max_length` from the Basic Arguments instead; there is no need to set this parameter.
- use_cpu_initialization: initializes weights on the CPU. Defaults to False. Used when converting weights between HF and MCore. Usually there is no need to change it.
- 🔥megatron_extra_kwargs: additional arguments to pass through to megatron, provided as JSON. Defaults to None.
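
A sketch of combining the shuffling, sharding, and pass-through options above; the `manual_gc` key inside `--megatron_extra_kwargs` is a hypothetical example of an option forwarded to Megatron, not something prescribed by this PR:

```shell
# Illustrative sketch; the extra-kwargs key is hypothetical.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --train_dataloader_shuffle true \
    --data_sharding true \
    --megatron_extra_kwargs '{"manual_gc": true}'
```
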
@@ -96,7 +96,7 @@
- Note: for **resuming from a checkpoint** you need to set `--load` (LoRA training additionally requires `--adapter_load`). If `--finetune true` is set, the optimizer state and random seed state are not loaded, the iteration count is reset to 0, and no dataset skipping is performed; if `--finetune false` is set, the iteration count is read and the previously trained portion of the dataset is skipped, while loading of the optimizer state and random seed state is controlled by `--no_load_optim` and `--no_load_rng`.
- Streaming datasets (`--streaming`) do not yet support dataset skipping.
- ckpt_format: checkpoint format. Options are 'torch', 'torch_dist', 'zarr'. Defaults to 'torch_dist'. (Weight conversion currently only supports the 'torch_dist' format.)
- no_initialization: do not initialize the weights. Defaults to True.
- perform_initialization: initialize the weights. Defaults to False.
- auto_detect_ckpt_format: automatically detects whether the ckpt format is legacy or distributed. Defaults to True.
- exit_on_missing_checkpoint: if `--load` is set but **the checkpoint cannot be found, exit directly** instead of initializing. Defaults to True.
- 🔥async_save: uses asynchronous checkpoint saving. Currently only applies to the `torch_dist` distributed checkpoint format. Defaults to False.
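
A hedged sketch of resuming an interrupted run with the checkpoint arguments described above; the checkpoint path is a placeholder:

```shell
# Resume training: --finetune false reads the iteration count and skips
# already-seen data; optimizer and RNG state loading stay enabled by default.
megatron sft \
    --load output/checkpoint-last \
    --finetune false \
    ...
# For LoRA runs, additionally pass --adapter_load with the adapter checkpoint path.
```
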
@@ -135,7 +135,7 @@
- tensorboard_log_interval: interval (in steps) for logging to tensorboard. Defaults to 1.
- tensorboard_queue_size: size of the TensorBoard queue used to buffer events and summaries; once the number of pending events and summaries reaches this size, the next call to an "add"-type method flushes the data to disk. Defaults to 50.
- log_timers_to_tensorboard: logs timers to tensorboard. Defaults to True.
- no_log_learning_rate_to_tensorboard: do not log the learning rate to tensorboard. Defaults to False.
- log_learning_rate_to_tensorboard: logs the learning rate to tensorboard. Defaults to True.
- log_validation_ppl_to_tensorboard: writes validation perplexity to tensorboard. Defaults to True.
- log_memory_to_tensorboard: writes memory logs to tensorboard. Defaults to True.
- logging_level: logging level. Defaults to None.
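
For completeness, a small sketch of setting the TensorBoard switches above explicitly; the directory and values are illustrative only:

```shell
# Illustrative logging setup; the directory name is a placeholder.
megatron sft \
    ...
    --tensorboard_dir output/runs \
    --tensorboard_log_interval 1 \
    --log_timers_to_tensorboard true \
    --log_memory_to_tensorboard false
```
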
@@ -183,7 +183,7 @@
- activation_func_clamp_value: limits the range of the linear_fc1 output inside the activation function. Only used when `activation_func` is `quick_gelu`. Defaults to None.
- glu_linear_offset: offset term in the GLU activation: `activation_func(x[0]) * (x[1] + offset)`. Only used when gated_linear_unit is True. Defaults to 0.
- untie_embeddings_and_output_weights: unties the embedding and output weights. Defaults to True.
- disable_bias_linear: disables bias in linear layers. Defaults to True.
- add_bias_linear: enables bias in linear layers. Defaults to True.
- add_qkv_bias: adds bias only to the QKV linear layers. Defaults to True.
- attention_dropout: defaults to 0.
- hidden_dropout: defaults to 0.
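
These structural values are normally derived from the model's config.json; the sketch below only shows what an explicit override would look like and is not taken from this PR:

```shell
# Hypothetical explicit overrides; normally these come from config.json.
megatron sft \
    ...
    --add_bias_linear false \
    --add_qkv_bias true \
    --attention_dropout 0. \
    --hidden_dropout 0.
```
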
@@ -199,9 +199,9 @@


**MoE parameters**:
- num_experts: number of MoE experts. Defaults to None. Read automatically from config.json.
- num_moe_experts: number of MoE experts. Defaults to None. Read automatically from config.json.
- moe_layer_freq: distribution frequency of MoE layers versus Dense layers. Defaults to None. Read from config.json.
- moe_ffn_hidden_size: hidden size of each expert's feed-forward network (ffn). Defaults to None and is read automatically from config.json. If it cannot be read and `num_experts` is not None, it is set to ffn_hidden_size.
- moe_ffn_hidden_size: hidden size of each expert's feed-forward network (ffn). Defaults to None and is read automatically from config.json. If it cannot be read and `num_moe_experts` is not None, it is set to ffn_hidden_size.
- moe_shared_expert_intermediate_size: total FFN hidden size of the shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Defaults to None. Read automatically from config.json.
- moe_router_topk: number of experts each token is routed to. Defaults to None. Read automatically from config.json.
- moe_router_num_groups: number of groups the experts are divided into, for group-limited routing. See DeepSeek-V2 and DeepSeek-V3. Defaults to None. Read automatically from config.json.
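
Likewise, the MoE values are normally read from config.json; an explicit override, shown purely as a hypothetical, might look like this:

```shell
# Hypothetical override of MoE hyperparameters; all values are placeholders.
megatron sft \
    ...
    --num_moe_experts 64 \
    --moe_ffn_hidden_size 1408 \
    --moe_router_topk 8
```
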
docs/source/Megatron-SWIFT/Mcore-Bridge.md (9 changes: 5 additions & 4 deletions)
@@ -285,9 +285,10 @@ megatron export \
```python
import torch

from swift.megatron import MegatronArguments, convert_hf_config, get_megatron_model_meta
from swift.megatron import (
MegatronArguments, convert_hf_config, get_megatron_model_meta, initialize_megatron
)
from swift.model import get_processor
from megatron.training.initialize import initialize_megatron

model_id = 'Qwen/Qwen3-4B-Instruct-2507'
processor = get_processor(model_id, download_model=True)
@@ -327,10 +328,10 @@ Loading, exporting, and saving LoRA weights work in the same way; run `CUDA_VISIBLE_DEVICES=0,1,2,
import torch

from swift.megatron import (
MegatronArguments, convert_hf_config, get_megatron_model_meta, prepare_mcore_model
MegatronArguments, convert_hf_config, get_megatron_model_meta,
prepare_mcore_model, initialize_megatron
)
from swift.model import get_processor
from megatron.training.initialize import initialize_megatron

model_id = 'Qwen/Qwen3-30B-A3B-Instruct-2507'
processor = get_processor(model_id, download_model=True)
docs/source/Megatron-SWIFT/Quick-start.md (2 changes: 1 addition & 1 deletion)
@@ -29,7 +29,7 @@ pip install pybind11
pip install --no-build-isolation transformer_engine[pytorch]

# apex
# Note: Megatron-SWIFT can run in an environment without apex; simply set `--no_gradient_accumulation_fusion true` in addition.
# Note: Megatron-SWIFT can run in an environment without apex; simply set `--gradient_accumulation_fusion false` in addition.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
docs/source_en/Instruction/Frequently-asked-questions.md (2 changes: 1 addition & 1 deletion)
@@ -592,7 +592,7 @@ Saving checkpoints per epoch is not yet supported.
```text
RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.
```
Set `--no_gradient_accumulation_fusion true`.
Set `--gradient_accumulation_fusion false`.

### Q163: For MoE LoRA training, if target_modules is set to all-linear, does this include the router modules?
It depends on whether the gate is implemented as nn.Linear. If it's an nn.Parameter, it won't be trained. For details, see the command-line parameter [target_parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#tuner-arguments).
docs/source_en/Megatron-SWIFT/Ascend.md (8 changes: 4 additions & 4 deletions)
@@ -186,7 +186,7 @@ def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_par

### Enable

Additionally, since msprobe does not support fusion computation, you need to add `--no_bias_dropout_fusion True`, `--no_bias_swiglu_fusion True`, `--cross_entropy_loss_fusion False` to the launch script.
Additionally, since msprobe does not support fusion computation, you need to add `--bias_dropout_fusion false`, `--bias_swiglu_fusion false`, `--cross_entropy_loss_fusion false` to the launch script.

#### Example
```shell
@@ -200,7 +200,7 @@
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
...
--no_bias_dropout_fusion True \
--no_bias_swiglu_fusion True \
--cross_entropy_loss_fusion False
--bias_dropout_fusion false \
--bias_swiglu_fusion false \
--cross_entropy_loss_fusion false
```