Changes from all commits (51 commits)
aae5443 update (Jintao-Huang, Jan 30, 2026)
4d21203 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 2, 2026)
4939ccb remove megatron.training (Jintao-Huang, Feb 2, 2026)
bcfe512 update (Jintao-Huang, Feb 2, 2026)
332e184 update (Jintao-Huang, Feb 2, 2026)
e58dde8 update (Jintao-Huang, Feb 3, 2026)
7084f77 update (Jintao-Huang, Feb 3, 2026)
f383eb2 update (Jintao-Huang, Feb 3, 2026)
92ab184 update (Jintao-Huang, Feb 3, 2026)
a5fea81 update (Jintao-Huang, Feb 3, 2026)
12c39bc update (Jintao-Huang, Feb 3, 2026)
899f172 update (Jintao-Huang, Feb 3, 2026)
6d918d8 update (Jintao-Huang, Feb 3, 2026)
62f82b8 update (Jintao-Huang, Feb 3, 2026)
1a40342 update (Jintao-Huang, Feb 3, 2026)
349f6e5 fix (Jintao-Huang, Feb 3, 2026)
81e6427 update (Jintao-Huang, Feb 4, 2026)
7233e9a update (Jintao-Huang, Feb 4, 2026)
7b8e28c update (Jintao-Huang, Feb 4, 2026)
29f1440 update (Jintao-Huang, Feb 4, 2026)
5014e4c fix (Jintao-Huang, Feb 4, 2026)
0d96b1a Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 4, 2026)
46a23cc update (Jintao-Huang, Feb 4, 2026)
2cf83b9 update (Jintao-Huang, Feb 5, 2026)
c2b385a update (Jintao-Huang, Feb 5, 2026)
b60b233 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 5, 2026)
5503a1b update (Jintao-Huang, Feb 5, 2026)
a7a1118 update (Jintao-Huang, Feb 5, 2026)
69b8c25 update (Jintao-Huang, Feb 5, 2026)
0a2d2d6 update (Jintao-Huang, Feb 5, 2026)
6bacd38 update (Jintao-Huang, Feb 5, 2026)
12adf2a fix (Jintao-Huang, Feb 5, 2026)
198f549 update (Jintao-Huang, Feb 5, 2026)
6e51567 update (Jintao-Huang, Feb 5, 2026)
cb0b219 update (Jintao-Huang, Feb 5, 2026)
c9a7e16 update (Jintao-Huang, Feb 6, 2026)
96b4169 update (Jintao-Huang, Feb 6, 2026)
39ac20f update (Jintao-Huang, Feb 6, 2026)
071bfb0 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 6, 2026)
2eee590 update (Jintao-Huang, Feb 6, 2026)
c6238ea update (Jintao-Huang, Feb 6, 2026)
ba23ee1 update (Jintao-Huang, Feb 6, 2026)
c17b863 update (Jintao-Huang, Feb 6, 2026)
08d3cc0 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 6, 2026)
f3bb068 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 7, 2026)
8e418e9 Merge remote-tracking branch 'refs/remotes/origin/refactor_megaron_sw…' (Jintao-Huang, Feb 7, 2026)
17a2df6 update (Jintao-Huang, Feb 7, 2026)
f314806 update (Jintao-Huang, Feb 7, 2026)
2088a44 update (Jintao-Huang, Feb 7, 2026)
0bb0d47 Merge branch 'main' into refactor_megaron_swift_v4 (Jintao-Huang, Feb 7, 2026)
002e6d9 update (Jintao-Huang, Feb 7, 2026)
docs/source/Instruction/Frequently-asked-questions.md (2 changes: 1 addition & 1 deletion)
@@ -592,7 +592,7 @@ megatron sft \
```text
RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.
```
Set `--no_gradient_accumulation_fusion true`.
Set `--gradient_accumulation_fusion false`.
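
To make the fix concrete, the sketch below shows where the corrected flag would sit in a launch command; the checkpoint and dataset names are placeholders and are not taken from this PR:

```shell
# Minimal sketch: only --gradient_accumulation_fusion false is the point here.
# The checkpoint and dataset are placeholders.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --gradient_accumulation_fusion false
```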

### Q163: For MoE LoRA training, if target_modules is set to all-linear, does this include the router modules?
It depends on whether the gate is implemented as nn.Linear; if it is an nn.Parameter, it is not trained. For details, see the command-line parameter [target_parameters](https://swift.readthedocs.io/zh-cn/latest/Instruction/Command-line-parameters.html#tuner).
docs/source/Megatron-SWIFT/Ascend.md (8 changes: 4 additions & 4 deletions)
@@ -182,7 +182,7 @@ def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_par

### Enable

Additionally, since msprobe does not support fusion computation, you need to add `--no_bias_dropout_fusion True`, `--no_bias_swiglu_fusion True`, `--cross_entropy_loss_fusion False` to the launch script.
Additionally, since msprobe does not support fusion computation, you need to add `--bias_dropout_fusion false`, `--bias_swiglu_fusion false`, `--cross_entropy_loss_fusion false` to the launch script.

#### Example
```shell
@@ -196,7 +196,7 @@
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
...
--no_bias_dropout_fusion True \
--no_bias_swiglu_fusion True \
--cross_entropy_loss_fusion False
--bias_dropout_fusion false \
--bias_swiglu_fusion false \
--cross_entropy_loss_fusion false
```
docs/source/Megatron-SWIFT/Command-line-parameters.md (24 changes: 12 additions & 12 deletions)
@@ -23,12 +23,12 @@
- 🔥max_epochs: specifies the number of training epochs. With a non-streaming dataset, this parameter automatically computes train_iters for you, so `train_iters` does not need to be passed manually. With a streaming dataset, training is forced to stop once `max_epochs` is reached, and the weights are then validated and saved. Defaults to None.
- 🔥log_interval: logging interval (unit: iters). Defaults to 5.
- tensorboard_dir: directory where tensorboard logs are written. Defaults to None, i.e. stored under the `f'{save}/runs'` directory.
- no_masked_softmax_fusion: defaults to False. Disables the fusion of query_key_value scaling, masking, and softmax.
- no_bias_dropout_fusion: defaults to False. Disables the fusion of bias and dropout.
- no_bias_swiglu_fusion: defaults to False. Specify `--no_bias_dropout_fusion true` to disable the fusion of bias and swiglu.
- no_rope_fusion: defaults to False. Specify `--no_rope_fusion true` to disable rope fusion.
- **When using positional encodings that do not support rope_fusion, such as mrope, this parameter is automatically set to True**.
- no_gradient_accumulation_fusion: defaults to False. Specify `--no_gradient_accumulation_fusion true` to disable gradient accumulation fusion.
- masked_softmax_fusion: defaults to True. Enables the fusion of query_key_value scaling, masking, and softmax.
- bias_dropout_fusion: defaults to True. Enables the fusion of bias and dropout.
- bias_swiglu_fusion: defaults to True. Enables the fusion of bias and swiglu.
- apply_rope_fusion: defaults to True. Enables rope fusion.
- **When using positional encodings that do not support rope_fusion, such as mrope, this parameter is automatically set to False**.
- gradient_accumulation_fusion: defaults to True. Enables gradient accumulation fusion.
- 🔥cross_entropy_loss_fusion: enables cross-entropy loss fusion. Defaults to False.
- cross_entropy_fusion_impl: implementation of the cross-entropy loss fusion. Options are 'native' and 'te'. Defaults to 'native'.
- calculate_per_token_loss: scales the cross-entropy loss by the number of non-padding tokens in the global batch. Defaults to True.
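
As a hedged illustration of how the renamed fusion switches and the loss-fusion options above fit into a launch command (every value below is a placeholder, not a recommendation from this PR):

```shell
# Illustrative sketch only; all values are placeholders.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --max_epochs 3 \
    --log_interval 5 \
    --bias_dropout_fusion true \
    --cross_entropy_loss_fusion true \
    --cross_entropy_fusion_impl te
```
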
@@ -53,7 +53,7 @@
- seed: random seed for python, numpy, pytorch and cuda. Defaults to 42.
- 🔥num_workers: number of dataloader workers. Defaults to 4.
- Note: if `--streaming true` is set, this is set to 1.
- no_data_sharding: takes effect on the train_dataloader when `--train_dataloader_shuffle true`. Defaults to False. Controls the scope of dataset shuffling: if set to False, the dataset is sharded first and each shard is then shuffled (slightly lower memory use); if set to True, the dataset is shuffled first and then sharded (better shuffling). Using this parameter requires "ms-swift>=3.12".
- data_sharding: takes effect on the train_dataloader when `--train_dataloader_shuffle true`. Defaults to False. Controls the scope of dataset shuffling: if set to True, the dataset is sharded first and each shard is then shuffled (slightly lower memory use); if set to False, the dataset is shuffled first and then sharded (better shuffling).
- seq_length: defaults to None, i.e. set to `max_length`. To limit dataset length, it is recommended to use `--max_length` from the Basic Arguments instead; there is no need to set this parameter.
- use_cpu_initialization: initializes weights on the CPU. Defaults to False. Used when converting weights between HF and MCore. Usually there is no need to change it.
- 🔥megatron_extra_kwargs: additional arguments to pass through to megatron, provided as JSON. Defaults to None.
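
A sketch of combining the shuffling, sharding, and pass-through options above; the `manual_gc` key inside `--megatron_extra_kwargs` is a hypothetical example of an option forwarded to Megatron, not something prescribed by this PR:

```shell
# Illustrative sketch; the extra-kwargs key is hypothetical.
megatron sft \
    --load Qwen2.5-7B-Instruct-mcore \
    --dataset 'swift/self-cognition#500' \
    --train_dataloader_shuffle true \
    --data_sharding true \
    --megatron_extra_kwargs '{"manual_gc": true}'
```
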
@@ -96,7 +96,7 @@
- Note: for **resuming from a checkpoint** you need to set `--load` (LoRA training additionally requires `--adapter_load`). If `--finetune true` is set, the optimizer state and random seed state are not loaded, the iteration count is reset to 0, and no dataset skipping is performed; if `--finetune false` is set, the iteration count is read and the previously trained portion of the dataset is skipped, while loading of the optimizer state and random seed state is controlled by `--no_load_optim` and `--no_load_rng`.
- Streaming datasets (`--streaming`) do not yet support dataset skipping.
- ckpt_format: checkpoint format. Options are 'torch', 'torch_dist', 'zarr'. Defaults to 'torch_dist'. (Weight conversion currently only supports the 'torch_dist' format.)
- no_initialization: do not initialize the weights. Defaults to True.
- perform_initialization: initialize the weights. Defaults to False.
- auto_detect_ckpt_format: automatically detects whether the ckpt format is legacy or distributed. Defaults to True.
- exit_on_missing_checkpoint: if `--load` is set but **the checkpoint cannot be found, exit directly** instead of initializing. Defaults to True.
- 🔥async_save: uses asynchronous checkpoint saving. Currently only applies to the `torch_dist` distributed checkpoint format. Defaults to False.
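
A hedged sketch of resuming an interrupted run with the checkpoint arguments described above; the checkpoint path is a placeholder:

```shell
# Resume training: --finetune false reads the iteration count and skips
# already-seen data; optimizer and RNG state loading stay enabled by default.
megatron sft \
    --load output/checkpoint-last \
    --finetune false \
    ...
# For LoRA runs, additionally pass --adapter_load with the adapter checkpoint path.
```
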
@@ -135,7 +135,7 @@
- tensorboard_log_interval: interval (in steps) for logging to tensorboard. Defaults to 1.
- tensorboard_queue_size: size of the TensorBoard queue used to buffer events and summaries; once the number of pending events and summaries reaches this size, the next call to an "add"-type method flushes the data to disk. Defaults to 50.
- log_timers_to_tensorboard: logs timers to tensorboard. Defaults to True.
- no_log_learning_rate_to_tensorboard: do not log the learning rate to tensorboard. Defaults to False.
- log_learning_rate_to_tensorboard: logs the learning rate to tensorboard. Defaults to True.
- log_validation_ppl_to_tensorboard: writes validation perplexity to tensorboard. Defaults to True.
- log_memory_to_tensorboard: writes memory logs to tensorboard. Defaults to True.
- logging_level: logging level. Defaults to None.
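
For completeness, a small sketch of setting the TensorBoard switches above explicitly; the directory and values are illustrative only:

```shell
# Illustrative logging setup; the directory name is a placeholder.
megatron sft \
    ...
    --tensorboard_dir output/runs \
    --tensorboard_log_interval 1 \
    --log_timers_to_tensorboard true \
    --log_memory_to_tensorboard false
```
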
@@ -183,7 +183,7 @@
- activation_func_clamp_value: limits the range of the linear_fc1 output inside the activation function. Only used when `activation_func` is `quick_gelu`. Defaults to None.
- glu_linear_offset: offset term in the GLU activation: `activation_func(x[0]) * (x[1] + offset)`. Only used when gated_linear_unit is True. Defaults to 0.
- untie_embeddings_and_output_weights: unties the embedding and output weights. Defaults to True.
- disable_bias_linear: disables bias in linear layers. Defaults to True.
- add_bias_linear: enables bias in linear layers. Defaults to True.
- add_qkv_bias: adds bias only to the QKV linear layers. Defaults to True.
- attention_dropout: defaults to 0.
- hidden_dropout: defaults to 0.
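
These structural values are normally derived from the model's config.json; the sketch below only shows what an explicit override would look like and is not taken from this PR:

```shell
# Hypothetical explicit overrides; normally these come from config.json.
megatron sft \
    ...
    --add_bias_linear false \
    --add_qkv_bias true \
    --attention_dropout 0. \
    --hidden_dropout 0.
```
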
@@ -199,9 +199,9 @@


**MoE parameters**:
- num_experts: number of MoE experts. Defaults to None. Read automatically from config.json.
- num_moe_experts: number of MoE experts. Defaults to None. Read automatically from config.json.
- moe_layer_freq: distribution frequency of MoE layers versus Dense layers. Defaults to None. Read from config.json.
- moe_ffn_hidden_size: hidden size of each expert's feed-forward network (ffn). Defaults to None and is read automatically from config.json. If it cannot be read and `num_experts` is not None, it is set to ffn_hidden_size.
- moe_ffn_hidden_size: hidden size of each expert's feed-forward network (ffn). Defaults to None and is read automatically from config.json. If it cannot be read and `num_moe_experts` is not None, it is set to ffn_hidden_size.
- moe_shared_expert_intermediate_size: total FFN hidden size of the shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Defaults to None. Read automatically from config.json.
- moe_router_topk: number of experts each token is routed to. Defaults to None. Read automatically from config.json.
- moe_router_num_groups: number of groups the experts are divided into, for group-limited routing. See DeepSeek-V2 and DeepSeek-V3. Defaults to None. Read automatically from config.json.
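
Likewise, the MoE values are normally read from config.json; an explicit override, shown purely as a hypothetical, might look like this:

```shell
# Hypothetical override of MoE hyperparameters; all values are placeholders.
megatron sft \
    ...
    --num_moe_experts 64 \
    --moe_ffn_hidden_size 1408 \
    --moe_router_topk 8
```
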
docs/source/Megatron-SWIFT/Mcore-Bridge.md (9 changes: 5 additions & 4 deletions)
@@ -285,9 +285,10 @@ megatron export \
```python
import torch

from swift.megatron import MegatronArguments, convert_hf_config, get_megatron_model_meta
from swift.megatron import (
MegatronArguments, convert_hf_config, get_megatron_model_meta, initialize_megatron
)
from swift.model import get_processor
from megatron.training.initialize import initialize_megatron

model_id = 'Qwen/Qwen3-4B-Instruct-2507'
processor = get_processor(model_id, download_model=True)
@@ -327,10 +328,10 @@ Loading, exporting, and saving LoRA weights work in the same way; run `CUDA_VISIBLE_DEVICES=0,1,2,
import torch

from swift.megatron import (
MegatronArguments, convert_hf_config, get_megatron_model_meta, prepare_mcore_model
MegatronArguments, convert_hf_config, get_megatron_model_meta,
prepare_mcore_model, initialize_megatron
)
from swift.model import get_processor
from megatron.training.initialize import initialize_megatron

model_id = 'Qwen/Qwen3-30B-A3B-Instruct-2507'
processor = get_processor(model_id, download_model=True)
docs/source/Megatron-SWIFT/Quick-start.md (2 changes: 1 addition & 1 deletion)
@@ -29,7 +29,7 @@ pip install pybind11
pip install --no-build-isolation transformer_engine[pytorch]

# apex
# Note: Megatron-SWIFT can run in an environment without apex; simply set `--no_gradient_accumulation_fusion true` in addition.
# Note: Megatron-SWIFT can run in an environment without apex; simply set `--gradient_accumulation_fusion false` in addition.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
docs/source_en/Instruction/Frequently-asked-questions.md (2 changes: 1 addition & 1 deletion)
@@ -592,7 +592,7 @@ Saving checkpoints per epoch is not yet supported.
```text
RuntimeError: ColumnParallelLinear was called with gradient_accumulation_fusion set to True but the custom CUDA extension fused_weight_gradient_mlp_cuda module is not found. To use gradient_accumulation_fusion you must install APEX with --cpp_ext and --cuda_ext. For example: pip install --global-option="--cpp_ext" --global-option="--cuda_ext ." Note that the extension requires CUDA>=11. Otherwise, you must turn off gradient accumulation fusion.
```
Set `--no_gradient_accumulation_fusion true`.
Set `--gradient_accumulation_fusion false`.

### Q163: For MoE LoRA training, if target_modules is set to all-linear, does this include the router modules?
It depends on whether the gate is implemented as nn.Linear. If it's an nn.Parameter, it won't be trained. For details, see the command-line parameter [target_parameters](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#tuner-arguments).
docs/source_en/Megatron-SWIFT/Ascend.md (8 changes: 4 additions & 4 deletions)
@@ -186,7 +186,7 @@ def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_par

### Enable

Additionally, since msprobe does not support fusion computation, you need to add `--no_bias_dropout_fusion True`, `--no_bias_swiglu_fusion True`, `--cross_entropy_loss_fusion False` to the launch script.
Additionally, since msprobe does not support fusion computation, you need to add `--bias_dropout_fusion false`, `--bias_swiglu_fusion false`, `--cross_entropy_loss_fusion false` to the launch script.

#### Example
```shell
@@ -200,7 +200,7 @@
'swift/self-cognition#500' \
--tensor_model_parallel_size 2 \
...
--no_bias_dropout_fusion True \
--no_bias_swiglu_fusion True \
--cross_entropy_loss_fusion False
--bias_dropout_fusion false \
--bias_swiglu_fusion false \
--cross_entropy_loss_fusion false
```