MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

# Authors
- Dennis Liu∗ Zijie Yan∗ Xin Yao Tong Liu
Vijay Korthikanti Evan Wu Shiqing Fan Gao Deng Hongxiao Bai
Jianbin Chang Ashwath Aithal Michael Andersch Mohammad Shoeybi
Jiajie Yao Chandler Zhou David Wu Xipeng Li June Yang †
 - NVIDIA

# Abstract
- efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies
- five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism
- Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations.
- Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism.
- Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods

# 1 Introduction
- Different parallelism strategies have been proposed in recent years for distributed LLM training, including model parallelism, data parallelism, and pipeline parallelism [31; 27; 19]. However, a single parallelism strategy has limitations regarding scalability. For example, the performance of data parallelism with ZeRO-3 will decrease dramatically when the number of GPUs increases to several thousands [21].
- While 3D parallelism is widely adopted for training large-scale dense models, optimizing training efficiency for MoE models using hybrid parallelism presents greater complexity. This is primarily due to the inherent sparsity of MoE models, which results in a significantly lower computation-to-parameter ratio compared to dense models. Employing a small degree of model parallelism often leads to out-of-memory issues for MoE models, whereas a large degree introduces substantial communication overhead and diminishes computational efficiency.
- **Moreover, the distinct computational characteristics of the Attention and Feed-Forward Network (FFN) layers in MoE models render a uniform parallelism strategy across these layers suboptimal.** -> 이게 결국 하고자 하는 말의 모티브
- To address these challenges, we propose an end-to-end training framework for large-scale MoE models based on 5-D hybrid parallelism, which integrates five key parallelism dimensions: Tensor Parallelism(TP), Expert Parallelism(EP), Context Parallelism(CP), Data Parallelism(DP), and Pipeline Parallelism(PP).
- At the core of our framework are two innovations: `MoE Parallel Folding` and an `efficient token-level dispatcher`
 - MoE Parallel Folding is a novel hybrid parallelism strategy that disentangles the parallel mappings of the Attention and MoE components in Transformer-based models.
 - This dispatcher accommodates both token-dropping and token-dropless training paradigms, eliminates sequence length dependencies, and enables dynamic tensor shapes, thereby facilitating the implementation of complex parallelism schemes.
- Contributions
 - 1. MoE Parallel Folding: We introduce MoE Parallel Folding, the first approach that decouples parallelization strategies for attention and MoE layers, enabling each layer to adopt its own optimal configurations. This method enables the folding of communication-intensive parallel dimensions to fit within high-bandwidth intra-node networks, reducing communication overhead.
 - 2. Flexible and efficient token-level dispatcher: We develop a novel dispatcher that supports both token-dropping and token-dropless MoE training with five-dimensional hybrid parallelism, including TP, EP, CP, DP, and PP.
 - 3. 3. Performance enhancements: Through MoE Parallel Folding, we demonstrate significant improvements in training efficiency and scalability for large-scale MoE models. By optimizing the utilization of network resources based on model characteristics, we achieve 49.3% MFU for Mixtral 8x22B and 39.0% MFU for Qwen2-57B-A14B on H100 GPUs.

# 2 Related Work
## 2.1 Mixture of Experts
- Addressing challenges such as load balancing among experts and managing dynamic computation graphs is critical for MoE architectures. Traditional methods often employ token-dropping training [7 / Switch Transformer], setting a capacity factor for each expert to prevent overloading. While this mitigates performance bottlenecks, it can result in some tokens being dropped or not fully processed, potentially affecting model quality. In contrast, Megablocks [8] utilizes token-dropless training to ensure all input tokens are processed, demonstrating better performance for models of equivalent size and training data by avoiding the loss of information inherent in token dropping.
 - Megablock paper도 좀 봐야겠군.. Trevor Gale, Deepak Narayanan, Cliff Young, and Matei A. Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. ArXiv, abs/2211.15841, 2022.


## 2.2 Distributed MoE Training (공부하기 좋게 정리되어있네, 참고논문 체크)
- TP can significantly reduce the memory consumption of each model rank but introduces some intra-layer communication overhead.
- DP distributes batches of data across replicas of the model on different devices, aggregating gradients during training
- Zero Redundancy Optimizer(ZeRO) further splits optimizer states, model weights and gradients across DP group to trade memory with communication
- CP splits the input sequences into small segments for each device, allowing for very long sequence length training
- PP splits[6;21; 19; 20] the model layers across devices, enabling different stages of the model to process data concurrently in a pipelined fashion
- In the context of MoE models, EP is employed to optimize MoE training by assigning different experts to different devices[4; 7; 1; 15; 30]. 
 - During training, the routing mechanism directs inputs to the appropriate experts across devices
 - EP efficiently utilizes hardware resources by balancing the computational load and reducing inter-device communication overhead associated with expert data exchanges.
 - To further enhance the efficiency of distributed MoE training, hybrid parallelism strategies are leveraged which combines EP with other parallelism methods, like FSDP and TP

# 3 Method
## 3.1 Preliminary
### 3.1.1 Mixture of Experts
- Capacity Factor (CF)를 넘는 토큰이 drop되는거군.. 그러니까 expert당 예상 평균 토큰 * CF 숫자보다 넘는 토큰들은 드랍된다

<img width="1248" height="1142" alt="Image" src="https://github.com/user-attachments/assets/8b18f4f7-9aac-44a3-90fe-3af143787d98" />

### 3.1.2 Expert Parallelism

- The EP process comprises three key stages
 - T**oken Dispatching**: Initially, input tokens are grouped according to their assigned experts through `data permutation`, ensuring tokens destined for the same expert are stored contiguous in memory. An All-to-All collective communication operation then exchanges token data between devices, allowing each device to receive only the tokens required by its locally hosted experts
 - 토큰 디스패칭(Token Dispatching) 단계에서 data permutation 이후에 All-to-All 집단 통신 연산을 수행하는 이유는 토큰들을 실제로 해당 엑스퍼트가 위치한 장치(device)로 이동시키기 위함
 - 2. 데이터 순열(Permutation)의 역할 (장치 내 분류): 토큰 디스패칭의 첫 단계인 data permutation은 입력 토큰들을 해당 토큰이 할당된 엑스퍼트에 따라 그룹화하여 메모리에 인접하게(contiguous) 저장되도록 보장합니다. 이 과정은 토큰을 분류하고 정렬하는 역할을 하지만, 토큰의 물리적 위치가 필요한 엑스퍼트가 있는 장치로 바뀐 것은 아닙니다.
 - 3. All-to-All 통신의 역할 (장치 간 이동): data permutation이 완료된 후, 토큰들은 여전히 원래의 장치에 남아있을 수 있습니다. 이 토큰들을 실제로 처리할 엑스퍼트는 네트워크상 다른 장치에 존재할 수 있습니다. 따라서 All-to-All 집단 통신 연산을 수행하여 장치들 사이에 토큰 데이터를 교환해야 합니다.
 - 이 통신을 통해 각 장치는 자신이 현재 가지고 있는 토큰 중 다른 장치의 엑스퍼트에게 필요한 토큰을 보내고, 동시에 자신이 로컬로 호스팅하는 엑스퍼트가 필요로 하는 토큰만을 다른 장치들로부터 수신하게 됩니다. 결론적으로, data permutation은 토큰을 분류하는 단계이고, All-to-All은 분류된 토큰들을 분산된 엑스퍼트의 위치로 실제로 전달하는 단계이며, 이 두 단계가 합쳐져 토큰 디스패칭 프로세스를 구성합니다.

 - **Expert Computation**: Each device processes its local batch of tokens through its designated experts in parallel. Since experts operate independently, this stage requires no inter-device communication, allowing for efficient concurrent computation across the distributed system
 - **Token Restore**: After expert processing, the output tokens are rearranged to restore their original sequence order through an inverse permutation operation (시퀀스 순서 맞추기 위해서 다시 permutation을 거꾸로 적용시켜주는게 필요하다는것!) This restoration step ensures proper alignment for subsequent layer operations while maintaining the model’s sequential processing requirements. The restored tokens can then flow into the next layer of the network (• 왜 필요한가? MoE 레이어의 다음 단계는 일반적으로 어텐션 레이어(Attention layer) 또는 다음 트랜스포머 블록이 됩니다. 이러한 후속 레이어들은 입력 토큰들이 순서대로 정렬되어 있어야만 올바른 행렬 연산을 수행할 수 있습니다. 정렬이 틀어지면 모델의 계산 정확성(numerical correctness)이 깨지게 됩니다.)

## 3.2 MoE Parallel Folding
- Attention operations are performed at the whole-sequence level with dense computation,
requiring information exchange between devices holding sub-sequences when using TP and CP.
- In contrast, MoE layers process individual tokens rather than whole sequences, and their inherent sparsity makes them more suitable for EP with lower communication overhead.

```
MoE 레이어가 EP에 더 적합하고 통신 오버헤드가 낮은 주요 이유는 **"개별 토큰 처리"**와 "내재된 희소성(inherent sparsity)" 때문입니다.
• 개별 토큰 처리 (Token-level processing): MoE 레이어는 전체 시퀀스(whole sequences) 대신 개별 토큰(individual tokens)을 처리합니다.
• 희소성 (Sparsity): MoE 모델은 라우팅 메커니즘을 통해 각 입력 토큰에 대해 가장 관련성이 높은 소수의 엑스퍼트만 동적으로 선택하여 활성화합니다. 즉, 계산에 참여하는 파라미터가 전체 파라미터 중 일부에 불과합니다.
2. EP가 통신 오버헤드를 낮추는 원리
EP는 이러한 희소성을 활용하여 통신을 최소화합니다.
• 엑스퍼트 분산 및 계산 부하 균형: EP는 서로 다른 엑스퍼트들을 각기 다른 장치(device)에 할당하여 계산 부하를 효율적으로 분산시킵니다.
• 필요한 토큰만 전송: 훈련 중, 라우팅 메커니즘이 입력 토큰을 해당 엑스퍼트가 있는 장치로 전송해야 합니다. 이 과정에서 All-to-All 집단 통신 연산이 수행되지만, 각 장치는 자신이 로컬로 호스팅하는 엑스퍼트가 필요로 하는 토큰만을 받게 됩니다. 즉, 모델 전체의 큰 텐서를 주고받는 것이 아니라, 활성화된 엑스퍼트로 향하는 소수의 토큰 데이터만 이동합니다.
• 엑스퍼트 계산의 독립성: EP의 엑스퍼트 계산(Expert Computation) 단계에서는 엑스퍼트들이 독립적으로 작동하므로 장치 간 통신이 전혀 필요 없습니다 (no inter-device communication).
EP는 계산 부하를 분산시키고 엑스퍼트 데이터 교환과 관련된 장치 간 통신 오버헤드를 줄이는 데 효율적입니다.
```

- Consequently, forcing MoE layers to follow the same parallelism mapping as Attention layers is sub-optimal
- To achieve optimal hybrid parallelism for MoE models, we propose MoE Parallel Folding, which disentangles the parallel mappings between Attention and MoE layers


<img width="2460" height="1002" alt="Image" src="https://github.com/user-attachments/assets/c7520528-42f5-444b-8d0e-efc0d97233c6" />

- previous methods place the EP group in a sub-group of DP, which greatly restricts the scalability of MoE. The maximum degree of expert parallelism is bounded by the degree of data parallelism.
- Instead, we flatten the parallelism mappings of the attention layer and allow model parallelism in the MoE layer to be folded with arbitrary sub-groups of attention, making the parallelism mappings of MoE layer more flexible and efficient. (즉 기존의 TPCPDPPP등에 썼던 GPU rank를 쭉 flatten해서 펼쳐서 MoE에서는 다시 정의해서 쓰겠다가 핵심인듯..)

```
"Instead, we flatten the parallelism mappings of the attention layer..." (대신, 우리는 어텐션 레이어의 병렬화 매핑을 펼치고...)
이는 병렬 처리 그룹을 더 유연하게 재구성한다는 의미입니다. MoE 병렬 폴딩은 **어텐션과 MoE 레이어의 병렬 매핑을 분리(decouple)**하는 새로운 하이브리드 병렬화 전략입니다.
어텐션 레이어에 대해, 이 프레임워크는 TP × CP × DP × PP로 구성된 4차원 병렬 그룹을 형성합니다. 여기서 'flatten'은 이러한 차원들을 평평하게 배열하여 MoE 레이어와 유연하게 연결할 수 있는 기반을 마련하는 것으로 이해할 수 있습니다.
"...and allow model parallelism in the MoE layer to be folded with arbitrary sub-groups of attention..." (그리고 MoE 레이어의 모델 병렬 처리가 어텐션의 임의의 서브 그룹과 폴딩(folding)되도록 허용합니다.)
이것이 MoE Parallel Folding의 핵심 아이디어입니다.
• 모델 병렬 처리(Model Parallelism) in MoE: MoE 레이어에서는 주로 **EP(Expert Parallelism)**와 경우에 따라 **ETP(Expert-TP)**가 모델 병렬 처리 역할을 합니다.
• 폴딩(Folding): 이는 MoE 레이어의 병렬 그룹(주로 EP 그룹)이 어텐션 레이어의 병렬 그룹(TP, CP, DP) 내의 임의의 서브 그룹과 겹쳐지거나 통합되어 구성될 수 있음을 의미합니다.
이러한 분리와 폴딩을 통해 (예를 들어, 어텐션 레이어의 TP 그룹과 MoE 레이어의 EP 그룹을 독립적으로 설정하고 조정할 수 있도록) 두 레이어에 최적의 병렬 구성을 독립적으로 채택할 수 있게 됩니다. 예를 들어, EP는 ETP보다 통신 효율성이 더 높기 때문에 ETP를 EP로 대체하고 이를 어텐션 레이어의 TP와 '폴딩'할 수 있습니다.
```
- For the attention layers, we form a four-dimensional parallel group comprising TP ×
CP ×DP ×PP. 
- For the MoE layers, we define another four-dimensional group consisting of
TP ×EP ×DP ×PP. For convenience, we name the TP and DP group for MoE layer as Expert-TP(ETP) and Expert-DP(EDP). 
 - Attn에서는 TPCPDPPP지만, MoE에서는 CP는 사실 시퀀스가 의미가 없이 토큰레벨 처리되니까 CP 대신 EP가 들어가서, ETP, EP, EDP, PP라고 보면 될듯!
- The only restriction is that the number of PP groups and members of each PP group for the Attention and MoE layer must be consistent
 - PP 그룹 안에서는 일치해야함
- MoE Parallel Folding provides two main benefits (결국 겹쳐놓으면 intra layer comm이 줄어서 좋다는 것과, (TP, CP를 안써서 그런가..) Attn과 별개로 MoE에서 최적의 병렬화를 할수 있다는것)
- First, it allows selecting the optimal parallelism mapping for the MoE layer independently of the Attention layer. For example, EP is more communication-efficient than ETP. We can replace ETP with EP and fold it with TP in the Attention layer. 
- Second, the folded parallelism mappings enable communication within more compact groups (폴딩 전략은 모델 병렬화 차원(예: MoE의 EP 그룹)을 어텐션 레이어의 병렬화 차원(예: CP 그룹) 내에서 전략적으로 통합 및 재배열합니다. ◦ 이는 통신을 수행해야 하는 장치 그룹을 물리적으로 더 가깝게(compact) 구성하여, 해당 레이어 내 통신이 노드 외부로 나가는 것을 방지합니다.). By folding model parallelism across attention and MoE layers, the scope of intra-layer communication is reduced, allowing it to fit within high-bandwidth intra-node connections more effectively

## 3.3 Flexible and Efficient Token Dispatcher
- To ensure numerical correctness while maintaining high performance under different parallelism strategies, we have designed a unified token dispatcher that handles both ETP and EP within the MoE layer
- With MoE Parallel Folding, the inputs fed into the MoE layer from the attention layer are split either along the batch dimension (DP) or the sequence dimensions (CP and TP). In both scenarios, different ranks contain different chunks of tokens. Since the expert layer computes the features of each token individually, we can employ the same workflow for the token dispatcher regardless of the parallelism mappings of the attention layer.

<img width="1220" height="504" alt="Image" src="https://github.com/user-attachments/assets/4ddbf80a-8769-4dd3-a539-a614d80d9915" />
(DP2, TP2인 그림이 앞부분이고, DP2니까 어떤 expert로 갈지가 다를순있음, 후반부는 ETP의 경우 결국 입력은 같고 연산만 달라야되니까 AG해준거고)

- In Figure 2, we illustrate the workflow of an MoE layer distributed across four GPUs, where the degrees of TP and ETP are both 2. GPU pairs (0, 1) and (2, 3) form the ETP group. GPU pairs (0, 2) and (1, 3) form the EP group
- The forward computation workflow proceeds as follows. 
 - First, the router determines the mapping of each token to its designated expert based on the local input and reorganizes the tokens assigned to the same expert into contiguous memory regions through a permutation operation.
 - Next, an All-to-All-V communication is executed across the EP groups to exchange tokens, ensuring that each token is delivered to its corresponding expert
 - Following this, an AllGather-V communication is performed within the ETP groups to guarantee that all members within an ETP group **share identical activations**
 - Once the AllGather-V communication is complete, each GPU computes its assigned partition of the expert feed-forward networks.
 - A subsequent ReduceScatter-V communication within the ETP groups aggregates and distributes the output hidden states, effectively reversing the AllGather operation
 - Another All-to-All-V communication is then employed to return the tokens to their original GPUs
 - Finally, an un-permutation operation restores the tokens to their initial order, preparing them for
further processing in the attention layer
- The backward workflow mirrors the forward process, with the AllGather/ReduceScatter (AG/RS) operations in the TP groups replaced by ReduceScatter/AllGather (RS/AG)
- MPI_Alltoallv 함수는 각 프로세스의 다양한 데이터 수를 허용하여 MPI_Alltoall 함수에 유연성을 더합니다

- We now elaborate on `the design of the router to support both token-dropping and token-dropless training paradigms`.
 - In token-dropless training, ensuring numerical correctness is straightforward, as token assignments remain consistent across different parallelism configurations. 
 - For token-dropping training, two potential strategies can be employed: 
 - full-sequence-based dropping
 - Full-sequence dropping ensures consistency by gathering logits from all ranks that collectively represent the entire sequence. However, this approach incurs significant communication overhead, particularly when sequences are distributed across multiple nodes
 - sub-sequence-based dropping
 - Sub-sequence dropping, on the other hand, makes dropping decisions based solely on the logits from the current sub-sequence. This strategy eliminates the need for gathering logits across ranks, thereby reducing communication overhead and alleviating load imbalance issues during token communication.
 - Empirically, we observe that **sub-sequence dropping** does not adversely affect model convergence compared to full-sequence dropping. Consequently, we adopt the sub-sequence dropping approach as the default strategy in this work.

```
Question was from "Full-sequence dropping incurs significant communication overhead, particularly when sequences are distributed across multiple nodes"
가설 1: CF를 넘긴다고 판단되면 하고 그 다음에 버려도 되는 거 아냐?
data permutation (데이터 순열)은 라우터가 토큰을 할당한 후 동일한 엑스퍼트로 향하는 토큰들을 메모리에서 인접하게 저장하는 단계입니다.
드롭핑 결정이 **전역적인 정보(full-sequence logits)**를 필요로 한다면, 이 전역 정보가 모이지 않은 상태에서 개별 장치가 로컬 정보만을 가지고 토큰을 permute하는 것은 정확하지 않거나 비효율적입니다.
• 만약 로컬에서 permute를 진행한 후 전역 정보를 모아 드롭핑을 결정한다면, 이미 permute된 토큰 중 일부를 버려야 하므로 permute 작업 자체가 비효율적으로 수행될 수 있습니다.
• 가장 중요한 것은, **라우터 결정(어떤 토큰을 Top-K로 선택할지)**과 CF에 따른 드롭핑 결정이 **토큰 디스패칭(Token Dispatching)**의 가장 초기에 통합되어 처리되어야 한다는 점입니다.
가설 2: 이미 expert에 보낸 후 계산 과정에서 스킵하거나?
소스에 따르면, 토큰 디스패칭의 목표는 "각 장치가 로컬로 호스팅하는 엑스퍼트가 필요로 하는 토큰만을 수신하도록" 토큰 데이터를 교환하는 것(All-to-All)입니다.
• 만약 토큰이 드롭될 운명이라면, 해당 토큰은 **네트워크 통신(All-to-All-V)**을 통해 엑스퍼트가 있는 장치로 전송될 필요가 없습니다. 엑스퍼트 계산 단계(Expert Computation)는 토큰을 받은 후에 병렬로 처리하며, 토큰을 불필요하게 전송하고 나서 계산을 스킵하는 것은 네트워크 대역폭을 낭비하게 됩니다.
• 따라서 드롭핑 결정은 All-to-All 통신 이전에 최종적으로 확정되어야 통신 오버헤드를 최소화할 수 있습니다.
3. 효율적인 대안: Sub-sequence Dropping
이러한 Full-sequence dropping의 통신 오버헤드 문제를 해결하기 위해, 해당 연구에서는 **서브 시퀀스 기반 드롭핑 (Sub-sequence dropping)**을 기본 전략으로 채택했습니다.
Sub-sequence dropping은 현재 서브 시퀀스에서 나온 로짓에만 의존하여 드롭핑을 결정하기 때문에, 랭크 간에 로짓을 수집할 필요가 없어 통신 오버헤드를 줄여줍니다. 경험적으로 이 방법이 모델 수렴에 부정적인 영향을 주지 않는 것으로 관찰되었기 때문입니다.
```

Q) drop된 토큰은 어떤값으로 대체되는가? 원래 값 그대로 residual하게 가는가 그게 궁금..

# 4 Experiments
## 4.1 Experimental Setup
- The Eos cluster consists of NVIDIA DGX H100 nodes, each equipped with eight NVIDIA H100 GPUs [24] and two 56-core
Intel Sapphire Rapids CPUs. Each GPU achieves a peak half-precision throughput of 989.5 TFLOP/s, and all GPUs are interconnected via NVLink 4th Generation [22] and InfiniBand [23]. The peak uni-directional communication bandwidths are 450 GB/s for intra-node (NVLink) and 400Gbps for inter-node (InfiniBand) connections
- We utilize PyTorch 2.5.0 and CUDA 12.6 for our experiments. All performance measurements reported in TFLOPS and MFU are conducted using BF16 precision. Up to 1024 GPUs are utilized in the scaling experiments
- We select two types of MoE models for our experiments, coarse-grained and fine-grained MoE
- Compared to coarse-grained MoE, fine-grained MoE has a larger number of experts and more activated experts per token, but each expert has a reduced hidden size. For the coarse-grained MoE, we select the Mixtral 8x22B [18] model and design a larger MoE named Llama3-8x70B by upcycling Llama3-70B [5] to 8 experts [13]. For the fine-grained MoE, we choose Qwen2-57B-A14B [38], which has 64 experts and 8 active experts per token, totaling 57 billion parameters with 14 billion active parameters
- To obtain a larger fine-grained MoE model, we reparameterized the Mixtral 8x22B model to 64 experts and 8 active experts per token called Mixtral-8x22B-G8T8, with each expert possessing a hidden size that is (1/8) one-eighth of the original model, by applying fine-grained upcycling [9].

## 4.2 Performance Comparison
- To alleviate the performance jitter caused by load imbalance issues in dropless training, we use **token drop training with a capacity factor equal to 1 for benchmarking**
1. FSDP [39]: A data parallelism method that shards model parameters, gradients, and opti-
mizer states across workers.
2. FSDP + EP [8]: An extension of FSDP that incorporates EP.
3. TP+EP+DP [32]: An framework combining TP and EP to fit larger MoE models across
multiple GPUs.
4. MCore with 5D-parallelism[21]: The state-of-the-art training framework for large scale
LLM models, supporting TP,EP,CP,DP and PP.
All baseline methods were implemented using the NVIDIA Megatron-Core framework.

<img width="1314" height="403" alt="Image" src="https://github.com/user-attachments/assets/9981184b-8fb7-4468-827c-fe22e4340d40" />


(이거 나중에 발표때 사용해보자)
- The experiments also reveal that fine-grained MoE models achieve lower training efficiency compared to coarse-grained MoE models across all parallelism strategies. This performance gap stems from two key factors: (아주 중요한 부분임.. comm overhead + GEMM efficiency)
 - (1) Fine-grained MoE models generate higher communication volume due to their architecture - they employ more experts and activate more experts per token, increasing communication overhead during the token dispatching process. Additionally, the smaller hidden sizes decrease GEMM efficiency.
 - (2) Fine-grained MoE models typically incorporate a larger number of local and active experts, leading to significant memory overhead for storing activations The memory requirements for managing numerous experts force the use of larger model parallelism sizes, which introduces additional communication costs and further reduces training efficiency
 - 한개 한개는 작은 expert일지라도 많이 활성화하면 사용하는 메모리 및 activation 저장에 대한 requirements는 커진다.. 그래서 모델 parallel도 더 많이 필요하게됨 -> 더 많이 parallel하게 하면 comm overhead 커짐

## 4.3 Scaling Experiments

<img width="1239" height="749" alt="Image" src="https://github.com/user-attachments/assets/997c83e4-dae0-4f13-94ee-87afa5832eeb" />

<img width="1240" height="412" alt="Image" src="https://github.com/user-attachments/assets/e9ced317-3b5a-4d85-baee-ddb1814e4a7a" />

## 4.4 Ablation Study

<img width="1239" height="432" alt="Image" src="https://github.com/user-attachments/assets/724c5a51-c2f7-451b-94ae-791a197a6c62" />

<img width="1231" height="531" alt="Image" src="https://github.com/user-attachments/assets/62bdf2ff-fb30-4ff2-90ac-5abfc1e1f414" />
fine-grained일 수록 CP를 쓸 경우 A2A에서 토큰정보를 동기화하는게 굉장히 오래걸림
- when the size of the CPxEP group exceeds 8 and spans beyond the NVLINK domain, the latency without MoE Parallel Folding increases significantly. Without MoE Parallel Folding, the EP group spans across multiple context parallelism groups, causing All-to-All communications within the EP group to traverse the lower-bandwidth inter-node network fabric
- The MoE Parallel Folding technique allows the CP and EP groups to be folded together, maximizing the use of high-bandwidth NVLink connections whenever possible -> 이 부분이 잘 이해는 안감.. 어떻게 다시 NVLink 범위내로 줄일수있을까..? 아 예시가 Mixtral 이라서 Expert가 8이라 가능한거네.. 어차피 MoE layer에서는 다시 토큰레벨로 GPU를 할당해서 폴딩해버리니까..?!


## 4.5 FP8 Training Performance
근데 아래 결과는 fine-grained MoE는 아니라서, 요새 추세에 맞는건 다른 모델로 한번 더 측정해야될 것
<img width="2234" height="538" alt="Image" src="https://github.com/user-attachments/assets/f25491c7-8c16-4e27-9d53-08350f6e4da8" />

# 5 Conclusion
- First, we propose MoE Parallel Folding, a technique that decouples the parallelization strategies of attention and MoE layers, enabling more flexible and efficient parallel configurations. This approach allows for optimal resource utilization by adapting to the distinct computational characteristics of each layer. 
- Second, we develop an efficient token-level dispatcher that supports both token-dropping and token-dropless training across five dimensions of parallelism, providing a robust foundation for complex hybrid parallelism schemes. Our experimental results demonstrate significant performance improvements across different MoE architectures, achieving up to 49.3% MFU for Mixtral 8x22B and 39.0% MFU for Qwen2-57B-A14B on H100 GPUs. The framework shows strong scaling efficiency up to 1024 GPUs and maintains high performance with sequence lengths up to 128K tokens. 
- These results validate the effectiveness of our approach in addressing the scalability challenges of large-scale MoE model training.



# Appendix

## 6.2 Workflow for Transformer Layer with MoE parallel folding

<img width="1151" height="1002" alt="Image" src="https://github.com/user-attachments/assets/3ff81f5e-a508-44d1-8ae6-57dec531e409" />

- CP관련해서는 결국 시퀀스가 4개로 쪼개져서 gpu로 나뉘는데, CP2니까 원래 2개로 쪼개지는데 그 2개가 TP 그룹내에서 TP가 입력에 따라 또 쪼개다보니 그 입력이 TP2니까 2개로 쪼깨져서 저렇게 그려지는거고, CP가 결국 Ring으로 Attn을 모으니까 서로 다른 Seq로 처음에 쪼갰던 0<->2 사이 그리고 1<->3사이에서 comm이 일어나는 것!

#### 아니 근데 TP는 모델 weight를 나누는건데 왜 컨텍스트 조각을 나누냐고

사용자님의 질문은 텐서 병렬 처리(TP)와 컨텍스트 병렬 처리(CP)가 Attention 레이어에서 데이터에 어떻게 영향을 미치는지에 대한 핵심적인 오해를 담고 있습니다. 사용자님께서 정확히 이해하신 대로, **TP(Tensor Parallelism)는 주로 모델의 가중치(weight)와 은닉 차원(hidden dimension)을 분할하는 역할**을 합니다.

하지만 Attention 레이어에서 **TP가 적용되면, 그 레이어의 입출력 데이터인 활성화 텐서(activation tensor, 즉 시퀀스 데이터) 역시 TP 축을 따라 분할되어야 합니다.**

Figure 9의 예시(Attention 컴포넌트: TP2CP2DP2, 시퀀스는 4개 GPU에 분할됨)에서 시퀀스가 $CP \times TP$의 곱인 4개의 GPU에 분할되는 이유는 다음과 같습니다.

### 1. CP의 역할: 시퀀스 길이 차원의 분할

* **CP (Context Parallelism)**는 시퀀스의 **길이 차원(sequence length dimension)**을 따라 입력 시퀀스를 작은 세그먼트(segments)로 명시적으로 분할합니다.
* CP=2라는 것은 시퀀스 길이가 2개의 조각으로 나뉘어 각기 다른 장치(ranks)에 할당됨을 의미합니다.

### 2. TP의 역할: 활성화 텐서의 은닉 차원 분할

* **TP (Tensor Parallelism)**는 Attention 레이어의 **가중치(weight)**를 **은닉 차원(hidden dimension)**을 따라 분할하여 메모리 사용량을 줄입니다.
* TP=2가 적용된 상황에서, 이 분할된 가중치로 연산을 수행하려면 해당 레이어의 **입력 텐서(활성화, activations)**도 가중치와 일치하도록 **은닉 차원을 따라 2개로 분할**되어야 합니다.
* Attention 연산은 **시퀀스 전체 수준(whole-sequence level)**에서 밀집된 계산을 수행하기 때문에, TP를 사용할 경우 **서브 시퀀스를 가진 장치들 사이에 정보 교환이 필요합니다**.

### 3. CP와 TP의 결합 효과: 시퀀스 조각의 최종 분산

TP와 CP가 동시에 사용될 때 ($TP=2$, $CP=2$):

1. **CP=2:** 시퀀스 $L$이 $L/2$ 크기의 두 조각(Context Piece A, Context Piece B)으로 나뉩니다.
2. **TP=2:** 각 컨텍스트 조각($L/2$)이 다시 은닉 차원 $H$를 따라 $H/2$ 크기의 두 조각으로 나뉩니다.

결과적으로, 하나의 전체 시퀀스를 처리하는 데이터는 $CP \times TP = 2 \times 2 = 4$ 개의 GPU에 걸쳐 분산됩니다. 시퀀스의 **어떤 조각**($Context$ Piece)이 어떤 GPU에 위치하는지는 CP와 TP의 매핑에 의해 동시에 결정되는 것입니다.

따라서 시퀀스는 **CP에 의해 길이 방향으로** 나뉘고, 이 나뉜 조각들이 **TP에 의해 은닉 차원 방향으로** 다시 나뉘면서, 전체적으로 4개의 GPU에 분산된다고 이해할 수 있습니다. 이 구조 덕분에 각 GPU는 시퀀스의 일부와 모델의 일부 파라미터를 담당하게 되어 대규모 훈련이 가능해집니다.

#### 6.2 그림에서 왜 CP Ring Exchange는 0,2번 랭크끼리 그리고 1,3번 랭크끼리 이뤄지는거야? 이때 뭐가 교환되는건데?
<html><head></head><body>제공해주신 자료의 Figure 9(MoE 병렬 폴딩이 적용된 트랜스포머 레이어의 워크플로우)에서 **CP 링 교환(CP Ring Exchange)**이 랭크 (0, 2)와 (1, 3) 사이에서 이루어지고 이때 교환되는 데이터에 대해 설명해 드리겠습니다.
<h3>1. CP Ring Exchange 그룹 (0, 2)와 (1, 3)의 구성 이유</h3>
Figure 9의 Attention 컴포넌트는 TP2CP2DP2 병렬화 매핑을 사용하며, 하나의 시퀀스는 총 4개의 GPU에 걸쳐 분할됩니다 ($TP \times CP = 4$). 랭크 0, 1, 2, 3은 이 4개의 시퀀스 분할 조각을 담당하는 GPU를 나타냅니다.
CP Ring Exchange 그룹이 (0, 2)와 (1, 3)으로 구성되는 이유는 텐서 병렬 처리(TP)와 컨텍스트 병렬 처리(CP)의 매핑 구조 때문입니다.
<ol>
<li>TP (Tensor Parallelism) 그룹: TP는 은닉 차원을 따라 가중치와 활성화 텐서를 분할합니다. 이 예시에서 $TP=2$이므로, 2개의 TP 그룹이 존재합니다.</li>
<li>CP (Context Parallelism) 그룹: CP는 시퀀스 길이 차원을 따라 데이터를 분할합니다 ($CP=2$).</li>
</ol>
랭크 구성 추론:
일반적으로 분산 훈련에서 GPU 랭크를 TP와 CP 축으로 배치할 때, (0, 2)와 (1, 3) 쌍은 다음과 같은 구조를 나타냅니다:

랭크 쌍 | 특징
-- | --
(0, 2) | 이 랭크들은 동일한 TP 랭크를 공유하지만, 다른 CP 랭크에 속합니다.
(1, 3) | 이 랭크들 역시 동일한 TP 랭크를 공유하지만, 다른 CP 랭크에 속합니다.


CP Ring Exchange의 필요성:
Attention 연산은 전체 시퀀스 수준에서 수행되지만, CP 전략은 시퀀스를 분할하여 각 랭크가 서브 시퀀스만 갖게 만듭니다. 따라서 CP Ring Exchange는 동일한 TP 그룹 내에서 (즉, 모델의 파라미터 분할은 같지만, 시퀀스 조각이 다른) 인접한 CP 랭크끼리 통신하며 시퀀스의 정보(활성화)를 교환하는 역할을 수행합니다.
이러한 방식으로 그룹을 구성하면, 분할된 시퀀스 조각들이 Attention 계산에 필요한 정보를 효율적으로 순환시키며 교환할 수 있습니다.
<h3>2. CP Ring Exchange를 통해 교환되는 데이터</h3>
CP Ring Exchange를 통해 교환되는 데이터는 **활성화 텐서(Activations) 또는 은닉 상태(Hidden States)**입니다.
<ul>
<li>Attention 연산은 전체 시퀀스 수준에서 정보를 필요로 합니다.</li>
<li>$CP=2$가 시퀀스 길이를 나누었기 때문에, 각 랭크는 시퀀스의 일부분(서브 시퀀스)만 가지고 있습니다.</li>
<li>CP Ring Exchange는 분할된 시퀀스 조각을 가진 장치들 간에 정보를 교환하여, 각 장치가 Attention 계산을 위해 필요한 시퀀스 데이터를 확보하도록 합니다.</li>
</ul>
즉, CP Ring Exchange는 분할된 시퀀스 데이터(활성화)의 조각들을 순환시키며 교환함으로써, Attention 계산의 수치적 정확성을 보장하는 핵심적인 집단 통신 연산입니다. 이 통신을 통해 서브 시퀀스를 가진 장치들 사이에 정보 교환이 발생합니다.</body></html>


## 6.3 Parallel Groups Initialization

<img width="771" height="875" alt="Image" src="https://github.com/user-attachments/assets/93794b85-5e1a-4207-a3d0-7dcb7b2da571" />

## 6.4 Details of Parallelism Mappings in Experiments

<img width="1424" height="1856" alt="Image" src="https://github.com/user-attachments/assets/c98902ac-2f4f-463e-a2e7-d3edd4ead416" />

<img width="407" height="907" alt="Image" src="https://github.com/user-attachments/assets/a6ce390a-1c5a-4e8e-96b8-62721e1d2441" />

<img width="708" height="350" alt="Image" src="https://github.com/user-attachments/assets/bfadfa92-ce21-4e28-a7a6-531ceb551eb9" />

랭크 쌍	특징
(0, 2)	이 랭크들은 동일한 TP 랭크를 공유하지만, 다른 CP 랭크에 속합니다.
(1, 3)	이 랭크들 역시 동일한 TP 랭크를 공유하지만, 다른 CP 랭크에 속합니다.

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core #47

Description

Authors

Abstract

1 Introduction

2 Related Work

2.1 Mixture of Experts

2.2 Distributed MoE Training (공부하기 좋게 정리되어있네, 참고논문 체크)

3 Method

3.1 Preliminary

3.1.1 Mixture of Experts

3.1.2 Expert Parallelism

3.2 MoE Parallel Folding

3.3 Flexible and Efficient Token Dispatcher

4 Experiments

4.1 Experimental Setup

4.2 Performance Comparison

4.3 Scaling Experiments

4.4 Ablation Study

4.5 FP8 Training Performance

5 Conclusion

Appendix

6.2 Workflow for Transformer Layer with MoE parallel folding

아니 근데 TP는 모델 weight를 나누는건데 왜 컨텍스트 조각을 나누냐고

1. CP의 역할: 시퀀스 길이 차원의 분할

2. TP의 역할: 활성화 텐서의 은닉 차원 분할

3. CP와 TP의 결합 효과: 시퀀스 조각의 최종 분산

6.2 그림에서 왜 CP Ring Exchange는 0,2번 랭크끼리 그리고 1,3번 랭크끼리 이뤄지는거야? 이때 뭐가 교환되는건데?

1. CP Ring Exchange 그룹 (0, 2)와 (1, 3)의 구성 이유

2. CP Ring Exchange를 통해 교환되는 데이터

6.3 Parallel Groups Initialization

6.4 Details of Parallelism Mappings in Experiments

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions