Split K for SharedMoE fusion #70

Open
jianan-gu wants to merge 9 commits into mingfeima:cpu_opt_ww11 from jianan-gu:jianan/splitk_sharedmoe
@jianan-gu jianan-gu commented Apr 24, 2025

Performance on GNR_DDR (so far)
The current split-K number is 4.

| Prefill, 1K (total/avg/layers) | BS1 | BS4 |
|---|---|---|
| Baseline | 105.897ms/1.826ms/58 | 396.073ms/6.829ms/58 |
| Splitk=2 | 100.539ms/1.733ms/58 | 373.898ms/6.447ms/58 |
| Splitk=4 | 102.118ms/1.761ms/58 | 354.326ms/6.109ms/58 |
| Splitk=8 | 107.627ms/1.856ms/58 | 358.720ms/6.185ms/58 |

| Decode, 1K (total/avg/layers) | BS1 | BS4 |
|---|---|---|
| Baseline | 4.904ms/84.549us/58 | 5.270ms/90.864us/58 |
| Splitk=2 | 4.491ms/77.430us/58 | 5.524ms/95.247us/58 |
| Splitk=4 | 4.381ms/75.539us/58 | 5.001ms/86.217us/58 |
| Splitk=8 | 4.532ms/78.137us/58 | 5.079ms/87.572us/58 |
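
For readers unfamiliar with the term, split-K partitions the reduction (K) dimension of a GEMM into chunks, computes a partial product per chunk, and sums the partial results. The pure-Python sketch below illustrates only that idea under simple assumptions; it is not the kernel in this PR, and the names `matmul_range` and `split_k_matmul` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_range(a: Matrix, b: Matrix, k_start: int, k_end: int) -> Matrix:
    """Partial product of a (M x K) and b (K x N), reducing only over k in [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for k in range(k_start, k_end):
            a_ik = a[i][k]
            for j in range(n):
                out[i][j] += a_ik * b[k][j]
    return out


def split_k_matmul(a: Matrix, b: Matrix, num_splits: int) -> Matrix:
    """Split the K dimension into num_splits chunks and sum the partial products.

    In a real kernel each chunk could be handled by a different thread; here the
    chunks are processed one after another for clarity.
    """
    k_total = len(b)
    chunk = (k_total + num_splits - 1) // num_splits  # ceil(K / num_splits)
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for s in range(num_splits):
        k_start = min(s * chunk, k_total)
        k_end = min(k_start + chunk, k_total)  # trailing chunks may be empty
        partial = matmul_range(a, b, k_start, k_end)
        # Reduction step: accumulate this chunk's partial result.
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out
```

A larger split count exposes more parallel work when M is small (for example at BS1), but adds reduction overhead, which may be one reason the numbers above do not improve monotonically with the split count.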

DeepSeek R1 (DS R1) test on GNR_MCR

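Reference

| Batch | Token | Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|
| BS4 | First | sgl-kernel::shared_expert_cpu | 4.68% | 337.369ms | 4.69% | 337.822ms | 5.825ms | 58 |
| BS4 | Next | sgl-kernel::shared_expert_cpu | 6.31% | 4.541ms | 6.44% | 4.633ms | 79.880us | 58 |
| BS1 | First | sgl-kernel::shared_expert_cpu | 5.24% | 92.125ms | 5.25% | 92.237ms | 1.590ms | 58 |
| BS1 | Next | sgl-kernel::shared_expert_cpu | 6.61% | 3.720ms | 6.68% | 3.763ms | 64.884us | 58 |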


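Improved

| Batch | Token | Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|
| BS4 | First | sgl-kernel::shared_expert_cpu | 4.46% | 272.729ms | 4.46% | 273.136ms | 4.709ms | 58 |
| BS4 | Next | sgl-kernel::shared_expert_cpu | 5.72% | 4.086ms | 5.87% | 4.189ms | 72.229us | 58 |
| BS1 | First | sgl-kernel::shared_expert_cpu | 5.08% | 87.134ms | 5.09% | 87.346ms | 1.506ms | 58 |
| BS1 | Next | sgl-kernel::shared_expert_cpu | 6.33% | 3.544ms | 6.49% | 3.636ms | 62.695us | 58 |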
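
To make the Reference vs. Improved comparison easier to read, the per-call average times reported for `sgl-kernel::shared_expert_cpu` can be turned into speedup ratios. The snippet below is only a convenience sketch; the values are copied from the profiler rows above (converted to microseconds), and the helper name `speedups` is hypothetical.

```python
# Average time per call in microseconds, copied from the profiler output above.
reference = {
    ("BS4", "First"): 5825.0,
    ("BS4", "Next"): 79.880,
    ("BS1", "First"): 1590.0,
    ("BS1", "Next"): 64.884,
}
improved = {
    ("BS4", "First"): 4709.0,
    ("BS4", "Next"): 72.229,
    ("BS1", "First"): 1506.0,
    ("BS1", "Next"): 62.695,
}


def speedups(ref: dict, new: dict) -> dict:
    """Return ref/new per configuration; values above 1.0 mean the new run is faster."""
    return {key: ref[key] / new[key] for key in ref}


for (batch, token), ratio in speedups(reference, improved).items():
    print(f"{batch} {token}: {ratio:.3f}x")
```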
    

All accuracy test cases pass.

Python test script: https://github.com/mingfeima/sgl-cpu-tests/blob/main/test_shared_experts.py

run_single_test(1, 704, 7168, 1, torch.bfloat16)
run_single_test(1, 1024, 1024, 1, torch.bfloat16)
run_single_test(4, 704, 7168, 1, torch.bfloat16)
run_single_test(4, 1024, 1024, 1, torch.bfloat16)
run_single_test(128, 704, 7168, 1, torch.bfloat16)
run_single_test(128, 1024, 1024, 1, torch.bfloat16)

@jianan-gu jianan-gu marked this pull request as draft April 24, 2025 01:43
@jianan-gu jianan-gu marked this pull request as ready for review April 24, 2025 05:11