feat(optimizers): integrate flash-muon with runtime selection by Yui-Koi · Pull Request #39 · Open-Superintelligence-Lab/blueberry-llm

Yui-Koi · 2025-10-20T09:26:22Z

Add flash-muon as git submodule for optimized Newton-Schulz iterations.
Implements config-based selection between Muon and FlashMuon optimizers.

Technical Details:

Flash-muon reduces NS5 matmul FLOPs by ~50% via fused CUDA kernels
Speedup is dimension and hardware dependent (1.2-2× on optimizer step)
Only affects 2D weight matrices (muon params), not embeddings/norms (adamw)
Expected real training speedup ~1-2% based on param distribution and profiling estimates.
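
To make the Muon/AdamW split above concrete, here is a minimal sketch of how parameters might be partitioned. The helper names (`is_muon_param`, `split_params`) and the name-based exclusion rule are assumptions for illustration; the actual check in trainer.py may differ.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def is_muon_param(name: str, shape: Shape) -> bool:
    """Hypothetical rule: 2D weights go to Muon unless they are embeddings or norms."""
    # Assumed naming convention for excluded parameters; not taken from the repo.
    excluded = ("embed", "norm")
    return len(shape) == 2 and not any(tag in name for tag in excluded)


def split_params(shapes: Dict[str, Shape]) -> Tuple[List[str], List[str]]:
    """Return (muon_names, adamw_names), preserving the input order."""
    muon: List[str] = []
    adamw: List[str] = []
    for name, shape in shapes.items():
        (muon if is_muon_param(name, shape) else adamw).append(name)
    return muon, adamw


# Illustrative parameter names and shapes (made up for this example).
example = {
    "token_embedding.weight": (50000, 512),
    "blocks.0.attn.qkv.weight": (1536, 512),
    "blocks.0.norm.weight": (512,),
    "blocks.0.mlp.up.weight": (2048, 512),
}
muon_names, adamw_names = split_params(example)
```

Only the names in `muon_names` would see any benefit from the faster Newton-Schulz step; everything in `adamw_names` is unaffected by this PR.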

Implementation:

Added as git submodule instead of pip package for upstream tracking
Config flag use_flash_muon (default: True) enables runtime selection
Modified trainer.py only; experiments untouched for reproducibility
Ternary operator selects optimizer class based on config
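
The selection logic described above could look roughly like the following. This is a sketch only: `TrainConfig`, `select_optimizer_class`, and the empty `Muon`/`FlashMuon` classes are stand-ins invented here, and only the `use_flash_muon` flag and its default come from this PR.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Flag name and default value as stated in the PR description.
    use_flash_muon: bool = True


class Muon:
    """Stand-in for the existing Muon optimizer class."""


class FlashMuon:
    """Stand-in for the optimizer class provided by the flash-muon submodule."""


def select_optimizer_class(config: TrainConfig) -> type:
    # Ternary selection: one flag decides which class handles the Muon params.
    return FlashMuon if config.use_flash_muon else Muon


default_cls = select_optimizer_class(TrainConfig())
fallback_cls = select_optimizer_class(TrainConfig(use_flash_muon=False))
```

Keeping the choice in a single expression means the rest of the trainer can construct the optimizer the same way regardless of which class was picked.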

Performance Impact:
Wall-clock speedup varies by GPU and matrix dimension:
H800: 0.9-1.56× (overhead at small dims, gains at large)
H20: 1.68-2.03× (consistent improvement)
A100: 1.19-1.78× (solid gains)
4090: 1.0-1.90× (best at large dimensions)

The optimizer step is ~3-5% of training time (4% taken as the midpoint), and Muon handles ~75.7% of params (calculated from the MoE model config). Assuming a median speedup of 1.6×, the time spent in the Muon step drops by 1 − 1/1.6 = 0.375. The theoretical end-to-end gain is then 0.04 × 0.757 × 0.375 ≈ 1.14% faster training.
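
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. The function name is made up for this sketch, and it keeps the same simplifying assumption as the estimate itself, namely that Muon's share of optimizer time equals its share of parameters.

```python
def estimated_time_saved(optimizer_fraction: float,
                         muon_fraction: float,
                         step_speedup: float) -> float:
    """Fraction of total training time saved by a faster Muon step.

    optimizer_fraction: share of training time spent in the optimizer step
    muon_fraction:      share of that step attributed to Muon params
                        (approximated here by the parameter share)
    step_speedup:       speedup factor of the Muon step, e.g. 1.6
    """
    if step_speedup <= 0:
        raise ValueError("step_speedup must be positive")
    # A 1.6x speedup removes 1 - 1/1.6 = 37.5% of the step's time.
    saved_per_step = 1.0 - 1.0 / step_speedup
    return optimizer_fraction * muon_fraction * saved_per_step


# Sweep the 3-5% range quoted for the optimizer step.
for fraction in (0.03, 0.04, 0.05):
    saved = estimated_time_saved(fraction, 0.757, 1.6)
    print(f"optimizer share {fraction:.0%}: ~{saved:.2%} of training time saved")
```

Across the 3-5% range this lands between just under 1% and roughly 1.4%, with the 4% midpoint giving about 1.1%. A speedup below 1× (as in the low end of the H800 range) makes the result negative, i.e. a slowdown.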

Compatibility:

Experiments unchanged to preserve ablation reproducibility
Config fallback: set use_flash_muon=False if issues arise
Fresh clones require: git submodule update --init --recursive
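
Since a fresh clone without `git submodule update --init --recursive` leaves the submodule directory empty, one option would be a guard that falls back to plain Muon in that case. The sketch below is not part of this PR; the directory name `flash-muon` and both helper names are assumptions.

```python
import tempfile
from pathlib import Path


def submodule_is_populated(repo_root: Path, submodule_dir: str = "flash-muon") -> bool:
    """True if the (assumed) submodule directory exists and contains files."""
    path = repo_root / submodule_dir
    return path.is_dir() and any(path.iterdir())


def resolve_use_flash_muon(requested: bool, repo_root: Path) -> bool:
    """Honor the config flag only when the submodule has been checked out."""
    return requested and submodule_is_populated(repo_root)


# Demonstration against a throwaway directory instead of a real clone.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing = resolve_use_flash_muon(True, root)    # no submodule directory
    (root / "flash-muon").mkdir()
    empty = resolve_use_flash_muon(True, root)      # directory exists but is empty
    (root / "flash-muon" / "setup.py").write_text("")
    populated = resolve_use_flash_muon(True, root)  # submodule checked out
    disabled = resolve_use_flash_muon(False, root)  # flag turned off in config
```

With a guard like this, `use_flash_muon=True` would silently degrade to Muon on an uninitialized clone; whether a silent fallback or a hard error is preferable is a design choice for the maintainers.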

Refs: https://github.com/nil0x9/flash-muon
Benchmarks: https://github.com/nil0x9/flash-muon#benchmarks

Breaking Changes: None

PS: I didn't have the GPU time needed to benchmark this properly, so the gains above are estimates. Based on my calculations, it should still be a net positive improvement.

Yui-Koi added 2 commits October 20, 2025 08:53

feat: add flash muon submodule alongside muon impl

54ca9c3

feat: add flash muon toggle via use_flash_muon flag

3c61a02

Yui-Koi marked this pull request as draft October 20, 2025 09:28

Yui-Koi added 2 commits October 20, 2025 15:46

add flash muon to path

5e45acf

fix: add world size and rank for flash muon, moved check for muon param to a helper func

e3728c1

Yui-Koi force-pushed the feat/add-flash-muon branch from 969b9b5 to e3728c1 Compare October 20, 2025 16:44

Yui-Koi added 2 commits October 20, 2025 17:05

add batch unpacking util

899bce5

add batch unpacking util

0aef920

Yui-Koi wants to merge 6 commits into Open-Superintelligence-Lab:main from Yui-Koi:feat/add-flash-muon
