
Commit fee921d

Revamp future plan into What's Next with active roadmap tone
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
1 parent 379e6b8

1 file changed: blogs/muon-optimizer/README.md
Lines changed: 8 additions & 7 deletions
@@ -96,15 +96,16 @@ We measured peak GPU memory during finetuning Qwen2.5-3B on tatsu-lab/alpaca usi
 
 Muon reduces per-GPU memory by approximately 3 GiB (9%) compared to AdamW. The savings come entirely from optimizer states: Muon parameters store one momentum buffer (4 bytes) instead of Adam's two (8 bytes). However, because optimizer states are only one component of total GPU memory (alongside model weights, gradients, and activations), the end-to-end reduction is modest. For larger models or tighter memory budgets, this 9% savings could make the difference between fitting a workload on-device versus requiring CPU offloading.
 
-## Future plan
-Muon optimizer is getting more and more attention, and is verified by production-level open LLM model such as Kimi-K2 which has 1T weights. This makes Muon a strong second choice and a potential replacement of Adam optimizer. To make Muon optimizer more accessible in production environment, the following features are needed:
+## What’s Next
+Muon is rapidly gaining traction in the community, and production-level adoption like Kimi-K2 (1T parameters) signals that it is a serious contender to replace Adam as the default optimizer for large-scale training. We are actively building out full Muon support in DeepSpeed, with a series of improvements already in flight:
 
-- [x] Muon optimizer with ZeRO stage 3
-- [ ] CPU Offloading support
-- [ ] MuonClip support
-- [x] Performance optimization with Gram-Schmidt based Newton-Schulz iteration (in review)
+- [x] **ZeRO Stage 3 support** — merged
+- [x] **Gram-Schmidt based Newton-Schulz iteration** — a faster orthogonalization kernel, in review
+- [ ] **CPU Offloading** — partially done
+- [ ] **ZeRO Stage 2 support** — work in progress
+- [ ] **MuonClip** — the variant used by Kimi-K2, planned
 
-If you have thoughts, feedback and contribution on Muon optimizer, welcome to start an issue for discussion, or submit a PR to DeepSpeed. Let’s make Muon optimizer rock solid and lightning fast in DeepSpeed!
+If you have thoughts, feedback, or contributions on the Muon optimizer, feel free to open an issue for discussion or submit a PR to DeepSpeed. Let’s make Muon rock solid and lightning fast in DeepSpeed!
 
 ## Contributors
 This work was contributed by Wang, Zhipeng (@PKUWZP), Chi McIsaac (@qimcis), and Ma, Guokai (@delock).
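
As a sanity check on the ~3 GiB (9%) per-GPU figure in the memory paragraph above: Adam keeps two fp32 states per parameter (8 bytes) where Muon keeps one (4 bytes), and under ZeRO stage 3 those states are sharded across the data-parallel group. A minimal back-of-envelope sketch, assuming roughly 3.1B trainable parameters and a 4-GPU world size (both assumptions, not figures from the commit):

```python
# Back-of-envelope optimizer-state memory per GPU under ZeRO-3 sharding.
# Assumptions (not from the commit): ~3.1e9 trainable parameters, fp32
# optimizer states, and a world size of 4 GPUs sharing the states evenly.
GIB = 1024**3
num_params = 3.1e9   # assumed parameter count for a 3B-class model
world_size = 4       # assumed number of GPUs

adam_bytes_per_param = 2 * 4  # exp_avg + exp_avg_sq, fp32
muon_bytes_per_param = 1 * 4  # single momentum buffer, fp32

adam_per_gpu = num_params * adam_bytes_per_param / world_size / GIB
muon_per_gpu = num_params * muon_bytes_per_param / world_size / GIB

print(f"Adam optimizer states: {adam_per_gpu:.2f} GiB/GPU")
print(f"Muon optimizer states: {muon_per_gpu:.2f} GiB/GPU")
print(f"Savings:               {adam_per_gpu - muon_per_gpu:.2f} GiB/GPU")
```

With these assumptions the savings come out to about 2.9 GiB per GPU, consistent with the roughly 3 GiB quoted in the diff.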
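
For context on the Gram-Schmidt based Newton-Schulz checklist item: Muon's distinguishing step is orthogonalizing each 2-D update matrix, typically via a few Newton-Schulz iterations rather than an explicit SVD. Below is a minimal PyTorch sketch of the standard quintic Newton-Schulz iteration, with coefficients taken from Keller Jordan's public Muon reference implementation; the Gram-Schmidt based variant under review is a performance optimization of this step and is not reproduced here.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G.

    Runs the quintic Newton-Schulz iteration, which pushes the singular
    values of G toward 1 without computing an SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference implementation
    X = G.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation
    X = X / (X.norm() + eps)  # scale so the spectral norm is <= 1 and the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

In Muon this is applied to the momentum-accumulated gradient of each 2-D weight matrix before the parameter update; scalar and 1-D parameters are typically handled by an Adam-style path instead.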
