We measured peak GPU memory during finetuning Qwen2.5-3B on tatsu-lab/alpaca usi
Muon reduces per-GPU memory by approximately 3 GiB (9%) compared to AdamW. The savings come entirely from optimizer states: Muon parameters store one momentum buffer (4 bytes) instead of Adam's two (8 bytes). However, because optimizer states are only one component of total GPU memory (alongside model weights, gradients, and activations), the end-to-end reduction is modest. For larger models or tighter memory budgets, this 9% savings could make the difference between fitting a workload on-device versus requiring CPU offloading.
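The arithmetic behind that savings can be sketched in a few lines. This is a back-of-envelope estimate, not the blog's measurement methodology; the fp32 state dtype and the 4-way ZeRO sharding of optimizer states are illustrative assumptions chosen so the numbers land near the observed ~3 GiB.

```python
# Back-of-envelope optimizer-state memory for a 3B-parameter model.
# Assumptions (illustrative, not from the blog's exact setup): fp32
# optimizer states (4 bytes each), fully sharded across 4 GPUs by ZeRO.
def optimizer_state_gib(n_params: float, states_per_param: int,
                        bytes_per_state: int = 4, num_gpus: int = 4) -> float:
    """Per-GPU optimizer-state footprint in GiB."""
    return n_params * states_per_param * bytes_per_state / num_gpus / 2**30

adamw = optimizer_state_gib(3e9, 2)  # Adam keeps momentum + variance
muon = optimizer_state_gib(3e9, 1)   # Muon keeps momentum only
print(f"AdamW {adamw:.2f} GiB/GPU, Muon {muon:.2f} GiB/GPU, "
      f"saved {adamw - muon:.2f} GiB/GPU")  # saved ~2.79 GiB under these assumptions
```

Under these assumptions the per-GPU difference comes out to about 2.8 GiB, consistent with the roughly 3 GiB reduction reported above.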
## What’s Next
Muon is rapidly gaining traction in the community, and production-level adoption like Kimi-K2 (1T parameters) signals that it is a serious contender to replace Adam as the default optimizer for large-scale training. We are actively building out full Muon support in DeepSpeed, with a series of improvements already in flight:
- [x] **ZeRO Stage 3 support**: merged
- [x] **Gram-Schmidt based Newton-Schulz iteration**: a faster orthogonalization kernel, in review
- [ ] **CPU Offloading**: partially done
- [ ] **ZeRO Stage 2 support**: work in progress
- [ ] **MuonClip**: the variant used by Kimi-K2, planned
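The Newton-Schulz iteration named in the list above is the orthogonalization step at Muon's core: it maps the gradient matrix to a nearby matrix whose singular values are all close to 1. The sketch below is a NumPy illustration of the widely used quintic iteration; the coefficients follow the common reference Muon implementation and are an assumption here, not DeepSpeed's exact kernel (which, per the roadmap item, uses a Gram-Schmidt based variant).

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5,
                                eps: float = 1e-7) -> np.ndarray:
    """Approximately orthogonalize a gradient matrix with a quintic
    Newton-Schulz iteration, as in common Muon implementations.
    Coefficients are the usual reference values; treat this as an
    illustrative sketch, not DeepSpeed's production kernel."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalization bounds the spectral norm by 1, which the
    # iteration needs to converge.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # work with the wide orientation so x @ x.T is small
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a handful of steps the singular values land near 1, so the update direction depends on the gradient's row/column spaces rather than its magnitude; this is what distinguishes Muon's update from plain momentum SGD.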
If you have thoughts, feedback, or contributions on the Muon optimizer, feel free to open an issue for discussion or submit a PR to DeepSpeed. Let's make Muon rock solid and lightning fast in DeepSpeed!
## Contributors
This work was contributed by Zhipeng Wang (@PKUWZP), Chi McIsaac (@qimcis), and Guokai Ma (@delock).