
Commit 7abc53a

Update Muon blog with measured convergence and memory data
Replace placeholder claims with actual experiment results:

- Add lr sweep results for both AdamW and Muon optimizers
- Report measured GPU memory: AdamW 34.5 GiB vs Muon 31.4 GiB (9% savings)
- Remove old convergence chart (adamw_vs_muon_3b.png)
- Fix inaccurate claims (Muon 19% better, Adam OOM on 2xA100)
- Add hybrid optimizer explanation and separate lr config docs

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>

blogs/muon-optimizer/README.md

The Muon optimizer has been gaining momentum, with growing adoption in the community and in large foundation models such as Kimi-K2-Thinking. DeepSpeed now supports the Muon optimizer.

## What is Muon optimizer?
Muon is an optimizer designed for the hidden 2D weights of a neural network. It takes the gradient of a weight, computes its momentum, applies Newton-Schulz iterations to orthogonalize the momentum matrix, and then uses this orthogonalized matrix to update the weight[1](https://kellerjordan.github.io/posts/muon/). Because Muon maintains only one momentum buffer (versus Adam's two), it uses less memory for optimizer states. It is used by Keller Jordan's mod of NanoGPT[2](https://github.com/KellerJordan/modded-nanogpt) and Andrej Karpathy's nanochat[3](https://github.com/karpathy/nanochat), and a variant of Muon (MuonClip) is used by the production-level LLM Kimi-K2 from MoonShot[4](https://arxiv.org/pdf/2507.20534).
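
For intuition, the sketch below shows the core orthogonalization step. It follows the quintic Newton-Schulz iteration from Keller Jordan's Muon post (the coefficients and spectral-norm pre-scaling come from that reference implementation); it is illustrative, not DeepSpeed's exact code:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D momentum matrix G."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)  # scale spectral norm below 1 so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:              # work in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X       # drives the singular values of X toward 1
    if transposed:
        X = X.T
    return X.to(G.dtype)
```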

## Muon Optimizer support in DeepSpeed
One challenge in adding the Muon optimizer to DeepSpeed is that existing optimizers (SGD, Adam) operate on gradients as flattened buffers, so Muon cannot simply be swapped in at the same place: it needs each gradient in its original 2D shape, which is lost once the buffers are flattened. We therefore move the Muon update into the get_flat_partition function of the stage 1 and 2 DeepSpeedZeroOptimizer, where per-parameter gradients are still unflattened, so the Muon update can be applied directly.
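
To see the problem concretely, here is a toy illustration in plain PyTorch (hypothetical shapes, not DeepSpeed code):

```python
import torch

# Two 2D weights fused into one flat buffer, as ZeRO stages 1/2 do
# conceptually; once fused, the per-parameter 2D shapes are gone.
w1, w2 = torch.randn(4, 8), torch.randn(3, 8)
flat = torch.cat([w1.reshape(-1), w2.reshape(-1)])

# Muon's orthogonalization needs the 2D matrix, so the update must run
# before flattening, or the flat buffer must be re-viewed per parameter:
g1 = flat[: w1.numel()].view_as(w1)  # recover the 2D view of the first weight
```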

Muon updates apply to hidden 2D weights. During model engine initialization, we scan the model and tag a parameter with 'use_muon' if and only if it is 2D and hidden. When the Muon optimizer is in use, any parameter tagged 'use_muon' has its weight updated by Muon, as sketched below.
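
A minimal sketch of such a tagging pass (the name-based exclusion list is a plausible heuristic for "hidden", not DeepSpeed's exact rule):

```python
import torch

def tag_muon_params(model: torch.nn.Module) -> None:
    """Mark 2D hidden weights for Muon; everything else keeps Adam."""
    for name, p in model.named_parameters():
        # Embeddings and the lm_head are 2D but not "hidden" weights,
        # so they are excluded from Muon updates.
        is_hidden = not any(k in name for k in ("embed", "lm_head"))
        p.use_muon = (p.ndim == 2) and is_hidden
```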
Note that Muon is a hybrid optimizer: it uses Muon updates only for 2D hidden weights and falls back to Adam for all other parameters (embeddings, layer norms, biases, lm_head). The DeepSpeed config supports separate learning rates via `muon_lr` (for Muon parameters) and `adam_lr` (for Adam parameters).
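
A DeepSpeed config along these lines selects the hybrid optimizer with separate learning rates (a sketch: the `muon_lr`/`adam_lr` fields are as described above, while the optimizer type string and the other values are illustrative and should be checked against the DeepSpeed docs):

```python
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "Muon",  # assumed name; verify against the DeepSpeed docs
        "params": {
            "muon_lr": 5e-3,   # 2D hidden weights (Muon updates)
            "adam_lr": 5e-6,   # embeddings, norms, biases, lm_head (Adam)
            "momentum": 0.95,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
}
```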

## Running DeepSpeed finetune with Muon optimizer
The DeepSpeed finetune demo[5](https://github.com/delock/deepspeed_finetune_demo) exercises different DeepSpeed training features and compares their performance in a single place. You can use it to finetune LLM models with the Muon optimizer; see the repository for setup and run instructions.

## Muon Optimizer Convergence Experiment Results

We compared the Muon optimizer with the AdamW optimizer by finetuning a Qwen2.5-3B model on the tatsu-lab/alpaca dataset. To ensure a fair comparison, we swept learning rates for the two optimizers independently and report results at each optimizer's best configuration.

**Training Configuration:**
- Model: Qwen2.5-3B
- Dataset: tatsu-lab/alpaca
- ZeRO Stage 2, bf16
- Batch size: 32 (4 per GPU), 8 GPUs (A100 40GB)
- 1 epoch (~1460 steps), eval every 100 steps
- LR schedule: constant (no warmup, no decay)
- Gradient clipping: 1.0

**AdamW Optimizer Hyperparameters:**
- betas: (0.9, 0.999)
- eps: 1e-8
- weight_decay: 0.01

**Muon Optimizer Hyperparameters:**
- momentum: 0.95 (Muon parameters)
- betas: (0.9, 0.999) (Adam parameters)
- eps: 1e-8
- weight_decay: 0.01

**Learning Rate Sweep Results:**

For AdamW, we swept lr across {1e-6, 2e-6, 5e-6, 1e-5}. For Muon, we first swept muon_lr across {1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2} with adam_lr=2e-6, then swept adam_lr across {2e-6, 5e-6, 1e-5} with muon_lr=5e-3.

| Optimizer | Learning Rate | Final Eval Loss |
|-----------|---------------|-----------------|
| AdamW | lr=1e-5 | 1.2404 |
| AdamW | lr=5e-6 | 1.2001 |
| **AdamW** | **lr=2e-6** | **1.1842** |
| AdamW | lr=1e-6 | 1.1883 |
| Muon | muon_lr=5e-3, adam_lr=2e-6 | 1.1996 |
| **Muon** | **muon_lr=5e-3, adam_lr=5e-6** | **1.1966** |
| Muon | muon_lr=5e-3, adam_lr=1e-5 | 1.1970 |

**Convergence Trajectory (Best Configuration per Optimizer):**

| Step | AdamW (lr=2e-6) | Muon (muon_lr=5e-3, adam_lr=5e-6) |
|------|-----------------|-----------------------------------|
| 0 | 1.3278 | 1.3300 |
| 100 | 1.2205 | 1.2814 |
| 200 | 1.2101 | 1.2300 |
| 500 | 1.1969 | 1.2107 |
| 1000 | 1.1894 | 1.2009 |
| 1400 | **1.1842** | **1.1966** |

In this finetuning experiment, AdamW achieves a slightly lower final eval loss (1.1842) than Muon (1.1966), and it also converges faster in the early training steps. This is consistent with the observation that Muon's strength has been demonstrated primarily in pretraining settings; finetuning a pretrained model on a small dataset may not fully benefit from Muon's orthogonalization approach.

## Muon Optimizer Memory Savings

The Muon optimizer uses less memory for optimizer states than Adam because it maintains one momentum buffer per parameter instead of two (first and second moments).

### Memory Usage Comparison

Note that Muon is a hybrid optimizer: 2D hidden weights use Muon (1 buffer), while the remaining parameters (embeddings, layer norms, lm_head) still use Adam (2 buffers). The actual memory savings depend on the fraction of parameters that are 2D hidden weights. For typical transformer models, 70-80% of parameters are 2D hidden weights, so optimizer state memory is reduced by roughly 35-40%. However, because total GPU memory also includes model weights, gradients, and activations, the end-to-end memory reduction is smaller (see measured results below).

| Optimizer | State Buffers per Param | Memory per Parameter |
|-----------|-------------------------|----------------------|
| Adam | 2 (m, v) | 8 bytes |
| Muon | 1 (momentum) | 4 bytes |
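
The back-of-envelope arithmetic behind the "roughly 35-40%" estimate, as a short script (the 3B parameter count and the 80% hidden-2D fraction are illustrative assumptions):

```python
def optimizer_state_bytes(n_params: float, hidden_2d_frac: float) -> tuple[float, float]:
    """Return (adam_bytes, muon_bytes) for fp32 optimizer states."""
    adam = n_params * 8  # two fp32 buffers (m, v) per parameter
    muon = (n_params * hidden_2d_frac * 4            # one momentum buffer (Muon)
            + n_params * (1 - hidden_2d_frac) * 8)   # non-hidden params keep Adam
    return adam, muon

adam, muon = optimizer_state_bytes(3e9, 0.80)
print(f"Adam:  {adam / 2**30:.1f} GiB")   # ~22.4 GiB
print(f"Muon:  {muon / 2**30:.1f} GiB")   # ~13.4 GiB
print(f"Saved: {1 - muon / adam:.0%}")    # ~40%
```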

### Measured GPU Memory: Qwen2.5-3B Finetuning

We measured peak GPU memory while finetuning Qwen2.5-3B on tatsu-lab/alpaca with the same 8xA100 (40GB) configuration described above (batch size 32, ZeRO Stage 2, bf16).

| Optimizer | Peak Memory per GPU | Savings vs AdamW |
|-----------|---------------------|------------------|
| AdamW | 34.5 GiB | baseline |
| Muon | 31.4 GiB | 9% |
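
Peak per-GPU memory of this kind can be read from PyTorch's allocator statistics; a minimal sketch (this uses the standard `torch.cuda` API, though it is not necessarily how the numbers above were collected):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the training steps you want to profile ...

peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"Peak GPU memory: {peak_gib:.1f} GiB")
```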

Muon reduces per-GPU memory by approximately 3 GiB (9%) compared to AdamW. The savings come entirely from optimizer states: Muon parameters store one momentum buffer (4 bytes per parameter) instead of Adam's two (8 bytes). However, because optimizer states are only one component of total GPU memory (alongside model weights, gradients, and activations), the end-to-end reduction is modest. For larger models or tighter memory budgets, this 9% savings could make the difference between fitting a workload on-device and requiring CPU offloading.

## Future plan

The Muon optimizer is attracting more and more attention and has been validated by production-level open LLMs such as Kimi-K2, which has 1T weights. This makes Muon a strong second choice and a potential replacement for the Adam optimizer. To make Muon more accessible in production environments, the following features are needed:

- [ ] Muon optimizer with ZeRO stage 3
- [ ] CPU offloading support
- [ ] MuonClip support
- [ ] Performance optimization to make Muon optimizer more efficient

If you have thoughts, feedback, or contributions on the Muon optimizer, you are welcome to open an issue for discussion or submit a PR to DeepSpeed. Let's make the Muon optimizer rock solid and lightning fast in DeepSpeed!

## Contributors

This work was contributed by Wang, Zhipeng (@PKUWZP), Chi McIsaac (@qimcis), and Ma, Guokai (@delock).