Reproducing the LoRA-Without-Regret blog from scratch (no HF) and investigating the 10x LR ratio.
I conducted preliminary experiments to investigate the origin of the 10× learning rate ratio observed in LoRA training.
⭐ Note: the insights below are specific to my minimal setups, and computational constraints prevent testing whether they generalize to scale; nonetheless, they remain valuable and bring us closer to a better understanding of LoRA and FullFT LR dynamics.
It is important to note that the consistent ratio reported in the blog stems from the 1/r and alpha scaling factors. We know that 1/r implicitly scales the learning rate by the layer width determined by rank r, ensuring that update velocity remains invariant to width scaling, as highlighted by Yang et al. in their μP approach.
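The scaling argument above can be sketched numerically (illustrative names, not the repo's code): because the adapter contributes `(alpha/r) * B @ A` to the layer, the multiplier `alpha/r` directly rescales the magnitude of the adapter's effective update, acting like a learning-rate multiplier on the adapter path.

```python
import torch

def lora_delta(A, B, alpha, r):
    # Effective weight update contributed by a LoRA adapter:
    # delta_W = (alpha / r) * B @ A
    return (alpha / r) * (B @ A)

r, d = 8, 32
A = torch.randn(r, d)
B = torch.randn(d, r)

# Doubling alpha doubles the update magnitude, which behaves
# like doubling the learning rate on the adapter path.
small = lora_delta(A, B, alpha=8, r=r)
big = lora_delta(A, B, alpha=16, r=r)
assert torch.allclose(big, 2 * small)
```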
The data shows that performance, as measured by either test accuracy or perplexity (depending on the task), is influenced by adapter initialization, the alpha/r scale factor, and the regime defined by the nature of the task.
1) Standard configuration from Lora-Without-Regret blog:
- `A` initialized using a uniform distribution and `B` set to zero
- We use a constant `alpha` value of `32`, factored by `1/r`
- We set a fixed `lr` (no scheduler) used by both adapters
- We train `distilbert-base-uncased` on a `10k` subset of AG-News (classification)
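The standard configuration above can be sketched as a single adapted layer (a minimal illustration under the stated initialization; `LoRALinear` is not the repo's actual class):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank adapter."""
    def __init__(self, base: nn.Linear, r: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        # A ~ uniform (Kaiming-style), B = 0: the adapter starts as a no-op.
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r  # constant alpha factored by 1/r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=32, alpha=32.0)
x = torch.randn(4, 64)
# At init the adapter contributes nothing because B = 0.
assert torch.allclose(layer(x), layer.base(x))
```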
The results show that the optimal learning rate for all ranks is 10x higher than FullFT with test accuracy peaking at rank 32.
2) Different regime:
Now, we train distilgpt2 on the WikiText dataset using the same configuration and amount of data. We rely on test perplexity (exp(NLL)) for benchmarking.
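For reference, the perplexity metric used here is just the exponential of the mean per-token negative log-likelihood (a one-line sketch):

```python
import math

def perplexity(mean_nll: float) -> float:
    # Test perplexity is the exponential of the mean
    # negative log-likelihood per token.
    return math.exp(mean_nll)

print(perplexity(3.0))  # e^3 ≈ 20.09
```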
We observe that performance here peaks at a higher rank, 128, compared to the previous setup, revealing that the optimal rank is regime-dependent for the same configuration.
For the rest of the experiments, we use DistilBERT and AG-News as the baseline for all comparisons.
Here, we increase the multiplier alpha/r to scale the adapter weights by a factor of 10 by setting alpha = 10r.
We observe the following:
- The optimal learning rate ratio is rank-dependent: it is `10x` at rank 16, `3x` for both rank 32 and rank 64, and only `1x` for rank 128
- Performance peaks at rank 16, as opposed to rank 32 shown previously
- The highest accuracy recorded is higher than the one seen in the previous setup, which uses `32/r` consistently across all ranks
Although the maximum accuracy belongs to rank 16, which happens to validate the 10x optimal ratio, we strongly hypothesize that this is an artifact of the model and dataset used and not linked to the ratio itself.
With that highlighted, we observe that an LR 10x higher than FullFT's is not always linked to the best performance; rather, it results from a particular norm of the AB matrix, which is shaped by the initialization distribution, alpha/r, and finally rank.
Important:
From playing with alpha, it appears that when the optimal learning rate is consistent across ranks, it is not due to 1/r as the Lora-Without-Regret blog claims, but rather to alpha/r with alpha being a constant.
When we change the alpha dynamics, we observe two main trends:
- Constant α → optimal LR ≈ invariant w.r.t. rank
- Constant α/r → optimal LR scales ↓ with rank
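The two alpha policies can be made concrete by printing the adapter scale multiplier across ranks (a sketch; the actual values are passed to the repo's script via `--alpha`):

```python
# Scale multiplier applied to the adapter output under each alpha policy.
def scale(alpha: float, r: int) -> float:
    return alpha / r

ranks = [16, 32, 64, 128]
for r in ranks:
    # Constant alpha=32: the multiplier shrinks as 1/r.
    # alpha=10r: the multiplier is fixed at 10 for every rank.
    print(f"rank={r:3d}  alpha=32 -> {scale(32, r):.2f}  alpha=10r -> {scale(10 * r, r):.1f}")
```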
We change the standard initialization, from A following a uniform distribution with B set to zero, to both A and B following a Gaussian distribution.
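The alternative initialization can be sketched as follows (illustrative dimensions and standard deviation; note that with B ≠ 0 the adapter perturbs the frozen pretrained weights from step zero, unlike the standard no-op init):

```python
import torch

r, d_in, d_out = 32, 64, 64

# Variant tested here: both A and B Gaussian, instead of
# A uniform / B zero. The adapter is therefore nonzero at init.
A = torch.randn(r, d_in) * 0.02
B = torch.randn(d_out, r) * 0.02

delta_W = (32 / r) * (B @ A)
# The initial update is nonzero, so the pretrained weights are
# effectively perturbed before any training step.
assert delta_W.abs().sum() > 0
```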
As expected, not only does the 10x ratio disappear, but learning also deteriorates drastically, with maximum accuracy capping at around ~25%.
We empirically confirm that the 10x ratio found repeatedly in the Thinking Machines blog and elsewhere is a result of the initialization regime controlled by the A and B distributions, alpha/r, and rank. We also show that a consistent optimal LR ratio across ranks is a result of a constant alpha scaled by 1/r, and not 1/r alone.
Lastly, changing the regime alone only affects the optimal rank and not the optimal LR ratio, as both DistilBERT and distilgpt2 show a LoRA LR 10x higher than FullFT's across all ranks under the same initialization regime.
To try the experiments above, run the commands below.
Install the dependencies:

```shell
pip install -r requirements.txt
```

`requirements.txt` contains:

```
torch
transformers
datasets
scikit-learn
```
LoRA (default):

```shell
python main.py
python main.py --epochs 10 --rank 64 --alpha 128
```

Full fine-tuning:

```shell
python main.py --full-finetune
python main.py --full-finetune --epochs 10
```

Arguments:

- `--lrs`: Learning rates to sweep (default: `1e-4 5e-4 1e-3`)
- `--epochs`: Number of training epochs (default: `5`)
- `--rank`: LoRA rank (default: `128`)
- `--alpha`: LoRA alpha scaling (default: `256`)
- `--batch_size`: Batch size (default: `32`)
- `--seed`: Random seed (default: `42`)
- `--full-finetune`: Use full fine-tuning instead of LoRA
Results are saved to `results.json`:

```json
{
  "0.0001": {
    "best_val_loss": 0.1989,
    "test_acc": 0.9234,
    "train_losses": [...],
    "val_losses": [...]
  }
}
```

If you find this work useful, please cite it as:
```bibtex
@misc{brokttv2025vit,
  title = {Lora-Without-Regret: Explaining the ratio between LoRA and FullFT learning rates},
  author = {Brokttv},
  year = {2026},
  howpublished = {\url{https://github.com/Brokttv/Lora-Without-Regret}},
}
```