many researchers assume that if they use muP (a method by Yang et al., 2022, for transferring hyperparameters across model sizes), they are safe from LR-tuning headaches.
turns out this is not so when scaling the token horizon.
the core problem muP solves is transfer of hyperparameters across model width; it does not stabilize the LR across token horizons. so even though muP lets you keep the same LR as the model gets wider, if you plan to train longer, you still must lower your learning rate.
optimal LR vs token horizon for a 50M model using muP parameterization. we can see that optimal LR decreases with longer token horizons, showing that LR does not transfer across horizons even with muP. plot taken from https://arxiv.org/pdf/2409.19913
as seen in the figure above, if muP had solved horizon transfer, the optimal LR would remain constant regardless of how long the training run was. instead, the optimal LR shifts significantly to the left (lower) as the horizon increases from 25B to 100B tokens.
what muP usually solves (spatial scaling)
standard neural networks become unstable as you make them wider (adding more neurons in hidden layers). if you double the width of a layer, each neuron in the next layer sums over twice as many incoming signals, so activations (and with them, gradients) can grow with width and eventually explode.
muP, at its core, ensures that no matter how wide the model is (100M or 100B params), the signal strength inside the network stays the same.
this leads to the useful property that the optimal LR for a small model is usually the same as the optimal LR for a giant model, which is what they call "zero-shot transfer across model size."
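to see the width problem concretely, here is a tiny numpy sketch (my own illustration, not code from the muP paper): with a fixed per-weight init scale, the typical pre-activation size grows with width, while a fan-in-scaled init (the kind of correction muP builds on) keeps it roughly constant no matter how wide the layer is.

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivation_std(width, fan_in_scaled):
    # one linear layer y = W @ x with a standard-normal input x
    x = rng.normal(size=width)
    # naive init: fixed per-weight std regardless of width
    # fan-in init: std shrinks as 1/sqrt(width), keeping |y| width-independent
    std = 1.0 / np.sqrt(width) if fan_in_scaled else 0.02
    W = rng.normal(scale=std, size=(256, width))
    return (W @ x).std()

for width in [256, 1024, 4096, 16384]:
    print(f"width={width:>6}  naive={preactivation_std(width, False):.2f}  "
          f"fan-in scaled={preactivation_std(width, True):.2f}")
```

with the naive init, the pre-activation scale roughly doubles every time the width quadruples; with fan-in scaling it stays close to 1 at every width.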
but what happens when we want to scale along the temporal dimension (the token horizon)?
final validation loss as a function of max learning rate (LR) and token horizon for four models. the optimal LR (denoted by a black star) decreases monotonically with longer horizons for all models. image taken from arxiv.org/abs/2409.19913
the plot above shows that horizon scaling shifts the optimal LR. but why can't muP handle horizon scaling?
the signal accumulation problem
muP balances the signal at one specific forward/backward pass. it ensures the gradient update (Δw) is of the right size relative to the weight (w). however, training is a summation of updates over time!
muP ensures each Δw_t is well behaved, but it does not account for T (the total number of updates over the horizon). the final weights are roughly w_0 + Σ Δw_t summed over t = 1…T, and if T increases by 10x, that sum changes character: the "noise" in the updates accumulates differently over long horizons than over short ones.
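to get an intuition for this, here is a toy simulation (my own, not from the paper): SGD with noisy gradients on a 1-D quadratic stalls at a noise floor set by the LR, so in this toy setup the best LR shifts downward as the number of steps grows, qualitatively the same behavior the plots above show for LLMs.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_final_loss(lr, steps, noise=1.0, runs=200):
    # SGD on a 1-D quadratic loss 0.5 * w**2 with noisy gradients g = w + eps,
    # simulated for `runs` independent seeds in parallel; returns the mean final loss
    w = np.full(runs, 5.0)
    for _ in range(steps):
        g = w + rng.normal(scale=noise, size=runs)
        w -= lr * g
    return float(np.mean(0.5 * w**2))

# at a fixed LR the loss stalls at a noise floor (roughly lr/4 in this setup),
# so a longer horizon only pays off if the LR is lowered along with it
for steps in [100, 1_000, 10_000]:
    for lr in [0.1, 0.01, 0.001]:
        print(f"steps={steps:>6}  lr={lr:<6}  final loss={avg_final_loss(lr, steps):.5f}")
```

running this, the best LR among the three is 0.1 at 100 steps, 0.01 at 1,000 steps, and 0.001 at 10,000 steps: more steps only help if the LR comes down with them.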
getting back to the first image of the article, if muP solved horizon scaling, the optimal LR curve should look like a vertical line, i.e., the best LR at 25B tokens should also be the best LR at 50B and 100B tokens, but this is clearly not the case.
best LR for 25B tokens is not equal to the best LR at 50B and 100B tokens
another thing that is interesting to note is that the LR shifted towards the left (smaller LR) as the horizon increased.
even in a mathematically perfect muP model, the "noise floor" principle still applies. muP prevents your gradients from exploding due to width, but it doesn't stop your model from oscillating around the minimum: at a fixed LR, piling on more steps cannot push the loss below that floor.
so how do we incorporate time (the horizon)?
buying more tokens (horizon scaling) is paying for the privilege of reaching a lower loss, and to cash in that privilege, you must lower your LR.
there is a lot of good literature around this, but i will specifically present the ideas of Bjorck et al., 2025, which inspired me to write this post.
given some fixed model architecture, for a token horizon D, the optimal LR policy employed by the authors follows the functional form:

LR∗(D) = B · D^(−β)

here B and β are two constants independent of D that might, for example, depend on the model architecture; taking the logarithm on both sides, we get:

log LR∗(D) = log B − β · log D

this is a linear equation in the unknowns log B and β, which can be fit with least squares.
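as a sketch of how such a fit could look in practice (the (D, LR∗) pairs below are made up for illustration, not the paper's data):

```python
import numpy as np

# hypothetical (token horizon, empirically best LR) pairs -- illustrative only
D  = np.array([25e9, 50e9, 100e9, 200e9])
lr = np.array([3.0e-3, 2.4e-3, 1.9e-3, 1.5e-3])

# fit log LR* = log B - beta * log D with ordinary least squares
A = np.stack([np.ones_like(D), np.log(D)], axis=1)
(log_B, slope), *_ = np.linalg.lstsq(A, np.log(lr), rcond=None)
B, beta = np.exp(log_B), -slope
print(f"B = {B:.3g}, beta = {beta:.3f}")

# extrapolate to a longer horizon
print("predicted LR* at 400B tokens:", B * (400e9) ** (-beta))
```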
scaling laws for optimal LR vs token horizon. this plot compares the empirically best LR to the smooth prediction of the fitted power law presented above.
the R² for these fits comes out in the range 0.96 - 0.99 across all model sizes, which shows the scaling law provides a good fit to the empirical data.
the authors then go on to analyze the exponent (β), and find it to be relatively stable across model sizes, converging to approximately 0.32.
effect of model size
while not the main focus of this article, i still want to add a brief note about the effect of model size.
(a) linear relationship between log D and log LR∗ for different model sizes N.
(b) linear relationship between log N and log LR∗ for different values of dataset size D.
from (a) it is visible that the lines are roughly parallel, implying that β (the slope) is roughly the same regardless of model size, i.e.:

LR∗(D) ∝ D^(−β), with β independent of N

whereas (b) implies that the optimal LR scales w.r.t. model size following a similar power law, where α is the corresponding exponent:

LR∗(N) ∝ N^(−α)

this is the dimension which muP usually handles.
combining the two relations above, we can infer that:

LR∗(N, D) ∝ N^(−α) · D^(−β)

and after adding a constant C to achieve equality, the equation becomes:

LR∗(N, D) = C · N^(−α) · D^(−β)
empirically, the best values the authors find for the constants are:
the fit plot for the above equation:
fit of the above equation, compared to the experimental results. the data points for the 7B model (llama 1 architecture) are excluded at fitting time and used as validation data, on which the fit achieves an R² of 0.978.
the 7B model uses the llama architecture while the other data points use the GPT-3 architecture.
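to make the combined fit above concrete, here is a two-variable least-squares sketch in the same spirit as the single-horizon fit earlier (the (N, D, LR∗) triples are made up for illustration; the paper fits its constants on its own measurements):

```python
import numpy as np

# hypothetical (params N, tokens D, best LR) triples -- illustrative only
N  = np.array([125e6, 350e6, 760e6, 1.3e9, 125e6, 760e6])
D  = np.array([25e9,  25e9,  50e9,  100e9, 100e9, 200e9])
lr = np.array([4.2e-3, 3.1e-3, 2.0e-3, 1.2e-3, 2.7e-3, 1.3e-3])

# fit log LR* = log C - alpha * log N - beta * log D with least squares
A = np.stack([np.ones_like(N), np.log(N), np.log(D)], axis=1)
(log_C, a, b), *_ = np.linalg.lstsq(A, np.log(lr), rcond=None)
C, alpha, beta = np.exp(log_C), -a, -b
print(f"C = {C:.3g}, alpha = {alpha:.3f}, beta = {beta:.3f}")

# predict the optimal LR for a held-out configuration, e.g. a 7B model at 300B tokens
print("predicted LR*:", C * (7e9) ** (-alpha) * (300e9) ** (-beta))
```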
for practitioners
for people working with large models (say >= 760M params), it is recommended to use the final equation, where the authors have already found β = 0.32 to generalize across architectures. to find the optimal LR LR∗(D1) at some long horizon D1, you can just find the optimal LR LR∗(D2) at a short horizon D2 and then estimate:

LR∗(D1) = LR∗(D2) · (D2 / D1)^β, with β ≈ 0.32
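as a minimal sketch of that recipe (the function name and the numbers are my own illustration):

```python
def extrapolate_lr(lr_short, d_short, d_long, beta=0.32):
    """estimate the optimal LR at a long horizon from a short-horizon sweep,
    using LR*(D1) = LR*(D2) * (D2 / D1) ** beta."""
    return lr_short * (d_short / d_long) ** beta

# e.g. a sweep at 25B tokens found LR* = 3e-3; estimate the LR for a 250B-token run
print(extrapolate_lr(3e-3, 25e9, 250e9))   # -> roughly 1.4e-3
```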
footnotes
while this article explores the scaling laws w.r.t. token horizons, taking batch size into account also becomes crucial. i wanted to keep the scope narrower and emphasize muP in horizon scaling; there is good literature out there on optimal scaling laws for LRs and batch sizes to read and experiment with (which i've listed down below).
most of the plots and ideas (especially the equations) were taken from this paper, and if you've made it this far, i definitely recommend giving it a read.