

How to Reverse Engineer Neural Networks
here’s a fun fact: nobody fully understands why large language models work. we know the math, we know the architecture, we can train them. but ask “why did it output this specific token?” and nobody can give you a complete answer.
by suryansh

muP (maximal update parameterization) does not solve horizon scaling
many researchers assume that if they use muP (a method by Yang et al., 2022, for transferring hyperparameters across model sizes), they are safe from LR-tuning headaches.
turns out this guarantee does not extend to scaling the training horizon: hyperparameters tuned at one token count can still drift as you train for longer.
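for context, the transfer muP does provide is across *width*. a minimal sketch of that rule (assuming Adam, and simplified to the hidden-matrix case; `mup_lr` and the layer categories are illustrative names, not the full recipe from the paper):

```python
def mup_lr(base_lr: float, base_width: int, width: int, layer_type: str) -> float:
    """Illustrative muP-style learning-rate transfer across width (sketch).

    Hidden-matrix LRs shrink like 1/width so per-coordinate update size
    stays roughly width-independent; embedding/bias LRs are left unscaled.
    """
    if layer_type == "hidden":
        return base_lr * base_width / width
    return base_lr  # embeddings, biases: no width scaling in this sketch

# tune once at a small proxy width, reuse at the target width
print(mup_lr(1e-3, 256, 1024, "hidden"))     # hidden LR shrinks 4x
print(mup_lr(1e-3, 256, 1024, "embedding"))  # embedding LR unchanged
```

note that nothing in this rule mentions how many tokens you train for — which is exactly the gap the post is about.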
by andy


