muP (Maximal Update Parametrization)
This is an advance showing that if one does learning-rate optimization on a smaller model, one can transfer the result to larger models using this framework.
From what I can see, the vocab size and number of layers need to remain the same, but this gives a method for finding the optimal learning rate for larger models (all larger-width models) from a single sweep on a smaller model.
The claim is that this also holds true across datasets, and I think it also assumes that higher and lower learning rates don't cross the smaller model's validation-loss-vs-iteration curve (something I'm not sure is true, since schedulers -- and dropping the learning rate -- have such a large impact on the final loss).
Notes: changing the model architecture requires retuning; this scales only the embedding (width) dimension. A rough sketch of the width scaling is shown below.
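
To make the idea concrete, here is a minimal PyTorch sketch of muP-style Adam learning-rate scaling with width: the learning rate tuned at a base width is reused for wider models, with matrix-like hidden weights getting their rate divided by the width multiplier. The parameter-grouping heuristic, the `ModuleDict` toy model, and all names here are illustrative assumptions, not the actual nanoGPT-mup implementation (which also handles initialization and output-multiplier scaling).

```python
import torch
import torch.nn as nn


def mup_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Split parameters into muP-style Adam groups (simplified sketch):
    matrix-like hidden weights get base_lr / (width / base_width),
    while embeddings, biases, and norms keep the base learning rate."""
    width_mult = width / base_width
    matrix_like, other = [], []
    for name, p in model.named_parameters():
        # Heuristic: 2-D weights that are not embeddings count as "hidden"/matrix-like.
        if p.ndim == 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            other.append(p)
    return [
        {"params": matrix_like, "lr": base_lr / width_mult},  # LR ~ 1/width for hidden matrices
        {"params": other, "lr": base_lr},                     # embeddings/biases: LR unchanged
    ]


# Usage sketch: tune base_lr on a small model (width == base_width),
# then reuse the same base_lr when instantiating a wider model.
model = nn.ModuleDict({
    "embed": nn.Embedding(50257, 1024),
    "hidden": nn.Linear(1024, 1024),
    "readout": nn.Linear(1024, 50257, bias=False),
})
optimizer = torch.optim.AdamW(
    mup_param_groups(model, base_lr=3e-3, base_width=256, width=1024)
)
```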
Reference: https://github.com/EleutherAI/nanoGPT-mup
Image from paper; image source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization
