
Add support for Maximal Update Parametrization (muP) #372

@gkielian

Description

muP

muP is an approach where learning-rate tuning done on a smaller model transfers to larger models, so a single sweep on a small proxy can set the learning rate for full-size runs within this framework.

From what I can see, the vocab size and number of layers need to stay the same, but this gives a method for finding the optimal learning rate for larger models (all larger-width models) from a single sweep of a smaller one.
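
As a rough sketch of the mechanics (my assumption of how this could look in PyTorch, not the referenced repo's actual code; the parameter-name checks assume nanoGPT-style naming like wte/wpe/lm_head): under muP with Adam, hidden matrix-like weights get their learning rate divided by the width multiplier m = width / base_width, while embeddings, biases, and norms keep the base rate.

```python
# Minimal sketch of muP-style learning-rate scaling via optimizer param
# groups. Assumptions: Adam-family optimizer, nanoGPT-style parameter
# names (wte/wpe embeddings, lm_head unembedding); not this PR's code.
import torch

def mup_param_groups(model, base_lr, base_width, width):
    """Scale hidden-weight LRs by 1/m, where m is the width multiplier."""
    m = width / base_width  # e.g. 1024 / 128 = 8
    hidden, other = [], []
    for name, p in model.named_parameters():
        # Matrix-like hidden weights (attention/MLP projections) use lr/m;
        # embedding-like tensors, biases, and norms keep the base rate.
        if p.dim() >= 2 and not any(k in name for k in ("wte", "wpe", "lm_head")):
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr / m},
        {"params": other, "lr": base_lr},
    ]

# Usage: sweep base_lr once at base_width, then reuse it at any larger width:
# optimizer = torch.optim.AdamW(mup_param_groups(model, 3e-3, 128, 1024))
```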

The claim is that this also holds across datasets. I think it additionally assumes that the validation-loss-vs-iteration curves for higher and lower learning rates don't cross the smaller model's curve, which I'm not sure is true, since schedulers (and dropping the learning rate) have such a large impact on the final loss.

Notes: changing the model architecture requires retuning; this scales only the embedding dimension (model width).
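
To make that note concrete, a hypothetical example of what changes between the proxy sweep and the scaled-up run (nanoGPT-style config fields; the values are illustrative, and the head count scales with width so the per-head dimension stays fixed):

```python
# Hypothetical configs: only the width-related fields change between the
# tuned proxy and the target; depth and vocab size stay fixed as noted above.
proxy  = dict(n_layer=12, n_head=4,  n_embd=128,  vocab_size=50304)  # swept
target = dict(n_layer=12, n_head=32, n_embd=1024, vocab_size=50304)  # reused
# The learning rate found on `proxy` transfers to `target` under muP,
# with hidden-weight LRs divided by m = 1024 / 128 = 8.
```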

Reference: https://github.com/EleutherAI/nanoGPT-mup

Image from the paper (source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization)
