muP (Maximal Update Parametrization)
This is an advance showing that if one does learning-rate optimization on a smaller model, one can transfer the result to larger models using this framework.
From what I can see, the vocab size and number of layers need to remain the same, but this gives a method for finding the optimal learning rate for larger models (all larger-width models) from a single sweep on a smaller model.
The claim is that this also holds true across datasets, and I think it also assumes that higher and lower learning rates don't cross the smaller model's validation-loss-vs-iteration curve (something I'm not sure is true, since schedulers -- and dropping the learning rate -- have such a large impact on the final loss).
Notes: changing the model architecture requires retuning; this scales only the embedding (width) dimension. A rough sketch of the width scaling is shown below.
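
To make the idea concrete, here is a minimal PyTorch sketch of muP-style Adam learning-rate scaling with width: the learning rate tuned at a base width is reused for wider models, with matrix-like hidden weights getting their rate divided by the width multiplier. The parameter-grouping heuristic, the `ModuleDict` toy model, and all names here are illustrative assumptions, not the actual nanoGPT-mup implementation (which also handles initialization and output-multiplier scaling).

```python
import torch
import torch.nn as nn


def mup_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Split parameters into muP-style Adam groups (simplified sketch):
    matrix-like hidden weights get base_lr / (width / base_width),
    while embeddings, biases, and norms keep the base learning rate."""
    width_mult = width / base_width
    matrix_like, other = [], []
    for name, p in model.named_parameters():
        # Heuristic: 2-D weights that are not embeddings count as "hidden"/matrix-like.
        if p.ndim == 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            other.append(p)
    return [
        {"params": matrix_like, "lr": base_lr / width_mult},  # LR ~ 1/width for hidden matrices
        {"params": other, "lr": base_lr},                     # embeddings/biases: LR unchanged
    ]


# Usage sketch: tune base_lr on a small model (width == base_width),
# then reuse the same base_lr when instantiating a wider model.
model = nn.ModuleDict({
    "embed": nn.Embedding(50257, 1024),
    "hidden": nn.Linear(1024, 1024),
    "readout": nn.Linear(1024, 50257, bias=False),
})
optimizer = torch.optim.AdamW(
    mup_param_groups(model, base_lr=3e-3, base_width=256, width=1024)
)
```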
Reference: https://github.com/EleutherAI/nanoGPT-mup
Image from paper; image source: https://cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization
