forked from karpathy/nanoGPT
Open
Description
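This feature would monitor LM head overlap for each of the earlier layers, i.e. how closely the token predictions read out from an intermediate layer match those of the final layer.
Two variations:
- Learned LM heads per layer, included in backprop (how exactly is still open).
- Reuse the word token embedding table (WTE) as the head for every layer, excluded from backprop.
Bonus: include the early exit innovations from the SLED paper: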
https://arxiv.org/pdf/2411.02433
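
As a starting point for discussion, here is a minimal sketch of the second variation (shared WTE, no backprop). It is framework-agnostic NumPy rather than nanoGPT code, and everything named here is an assumption for illustration: the helper `lm_head_overlap`, the parameter-free `layer_norm` standing in for the model's final layer norm, the choice of top-1 agreement with the final layer as the "overlap" metric, and the random toy data. Inside the actual model the same computation would presumably run on the per-block hidden states with gradients disabled.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Parameter-free layer norm over the channel axis (assumption: stands in
    # for the model's learned final layer norm).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def lm_head_overlap(hidden_states, wte):
    """Hypothetical helper: top-1 agreement of each layer with the final layer.

    hidden_states: list of (T, C) arrays, one per layer, final layer last.
    wte:           (V, C) word token embedding table, reused as the LM head.
    Returns one float per layer: the fraction of positions whose top-1 token,
    read out through wte, matches the final layer's top-1 token.
    """
    # Project every layer's hidden state through the shared embedding table.
    logits = [layer_norm(h) @ wte.T for h in hidden_states]
    final_top1 = logits[-1].argmax(axis=-1)
    return [float((l.argmax(axis=-1) == final_top1).mean()) for l in logits]


# Toy usage with random data (not real model activations).
rng = np.random.default_rng(0)
n_layers, T, C, V = 4, 16, 32, 100
wte = rng.normal(size=(V, C))
hidden_states = [rng.normal(size=(T, C)) for _ in range(n_layers)]

overlaps = lm_head_overlap(hidden_states, wte)
for i, o in enumerate(overlaps):
    print(f"layer {i}: top-1 overlap with final layer = {o:.2f}")
```

Top-1 agreement is only one possible definition of overlap; top-k set overlap or a KL divergence between the per-layer and final distributions would slot into the same loop. The first variation (learned heads per layer) would replace `wte` with a separate trainable matrix per layer and is not covered by this sketch.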