Description
In our Scaling Laws experiments, we set a fixed warmup duration (1000 iterations) across all runs, regardless of global batch size and token budget. A fixed warmup was chosen so that we could cool down at various token budgets and thus re-use previous runs and save compute. However, when the global batch size and token budget differ between runs, a fixed warmup duration represents a higher or lower fraction of the total training iterations. By our calculations, the warmup duration ranges from at most 35% down to as little as 0.25% of the total training iterations.
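A quick back-of-the-envelope check of how a fixed 1000-iteration warmup translates into a training fraction, as a minimal sketch. The sequence length (4096 tokens) is an assumption for illustration; under it, the 6B-token / batch-size-512 run described below lands at roughly 2861 iterations and a ~35% warmup fraction, matching the maximum quoted above.

```python
# Minimal sketch: fraction of training spent in warmup for a fixed warmup duration.
# SEQ_LEN = 4096 is an assumed sequence length, not confirmed in this issue.

WARMUP_ITERS = 1000
SEQ_LEN = 4096  # assumed tokens per sample


def total_iterations(token_budget: float, global_batch_size: int) -> int:
    """Number of optimizer steps needed to consume the token budget."""
    tokens_per_iter = global_batch_size * SEQ_LEN
    return round(token_budget / tokens_per_iter)


def warmup_fraction(token_budget: float, global_batch_size: int) -> float:
    """Fraction of total training iterations covered by the fixed warmup."""
    return WARMUP_ITERS / total_iterations(token_budget, global_batch_size)


# Example: the run described in this issue (6B tokens, global batch size 512)
iters = total_iterations(6e9, 512)                       # ~2861 iterations
print(f"{iters} iterations, warmup = {warmup_fraction(6e9, 512):.1%}")  # ~35%
```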
Initial results suggest that a long warmup fraction leads to higher loss. Here, we will sweep the warmup fraction and learning rate at the highest global batch size to verify whether a shorter warmup yields substantially lower loss.
Model: 50M parameters, token budget: 6B, global batch size: 512, which results in 2861 training iterations. Sweep over warmup fraction [1, 5, 10, 20] (as a percentage of total iterations) and learning rate [5e-4, 1e-3, 2e-3, 4e-3, 8e-3].
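For reference, a minimal sketch of the resulting sweep grid. The helper name `make_run_config` and the config fields are hypothetical placeholders; the iteration count (2861) and the swept values are taken from the text above, and the interpretation of the warmup values as percentages is an assumption.

```python
# Hypothetical sweep-grid sketch: 4 warmup fractions x 5 learning rates = 20 runs.
from itertools import product

TOTAL_ITERS = 2861
WARMUP_FRACTIONS = [1, 5, 10, 20]                    # % of total iterations (assumed unit)
LEARNING_RATES = [5e-4, 1e-3, 2e-3, 4e-3, 8e-3]


def make_run_config(warmup_pct: float, lr: float) -> dict:
    """Hypothetical run config: converts a warmup percentage into warmup steps."""
    return {
        "model_size": "50M",
        "global_batch_size": 512,
        "train_iters": TOTAL_ITERS,
        "lr": lr,
        "warmup_iters": round(TOTAL_ITERS * warmup_pct / 100),
    }


configs = [make_run_config(w, lr) for w, lr in product(WARMUP_FRACTIONS, LEARNING_RATES)]
print(len(configs), configs[0])   # 20 runs; e.g. 1% warmup -> ~29 warmup iterations
```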