Description
In our Scaling Laws experiments, we set a fixed warmup duration (1000 iterations) across all runs, regardless of global batch size and token budget. A fixed warmup was chosen so that we could cool down at various token budgets and thus re-use previous runs and save compute. However, when the global batch size and token budget differ between runs, a fixed warmup duration represents a higher or lower fraction of the total training iterations. By our calculations, the warmup duration ranges from at most 35% down to as little as 0.25% of the total training iterations.
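A quick back-of-the-envelope check of how a fixed 1000-iteration warmup translates into a training fraction, as a minimal sketch. The sequence length (4096 tokens) is an assumption for illustration; under it, the 6B-token / batch-size-512 run described below lands at roughly 2861 iterations and a ~35% warmup fraction, matching the maximum quoted above.

```python
# Minimal sketch: fraction of training spent in warmup for a fixed warmup duration.
# SEQ_LEN = 4096 is an assumed sequence length, not confirmed in this issue.

WARMUP_ITERS = 1000
SEQ_LEN = 4096  # assumed tokens per sample


def total_iterations(token_budget: float, global_batch_size: int) -> int:
    """Number of optimizer steps needed to consume the token budget."""
    tokens_per_iter = global_batch_size * SEQ_LEN
    return round(token_budget / tokens_per_iter)


def warmup_fraction(token_budget: float, global_batch_size: int) -> float:
    """Fraction of total training iterations covered by the fixed warmup."""
    return WARMUP_ITERS / total_iterations(token_budget, global_batch_size)


# Example: the run described in this issue (6B tokens, global batch size 512)
iters = total_iterations(6e9, 512)                       # ~2861 iterations
print(f"{iters} iterations, warmup = {warmup_fraction(6e9, 512):.1%}")  # ~35%
```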
Initial results suggest that a long warmup fraction leads to higher loss. Here, we will sweep the warmup fraction and learning rate at the highest global batch size to verify whether a shorter warmup yields substantially lower loss.
Model: 50M parameters, token budget: 6B, global batch size: 512, which results in 2861 training iterations. Sweep over warmup fraction [1, 5, 10, 20] (as a percentage of total iterations) and learning rate [5e-4, 1e-3, 2e-3, 4e-3, 8e-3].
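For reference, a minimal sketch of the resulting sweep grid. The helper name `make_run_config` and the config fields are hypothetical placeholders; the iteration count (2861) and the swept values are taken from the text above, and the interpretation of the warmup values as percentages is an assumption.

```python
# Hypothetical sweep-grid sketch: 4 warmup fractions x 5 learning rates = 20 runs.
from itertools import product

TOTAL_ITERS = 2861
WARMUP_FRACTIONS = [1, 5, 10, 20]                    # % of total iterations (assumed unit)
LEARNING_RATES = [5e-4, 1e-3, 2e-3, 4e-3, 8e-3]


def make_run_config(warmup_pct: float, lr: float) -> dict:
    """Hypothetical run config: converts a warmup percentage into warmup steps."""
    return {
        "model_size": "50M",
        "global_batch_size": 512,
        "train_iters": TOTAL_ITERS,
        "lr": lr,
        "warmup_iters": round(TOTAL_ITERS * warmup_pct / 100),
    }


configs = [make_run_config(w, lr) for w, lr in product(WARMUP_FRACTIONS, LEARNING_RATES)]
print(len(configs), configs[0])   # 20 runs; e.g. 1% warmup -> ~29 warmup iterations
```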