test warmup effect on 50M model 6B #158

@dianaonutu

Description

In our Scaling Laws experiments, we set a fixed warmup duration (1000 iterations) across all runs, regardless of global batch size and token budget. A fixed warmup was chosen so we could cool down at various token budgets and thus re-use previous runs and save compute. However, because global batch size and token budget differ across runs, a fixed warmup duration represents a different fraction of the total training iterations in each run: by our calculation, between 0.25% and 35%.
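For concreteness, a minimal sketch of this arithmetic. The 4096 sequence length is an assumption (it is what makes 6B tokens at global batch size 512 come out to ~2861 iterations), and the function name is hypothetical:

```python
SEQ_LEN = 4096       # assumed tokens per sequence (not stated in the issue)
WARMUP_ITERS = 1000  # fixed warmup duration used across all runs

def warmup_fraction(token_budget: int, global_batch_size: int) -> float:
    """Fraction of total training iterations spent in the fixed warmup."""
    tokens_per_iter = global_batch_size * SEQ_LEN
    total_iters = token_budget // tokens_per_iter
    return WARMUP_ITERS / total_iters

# The run in this issue: 6B tokens at global batch size 512 -> 2861 iterations,
# so the fixed 1000-iteration warmup covers ~35% of training (the maximum above).
print(warmup_fraction(6_000_000_000, 512))  # ~0.35
```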

Initial results show that a long warmup fraction leads to higher loss. Here, we will sweep warmup fraction and learning rate at the highest global batch size to verify whether we obtain a much lower loss.

Setup: 50M model, 6B token budget, global batch size 512, which results in 2861 training iterations. Sweep over warmup fraction (%) [1, 5, 10, 20] and learning rate [5e-4, 1e-3, 2e-3, 4e-3, 8e-3].
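A sketch of the resulting 4 × 5 grid, assuming the warmup fractions are percentages of the 2861 total iterations (field names are illustrative, not our actual config schema):

```python
TOTAL_ITERS = 2861  # 50M model, 6B tokens, global batch size 512

warmup_fractions_pct = [1, 5, 10, 20]
learning_rates = [5e-4, 1e-3, 2e-3, 4e-3, 8e-3]

runs = [
    {"warmup_iters": round(TOTAL_ITERS * pct / 100), "warmup_pct": pct, "lr": lr}
    for pct in warmup_fractions_pct
    for lr in learning_rates
]

print(len(runs))  # 20 runs in the sweep
print(runs[0])    # {'warmup_iters': 29, 'warmup_pct': 1, 'lr': 0.0005}
```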

Status: In Progress