
Conversation


@batuhanozkose commented Jan 3, 2026

What I changed

Added U-Net skip connections and Drop-Muon optimization.

Changes (step-by-step progression):

  1. torch.compile reduce-overhead
  2. adamw_lr tuning (0.006 → 0.015)
  3. weight_decay reduction (0.2 → 0.1)
  4. U-Net skip connections
  5. Drop-Muon (p=0.1)

Each change was validated before moving to the next. The PR shows the final state, but I have the full experiment log.
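For reference, here is a minimal sketch of the shape of the forward-pass change. It is not the literal diff: the module and attribute names below are placeholders, and the actual blocks in this repo look different.

```python
import torch
import torch.nn as nn

# Minimal sketch of U-Net-style skip connections across transformer blocks,
# assuming "U-Net skip connections" means: activations from the first half of
# the layer stack are saved and added back (through a learned gate) to the
# mirrored layers in the second half. Names here are illustrative placeholders.
class UNetTransformerStack(nn.Module):
    def __init__(self, d_model: int, n_layers: int, n_heads: int):
        super().__init__()
        assert n_layers % 2 == 0, "need an even number of layers to mirror them"
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # One learnable scalar gate per skip, initialized small so training
        # starts close to the original (skip-free) network.
        self.skip_gates = nn.Parameter(torch.full((n_layers // 2,), 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = len(self.blocks) // 2
        saved = []
        for block in self.blocks[:half]:                 # first half: save activations
            x = block(x)
            saved.append(x)
        for i, block in enumerate(self.blocks[half:]):   # second half: add them back
            x = x + self.skip_gates[i] * saved[-(i + 1)]
            x = block(x)
        return x

if __name__ == "__main__":
    stack = UNetTransformerStack(d_model=64, n_layers=4, n_heads=4)
    out = stack(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

The gates start small so the skip path is cheap for the optimizer to ignore if it isn't helping.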

Benchmark Results (RTX 4090, 8M tokens, 7 eval milestones)

| Run | Time | Val Loss | Train Loss |
| --- | --- | --- | --- |
| 1 | 1m 56s 405ms | 4.7850 | 4.6734 |
| 2 | 1m 56s 647ms | 4.7860 | 4.6696 |

@bigwolfeman

This is not an apples-to-apples comparison, and that is a big deal when it comes to loss metrics. Cosine decay is the correct thing to do for real training, but we want a sane constant LR for the small test runs. This is one of the problems with evaluating this sort of thing: there are a lot of knobs we make decisions on, and they are very messy.

For some changes to be viable, these parameters need to be fuzzed. The result is also extremely sensitive to the LR warmup and decay behavior. We want to keep the warmup and decay behavior as simple as possible, since those hyperparameters (and adamw_lr) might need to be changed through fuzzing.
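To be concrete about "keep it simple": for short benchmark runs I would want something like a short linear warmup into a flat LR, with cosine decay reserved for real training. This is a sketch, not the scheduler that is actually in train_llm.py, and the step counts are made up:

```python
import math

# Illustrative schedule: short linear warmup, then either a constant LR (for
# small test runs) or cosine decay down to 10% of base (for real training).
def lr_at_step(step: int, max_steps: int, base_lr: float,
               warmup_steps: int = 100, schedule: str = "constant") -> float:
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if schedule == "constant":
        return base_lr
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * (0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress)))

# Same warmup, then the two schedules diverge (0.015 taken from the PR's adamw_lr).
for s in (0, 50, 100, 1000, 5000):
    print(s, round(lr_at_step(s, 5000, 0.015), 5),
          round(lr_at_step(s, 5000, 0.015, schedule="cosine"), 5))
```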

If you look in the discussions, I ran a very long fuzzing sweep that yielded much stronger numbers for this, btw.

The actual forward pass changes look good.


These changes to the Muon optimizer are a potentially massive footgun: optimization pressure can build up and cause divergence. This is specific to what you're doing.
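To spell out where the risk comes from, here is my simplified reading of a Drop-Muon-style step. I am guessing at the exact semantics of p=0.1, and the SVD below is a stand-in for Muon's Newton-Schulz orthogonalization, so treat this as a sketch rather than your actual code:

```python
import torch

# Sketch of a Drop-Muon-style step (assumed semantics): with probability p the
# orthogonalized update for a weight matrix is simply not applied that step.
# The momentum buffer is still fed every step, so optimizer state keeps
# accumulating while the weight itself does not move.
def drop_muon_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum: torch.Tensor, lr: float = 0.02,
                   beta: float = 0.95, drop_p: float = 0.1) -> None:
    momentum.mul_(beta).add_(grad)          # momentum always accumulates
    if torch.rand(()) < drop_p:
        return                              # update dropped this step
    # SVD stand-in for Muon's Newton-Schulz orthogonalization of the momentum.
    U, _, Vh = torch.linalg.svd(momentum, full_matrices=False)
    weight.add_(U @ Vh, alpha=-lr)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(128, 128) * 0.02
    m = torch.zeros_like(w)
    for _ in range(100):
        g = torch.randn_like(w)             # stand-in gradient
        drop_muon_step(w, g, m)
    print("momentum norm:", m.norm().item(), "weight norm:", w.norm().item())
```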

You need to train this out to 100M tokens and show that it is stable. If it were me, I would run 1k steps and measure floating-point divergence over time to empirically show that it won't destabilize training 1B tokens in.
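Something along these lines is what I mean by the 1k-step check. This uses a toy model, synthetic batches, and plain AdamW with the PR's before/after settings; only the measurement pattern matters, and the real check would use the actual model, data pipeline, and the real Muon/Drop-Muon optimizers:

```python
import torch
import torch.nn as nn

# Rough sketch of a 1k-step stability check: train a baseline copy and a
# modified copy from identical weights on identical batches, and log how far
# the two parameter sets drift apart over time. Bounded drift is fine;
# compounding drift is the early warning sign.
def make_model() -> nn.Module:
    torch.manual_seed(0)                    # identical init for both copies
    return nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

def param_distance(a: nn.Module, b: nn.Module) -> float:
    return sum((pa - pb).norm().item()
               for pa, pb in zip(a.parameters(), b.parameters()))

baseline, modified = make_model(), make_model()
opt_a = torch.optim.AdamW(baseline.parameters(), lr=0.006, weight_decay=0.2)
opt_b = torch.optim.AdamW(modified.parameters(), lr=0.015, weight_decay=0.1)

torch.manual_seed(1)
loss_fn = nn.MSELoss()
for step in range(1000):
    x, y = torch.randn(32, 64), torch.randn(32, 64)   # same batch for both copies
    for model, opt in ((baseline, opt_a), (modified, opt_b)):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}  param drift {param_distance(baseline, modified):.4f}")
```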

This is still changing num_workers in train_llm.py. That is bad unless 2 was actually unstable.

You are also still changing the torch.compile settings on top of everything else, so this is not an apples-to-apples comparison.

This is five different things being touched at once, with the claim that the combination gave an improvement. You may find that some of them are actually hurting performance; they need to be tested one at a time and PRed one at a time.

@batuhanozkose marked this pull request as draft January 4, 2026 12:04
@batuhanozkose marked this pull request as ready for review January 4, 2026 15:23