
Conversation


@batuhanozkose commented Jan 3, 2026

What I changed

Added U-Net skip connections and Drop-Muon optimization.

Changes (step-by-step progression):

  1. torch.compile reduce-overhead
  2. adamw_lr tuning (0.006 → 0.015)
  3. weight_decay reduction (0.2 → 0.1)
  4. U-Net skip connections
  5. Drop-Muon (p=0.1)

Each change was validated before moving to the next. The PR shows the final state, but I have the full experiment log.
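For reference, here is a minimal sketch of the shape of the forward-pass change. It is not the literal diff: the module and attribute names below are placeholders, and the actual blocks in this repo look different.

```python
import torch
import torch.nn as nn

# Minimal sketch of U-Net-style skip connections across transformer blocks,
# assuming "U-Net skip connections" means: activations from the first half of
# the layer stack are saved and added back (through a learned gate) to the
# mirrored layers in the second half. Names here are illustrative placeholders.
class UNetTransformerStack(nn.Module):
    def __init__(self, d_model: int, n_layers: int, n_heads: int):
        super().__init__()
        assert n_layers % 2 == 0, "need an even number of layers to mirror them"
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # One learnable scalar gate per skip, initialized small so training
        # starts close to the original (skip-free) network.
        self.skip_gates = nn.Parameter(torch.full((n_layers // 2,), 0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = len(self.blocks) // 2
        saved = []
        for block in self.blocks[:half]:                 # first half: save activations
            x = block(x)
            saved.append(x)
        for i, block in enumerate(self.blocks[half:]):   # second half: add them back
            x = x + self.skip_gates[i] * saved[-(i + 1)]
            x = block(x)
        return x

if __name__ == "__main__":
    stack = UNetTransformerStack(d_model=64, n_layers=4, n_heads=4)
    out = stack(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

The gates start small so the skip path is cheap for the optimizer to ignore if it isn't helping.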

Benchmark Results (RTX 4090, 8M tokens, 7 eval milestones)

| Run | Time | Val Loss | Train Loss |
| --- | --- | --- | --- |
| 1 | 1m 56s 405ms | 4.7850 | 4.6734 |
| 2 | 1m 56s 647ms | 4.7860 | 4.6696 |

@bigwolfeman

This is not an apples-to-apples comparison, and that is a big deal when it comes to loss metrics. Cosine decay is the correct thing to do for real training, but we want a sane constant LR for the small test runs. This is one of the problems with evaluating this sort of thing: there are a lot of knobs we make decisions on, and they are very messy.

For some changes to be viable, these parameters need to be fuzzed. The result is also extremely sensitive to the LR warmup and decay behavior. We want to keep the warmup and decay behavior as simple as possible, since those hyperparameters (and adamw_lr) might need to be changed through fuzzing.
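To be concrete about "keep it simple": for short benchmark runs I would want something like a short linear warmup into a flat LR, with cosine decay reserved for real training. This is a sketch, not the scheduler that is actually in train_llm.py, and the step counts are made up:

```python
import math

# Illustrative schedule: short linear warmup, then either a constant LR (for
# small test runs) or cosine decay down to 10% of base (for real training).
def lr_at_step(step: int, max_steps: int, base_lr: float,
               warmup_steps: int = 100, schedule: str = "constant") -> float:
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if schedule == "constant":
        return base_lr
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * (0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress)))

# Same warmup, then the two schedules diverge (0.015 taken from the PR's adamw_lr).
for s in (0, 50, 100, 1000, 5000):
    print(s, round(lr_at_step(s, 5000, 0.015), 5),
          round(lr_at_step(s, 5000, 0.015, schedule="cosine"), 5))
```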

If you look in the discussions, I ran a very long fuzzing sweep that yielded much stronger numbers for this, btw.

The actual forward pass changes look good.


These changes to the Muon optimizer are a potentially massive footgun: optimization pressure can build up and cause divergence. This is specific to what you're doing.
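To spell out where the risk comes from, here is my simplified reading of a Drop-Muon-style step. I am guessing at the exact semantics of p=0.1, and the SVD below is a stand-in for Muon's Newton-Schulz orthogonalization, so treat this as a sketch rather than your actual code:

```python
import torch

# Sketch of a Drop-Muon-style step (assumed semantics): with probability p the
# orthogonalized update for a weight matrix is simply not applied that step.
# The momentum buffer is still fed every step, so optimizer state keeps
# accumulating while the weight itself does not move.
def drop_muon_step(weight: torch.Tensor, grad: torch.Tensor,
                   momentum: torch.Tensor, lr: float = 0.02,
                   beta: float = 0.95, drop_p: float = 0.1) -> None:
    momentum.mul_(beta).add_(grad)          # momentum always accumulates
    if torch.rand(()) < drop_p:
        return                              # update dropped this step
    # SVD stand-in for Muon's Newton-Schulz orthogonalization of the momentum.
    U, _, Vh = torch.linalg.svd(momentum, full_matrices=False)
    weight.add_(U @ Vh, alpha=-lr)

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(128, 128) * 0.02
    m = torch.zeros_like(w)
    for _ in range(100):
        g = torch.randn_like(w)             # stand-in gradient
        drop_muon_step(w, g, m)
    print("momentum norm:", m.norm().item(), "weight norm:", w.norm().item())
```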

You need to train this out to 100M tokens and show that it is stable. If it were me, I would run 1k steps and measure floating-point divergence over time to empirically show that it won't destabilize training 1B tokens in.
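Something along these lines is what I mean by the 1k-step check. This uses a toy model, synthetic batches, and plain AdamW with the PR's before/after settings; only the measurement pattern matters, and the real check would use the actual model, data pipeline, and the real Muon/Drop-Muon optimizers:

```python
import torch
import torch.nn as nn

# Rough sketch of a 1k-step stability check: train a baseline copy and a
# modified copy from identical weights on identical batches, and log how far
# the two parameter sets drift apart over time. Bounded drift is fine;
# compounding drift is the early warning sign.
def make_model() -> nn.Module:
    torch.manual_seed(0)                    # identical init for both copies
    return nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

def param_distance(a: nn.Module, b: nn.Module) -> float:
    return sum((pa - pb).norm().item()
               for pa, pb in zip(a.parameters(), b.parameters()))

baseline, modified = make_model(), make_model()
opt_a = torch.optim.AdamW(baseline.parameters(), lr=0.006, weight_decay=0.2)
opt_b = torch.optim.AdamW(modified.parameters(), lr=0.015, weight_decay=0.1)

torch.manual_seed(1)
loss_fn = nn.MSELoss()
for step in range(1000):
    x, y = torch.randn(32, 64), torch.randn(32, 64)   # same batch for both copies
    for model, opt in ((baseline, opt_a), (modified, opt_b)):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}  param drift {param_distance(baseline, modified):.4f}")
```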

This is still changing num_workers in train_llm.py. That is bad unless 2 was actually unstable.

You are also still changing the torch.compile settings on top of everything else, so this is not an apples-to-apples comparison.

This is five different things being touched at once, with the claim that the combination gave an improvement. You may find that some of them are actually hurting performance; they need to be tested one at a time and PRed one at a time.

@batuhanozkose marked this pull request as draft January 4, 2026 12:04
@batuhanozkose marked this pull request as ready for review January 4, 2026 15:23