
[ForgeEngine] fix broken ForgeEngine #2429

Merged
tianyu-l merged 2 commits into pytorch:main from hann-wang:han/fix_broken_forge_engine on Feb 25, 2026

Conversation

@hann-wang (Contributor):

* support launching custom trainer;
* init trainer components through .build() (pytorch#2386);
* move data to GPU by micro-batch;
* remove rescale_accumulated_loss (pytorch#2206).
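The third bullet, moving data to the GPU per micro-batch rather than transferring the whole batch up front, can be sketched as follows. This is a hypothetical illustration with made-up names (`split_into_microbatches`, `train_step`), not the actual torchtitan implementation:

```python
# Hypothetical sketch of per-micro-batch device transfer; names are
# illustrative, not the actual torchtitan code.
import torch


def split_into_microbatches(batch: torch.Tensor, num_microbatches: int):
    """Split a full batch along dim 0 into micro-batches."""
    return torch.chunk(batch, num_microbatches, dim=0)


def train_step(batch: torch.Tensor, num_microbatches: int, device: str) -> float:
    total_loss = 0.0
    for micro in split_into_microbatches(batch, num_microbatches):
        # Transfer only the current micro-batch, so peak device memory
        # scales with the micro-batch size, not the full batch size.
        micro = micro.to(device, non_blocking=True)
        loss = (micro ** 2).mean()  # stand-in for the real loss computation
        total_loss += loss.item()
    return total_loss / num_microbatches
```

The key point is that the `.to(device)` call happens inside the micro-batch loop instead of once on the full batch.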
@meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Feb 24, 2026.
@tianyu-l (Contributor) commented:

thanks, one comment

# pyrefly: ignore [missing-attribute]
trainer = config.build()
if custom_trainer_class is not None:
    trainer = custom_trainer_class(config)
Contributor:

shouldn't need to modify this file?

@hann-wang (Author):

The example trainer is initialized with main(CustomTrainer):
https://github.com/hann-wang/torchtitan/blob/8c6fee127b689ca9a8bb52d2045be597542e8ebe/torchtitan/experiments/forge/example_train.py#L408

config_registry is returning a Trainer.Config object and config.build() is not the custom trainer we want. Is there a better way to init a custom trainer?

@joecummings (Member) commented on Feb 24, 2026:

This raises a good point about the extensibility of these new Titan changes with other forms of training. I see a few options here:

  1. If modifying this file, change the main function to take in any class that satisfies a "Trainer" protocol (I think technically the only requirement for main is that the class can be built from a config and includes train and close methods), defaulting to the Trainer from torchtitan.trainer. This logic is easier to follow than a custom_trainer_class that is used if provided and otherwise falls back to a possibly unknown class.
    def main(trainer: type[Trainer] = torchtitan.trainer.Trainer) ...

  2. If we're not ready to tackle the extensibility of torchtitan for other forms of training, copy the relevant logic of main.py to L408 of the forge example to make it work with the example. This should be straightforward.

Unless this needs to be unblocked immediately, I would have a slight preference for something like 1.

@tianyu-l ?
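Option 1 above could be sketched with a `typing.Protocol`. This is a hypothetical illustration, not torchtitan API; the protocol's requirements (constructible from a config, with `train` and `close` methods) are taken from the comment, and all names here are made up:

```python
# Hypothetical sketch of option 1: main() accepts any class satisfying
# a minimal "Trainer" protocol, defaulting to a known trainer class.
from typing import Protocol, runtime_checkable


@runtime_checkable
class TrainerLike(Protocol):
    def train(self) -> None: ...
    def close(self) -> None: ...


class DefaultTrainer:
    """Stand-in for torchtitan's default Trainer."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.steps_run = 0

    def train(self) -> None:
        self.steps_run += 1

    def close(self) -> None:
        pass


def main(config: dict, trainer_cls: type = DefaultTrainer) -> TrainerLike:
    # Any class constructible from a config works; callers pass a
    # custom class to override the default.
    trainer = trainer_cls(config)
    try:
        trainer.train()
    finally:
        trainer.close()
    return trainer
```

A caller with a custom trainer would then invoke `main(config, MyCustomTrainer)` instead of relying on a `custom_trainer_class` branch inside `main`.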

Contributor:

config_registry is returning a Trainer.Config object and config.build() is not the custom trainer we want. Is there a better way to init a custom trainer?

To use a custom trainer, you need to assemble a config whose root node is a custom trainer config instance.
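The config-root approach described above might look like the following. This is a hypothetical sketch of the pattern (a config dataclass whose `build()` returns its paired trainer), not torchtitan's actual config system:

```python
# Hypothetical sketch: each trainer class pairs with a config dataclass
# whose build() constructs that trainer, so config.build() yields the
# custom trainer whenever the root config node is the custom one.
from dataclasses import dataclass


class Trainer:
    def __init__(self, lr: float) -> None:
        self.lr = lr


class CustomTrainer(Trainer):
    pass


@dataclass
class TrainerConfig:
    lr: float = 1e-4

    def build(self) -> Trainer:
        return Trainer(self.lr)


@dataclass
class CustomTrainerConfig(TrainerConfig):
    def build(self) -> Trainer:
        # Overriding build() is what makes config.build() return the
        # custom trainer instead of the base one.
        return CustomTrainer(self.lr)


config = CustomTrainerConfig(lr=3e-4)
trainer = config.build()
```

Under this pattern the launcher never needs a `custom_trainer_class` argument; swapping the root config class swaps the trainer.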

If we're not ready to tackle the extensibility of torchtitan for other forms of training, copy the relevant logic of main.py to L408 of the forge example to make it work with the example. This should be straightforward.

Yeah, I think we are not ready to tackle the extensibility challenge yet, so I'm OK with option 2 to favor separation of issues for now. Sorry that I broke the forge example, but I'm curious why you'd use it anyway?

@hann-wang (Author):

Going for option 2.

Sorry that I broke the forge example, but curious why you'd use it anyway?

I have a custom trainer that integrates Quantization/Sparsification-Aware Training.
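A custom trainer of the kind mentioned above typically subclasses the base trainer and wraps each step with extra hooks. The following is a hypothetical sketch of that shape only; the class and hook names are invented and this is not the author's actual Quantization/Sparsification-Aware Training code:

```python
# Hypothetical sketch of why a custom trainer class is useful: a
# subclass wrapping each training step with QAT-style hooks.
class Trainer:
    """Stand-in for the base trainer."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.log: list[str] = []

    def train_step(self) -> None:
        self.log.append("step")


class QATTrainer(Trainer):
    """Custom trainer inserting hooks around each step."""

    def train_step(self) -> None:
        # Real QAT would insert observers/fake-quant modules; here the
        # log entries just mark where those hooks would run.
        self.log.append("fake_quantize")
        super().train_step()
        self.log.append("restore")
```

Because the hooks live inside `train_step`, launching such a trainer requires constructing the subclass itself, which is why the example's entry point matters.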

@tianyu-l tianyu-l merged commit a4ca5a4 into pytorch:main Feb 25, 2026
3 checks passed