[ForgeEngine] fix broken ForgeEngine #2429
Conversation
hann-wang commented on Feb 24, 2026
- support launching custom trainer;
- init trainer components through .build() ([BC Breaking] Config System Refactor: TOML to Python Dataclass Registry #2386);
- move data to GPU by micro-batch;
- remove rescale_accumulated_loss (Fix loss computation by handling valid token imbalance in train loop #2206).
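The "move data to GPU by micro-batch" change can be sketched as below. This is an illustrative sketch only, not the actual torchtitan implementation; `iter_microbatches` and its signature are assumptions:

```python
import torch


def iter_microbatches(batch: torch.Tensor, micro_bs: int, device: torch.device):
    # Keep the full batch on the host; transfer one micro-batch at a time,
    # so peak device memory holds only a single micro-batch of inputs
    # during gradient accumulation.
    for start in range(0, batch.size(0), micro_bs):
        yield batch[start : start + micro_bs].to(device, non_blocking=True)


batch = torch.randn(8, 4)  # full batch stays on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
micros = list(iter_microbatches(batch, micro_bs=2, device=device))
assert len(micros) == 4 and micros[0].shape == (2, 4)
```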
torchtitan/train.py
Outdated
```python
# pyrefly: ignore [missing-attribute]
trainer = config.build()
if custom_trainer_class is not None:
    trainer = custom_trainer_class(config)
```
---
shouldn't need to modify this file?
---
The example trainer is initialized with `main(CustomTrainer)`:
https://github.com/hann-wang/torchtitan/blob/8c6fee127b689ca9a8bb52d2045be597542e8ebe/torchtitan/experiments/forge/example_train.py#L408
`config_registry` returns a `Trainer.Config` object, so `config.build()` does not produce the custom trainer we want. Is there a better way to initialize a custom trainer?
---
This raises a good point about the extensibility of these new Titan changes for other forms of training. I see a few options here:

1. If we're modifying this file, change the `main` function to take any class that satisfies a "Trainer" protocol (I think technically the only requirement for `main` is that the class can be built from a config and exposes `train` and `close` methods), defaulting to the `Trainer` from `torchtitan.trainer`. This logic is easier to follow than a `custom_trainer_class` that is used if provided and otherwise ignored in favor of a possibly unknown class: `def main(trainer: type[Trainer] = torchtitan.trainer.Trainer) ...`
2. If we're not ready to tackle the extensibility of torchtitan for other forms of training, copy the relevant logic of `main.py` to L408 of the forge example to make it work with the example. This should be straightforward.

Unless this needs to be unblocked immediately, I have a slight preference for something like option 1.
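A minimal sketch of option 1, assuming only the `train`/`close` protocol described above. All names here (`TrainerLike`, `DefaultTrainer`, the `main` signature) are hypothetical stand-ins, not the actual torchtitan API:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TrainerLike(Protocol):
    # Hypothetical protocol: main() only needs a class that can be
    # constructed from a config and exposes train() and close().
    def train(self) -> None: ...
    def close(self) -> None: ...


class DefaultTrainer:
    # Stand-in for torchtitan.trainer.Trainer; illustrative only.
    def __init__(self, config: dict) -> None:
        self.config = config
        self.closed = False
        self.steps = 0

    def train(self) -> None:
        self.steps += 1

    def close(self) -> None:
        self.closed = True


def main(config: dict, trainer_cls: type = DefaultTrainer) -> TrainerLike:
    # Any class satisfying the protocol can be passed in, replacing the
    # optional custom_trainer_class branch with a single code path.
    trainer = trainer_cls(config)
    assert isinstance(trainer, TrainerLike)
    try:
        trainer.train()
    finally:
        trainer.close()
    return trainer


t = main({"lr": 3e-4})
```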
---
> config_registry is returning a Trainer.Config object and config.build() is not the custom trainer we want. Is there a better way to init a custom trainer?
To use a custom trainer, you need to assemble a config whose root node is a custom trainer config instance.
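A minimal sketch of what such a config might look like, with hypothetical names (`TrainerConfig`, `CustomTrainerConfig`), not the actual torchtitan dataclass registry:

```python
from dataclasses import dataclass


class Trainer:
    # Stand-in for the stock trainer; illustrative only.
    def __init__(self, config) -> None:
        self.config = config


class CustomTrainer(Trainer):
    pass


@dataclass
class TrainerConfig:
    lr: float = 3e-4

    def build(self) -> Trainer:
        return Trainer(self)


@dataclass
class CustomTrainerConfig(TrainerConfig):
    # The root config node decides which trainer class gets built,
    # so config.build() yields the custom trainer directly.
    def build(self) -> Trainer:
        return CustomTrainer(self)


trainer = CustomTrainerConfig().build()
assert isinstance(trainer, CustomTrainer)
```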
> If we're not ready to tackle the extensibility of torchtitan for other forms of training, copy the relevant logic of main.py to L408 of the forge example to make it work with the example. This should be straightforward.
Yeah, I think we are not ready to tackle the extensibility challenge yet, so I'm OK with option 2 to favor separation of concerns for now. Sorry that I broke the forge example, but I'm curious why you'd use it anyway?
---
Going for option 2.
> Sorry that I broke the forge example, but curious why you'd use it anyway?
I have a custom trainer that integrates Quantization/Sparsification-Aware Training.