Add flexible training parameters and final eval 2 #730
base: master
Conversation
Pull request overview
This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, replacing or supplementing the traditional iteration-based training control. The changes allow users to specify when to stop training and when to evaluate based on token counts or epoch fractions rather than iteration counts. A final evaluation feature ensures that models are evaluated one last time after training completes.
Changes:
- Added command-line arguments for token-based and epoch-based training limits (`--max_tokens`, `--max_epochs`) and evaluation intervals (`--eval_interval_tokens`, `--eval_interval_epochs`)
- Implemented internal tracking and logic to compute when to evaluate and stop training based on these new limits
- Added optional final evaluation at the end of training with checkpoint saving capability
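As a rough sketch, the new flags might be declared as below. Only the flag names come from this PR; the defaults, types, and help strings here are assumptions for illustration, not the PR's actual `train_args.py` code.

```python
import argparse

# Hypothetical declaration of the new arguments; defaults and help text are guesses.
parser = argparse.ArgumentParser()
parser.add_argument('--max_tokens', type=int, default=None,
                    help='Stop training after this many tokens (overrides max_iters).')
parser.add_argument('--max_epochs', type=float, default=None,
                    help='Stop training after this many epochs (float accepted).')
parser.add_argument('--eval_interval_tokens', type=int, default=None,
                    help='Evaluate every N tokens (overrides eval_interval).')
parser.add_argument('--eval_interval_epochs', type=float, default=None,
                    help='Evaluate every N epochs (overrides eval_interval).')
parser.add_argument('--final_eval', action='store_true',
                    help='Run one final evaluation after training completes.')
parser.add_argument('--final_eval_save_checkpoint', action='store_true',
                    help='Save a checkpoint after the final evaluation.')

args = parser.parse_args(['--max_tokens', '1000000', '--final_eval'])
```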
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| train_args.py | Adds four new command-line arguments for flexible evaluation intervals (--eval_interval_tokens, --eval_interval_epochs) and training limits (--max_tokens, --max_epochs), plus flags for final evaluation (--final_eval, --final_eval_save_checkpoint) |
| train.py | Implements the core logic for token/epoch-based training control, including new instance variables, helper methods to compute dataset info and check training limits, and modifications to the training loop to respect these limits and perform final evaluation |
| explorations/training_limits.yaml | Provides example configurations demonstrating various combinations of token/epoch limits and evaluation intervals for different training scenarios |
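A hedged sketch of what `explorations/training_limits.yaml` might contain, assuming its keys mirror the new CLI flags; the file's actual structure is not shown in this review.

```yaml
# Hypothetical example configurations; key names assumed to match the new flags.
token_limited_run:
  max_tokens: 10000000
  eval_interval_tokens: 500000
  final_eval: true

epoch_limited_run:
  max_epochs: 2.5
  eval_interval_epochs: 0.25
  final_eval: true
  final_eval_save_checkpoint: true
```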
```python
is_eval_boundary = self._should_run_eval()

if is_eval_boundary and self.master_process:
    losses, num_steps_with_worse_loss = self.run_validation_step(
        running_mfu, current_epoch, current_dataset, num_steps_with_worse_loss, live
    )
    self.last_eval_iter = self.iter_num
    self._update_next_eval_tokens()
```
Copilot AI commented on Jan 21, 2026:
Potential issue with token-based training limits: When using max_tokens or max_epochs, the training can overshoot the limit by one iteration because the limit check happens after tokens are accumulated. For example, if max_tokens = 1000 and an eval runs at exactly 1000 tokens, the training will continue for one more iteration, accumulating additional tokens before the limit is checked. This means the final evaluation (if enabled) could run at a token count exceeding the specified limit. Consider checking the training limit before accumulating tokens in each iteration, or documenting this behavior if it's intentional.
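A toy model of the suggested fix, checking the limit before accumulating tokens each iteration; the function below is hypothetical, not code from the PR.

```python
def train_loop(max_tokens: int, tokens_per_iter: int):
    """Toy loop: check the limit BEFORE accumulating, so we never overshoot."""
    tokens_trained = 0
    iters = 0
    while True:
        # Stop if the next iteration would exceed the token budget.
        if tokens_trained + tokens_per_iter > max_tokens:
            break
        tokens_trained += tokens_per_iter  # stands in for one training iteration
        iters += 1
    return tokens_trained, iters

# With max_tokens=1000 and 300 tokens/iter, training stops at 900 tokens
# after 3 iterations rather than overshooting to 1200.
```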
```python
'--eval_interval_epochs',
default=None,
type=float,
help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',
```
Copilot AI commented on Jan 21, 2026:
The help text for --eval_interval_epochs says "cast to int tokens" which could be clearer. Consider rephrasing to "converted to token count then cast to int" to better explain that the epoch count is first multiplied by the dataset size to get a token count, then converted to int.
Suggested change:
```diff
-help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',
+help='Evaluate every N epochs on the main dataset (float accepted, converted to token count then cast to int). Overrides eval_interval.',
```
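The conversion the revised help text describes (epoch fraction times dataset size in tokens, then cast to int) can be illustrated with a small hypothetical helper; this is not the PR's actual code.

```python
def epochs_to_token_interval(eval_interval_epochs: float, dataset_size_tokens: int) -> int:
    """Multiply the epoch fraction by the dataset size, then cast to int tokens."""
    return int(eval_interval_epochs * dataset_size_tokens)

# e.g. evaluating every 0.25 epochs on a 10M-token dataset means
# evaluating every 2,500,000 tokens.
interval = epochs_to_token_interval(0.25, 10_000_000)
```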
```python
    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(
```
Copilot AI commented on Jan 21, 2026:
The variable `losses` is not used.
Suggested change:
```diff
-losses, num_steps_with_worse_loss = self.run_validation_step(
+_, num_steps_with_worse_loss = self.run_validation_step(
```
```python
    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(
```
Copilot AI commented on Jan 21, 2026:
The variable `num_steps_with_worse_loss` is not used.
Suggested change:
```diff
-losses, num_steps_with_worse_loss = self.run_validation_step(
+losses, _ = self.run_validation_step(
```
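The condition guarding the final evaluation in the snippet above can be read as a predicate: run the extra eval only on the master process, only if the flag is set, and only if the regular loop did not already evaluate on the very last iteration. A hypothetical standalone version (not the PR's code):

```python
def should_run_final_eval(final_eval: bool, master_process: bool,
                          last_eval_iter: int, iter_num: int) -> bool:
    """Skip the extra eval when the loop already evaluated at the final iteration."""
    return final_eval and master_process and last_eval_iter != iter_num
```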
This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, enhancing the configurability of the training process. It adds new command-line arguments, updates the training logic to respect these new limits, and provides a YAML example for setting up various training scenarios. The changes allow users to specify when to evaluate and stop training based on tokens or epochs, and to perform a final evaluation with optional checkpoint saving.
Training limits and evaluation interval enhancements:
- Added the ability to limit training by tokens (`--max_tokens`) or epochs (`--max_epochs`) and evaluation intervals by tokens (`--eval_interval_tokens`) or epochs (`--eval_interval_epochs`), with logic to derive `max_iters` and `eval_interval` accordingly. [1] [2] [3]
- Added helpers in `train.py` to compute and track the main dataset, dataset size, and tokens trained, and to determine when to evaluate or stop training based on the new limits.

Final evaluation and checkpointing:
- Added `--final_eval` and `--final_eval_save_checkpoint` flags to optionally run a final validation and save a checkpoint at the end of training. Logic was added to ensure this happens only if the last evaluation was not at the final iteration. [1] [2]

Training loop and logic updates:
Configuration example:
- Added an `explorations/training_limits.yaml` file with example configurations demonstrating various combinations of token/epoch limits and evaluation intervals.

Initialization and setup: