klei22 commented Jan 21, 2026

This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, enhancing the configurability of the training process. It adds new command-line arguments, updates the training logic to respect these new limits, and provides a YAML example for setting up various training scenarios. The changes allow users to specify when to evaluate and stop training based on tokens or epochs, and to perform a final evaluation with optional checkpoint saving.

Training limits and evaluation interval enhancements:

  • Added support for specifying training limits by tokens (--max_tokens) or epochs (--max_epochs) and evaluation intervals by tokens (--eval_interval_tokens) or epochs (--eval_interval_epochs), with logic to derive max_iters and eval_interval accordingly.
  • Implemented internal methods in train.py to identify the main dataset, compute its size, track the number of tokens trained, and determine when to evaluate or stop training based on the new limits.
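The derivation described above (token/epoch limits converted into max_iters and eval_interval) could look roughly like this; a minimal sketch assuming a fixed number of tokens consumed per optimizer step, with illustrative names rather than the PR's actual helpers:

```python
import math

def derive_iteration_limits(tokens_per_iter, dataset_size,
                            max_tokens=None, max_epochs=None,
                            eval_interval_tokens=None, eval_interval_epochs=None):
    """Convert token/epoch limits into iteration counts (sketch).

    tokens_per_iter: tokens consumed per optimizer step
    dataset_size:    total tokens in the main dataset (one epoch)
    """
    # Epoch-based limits are first converted into token counts.
    if max_epochs is not None:
        max_tokens = int(max_epochs * dataset_size)
    if eval_interval_epochs is not None:
        eval_interval_tokens = int(eval_interval_epochs * dataset_size)

    # Token counts are then rounded up to whole iterations.
    max_iters = (math.ceil(max_tokens / tokens_per_iter)
                 if max_tokens is not None else None)
    eval_interval = (max(1, math.ceil(eval_interval_tokens / tokens_per_iter))
                     if eval_interval_tokens is not None else None)
    return max_iters, eval_interval
```

Rounding up means the run covers at least the requested number of tokens; rounding down would instead guarantee the limit is never exceeded.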

Final evaluation and checkpointing:

  • Added --final_eval and --final_eval_save_checkpoint flags to optionally run a final validation and save a checkpoint at the end of training. Logic was added to ensure this happens only if the last evaluation was not at the final iteration.

Training loop and logic updates:

  • Updated the training loop to use the new evaluation and stopping criteria, including correct calculation of remaining evaluations and proper handling of early stopping and final evaluation.
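The interplay of the eval boundary, the stopping limit, and the final-eval guard can be simulated with a toy loop; all names here are illustrative stand-ins for the trainer's bookkeeping, not the PR's actual attributes:

```python
class LoopState:
    """Minimal stand-in for the trainer's token/eval bookkeeping (names illustrative)."""
    def __init__(self, tokens_per_iter, max_tokens, eval_interval_tokens):
        self.tokens_per_iter = tokens_per_iter
        self.max_tokens = max_tokens
        self.eval_interval_tokens = eval_interval_tokens
        self.next_eval_tokens = eval_interval_tokens
        self.tokens_trained = 0
        self.iter_num = 0
        self.eval_iters = []      # iterations at which validation ran
        self.last_eval_iter = -1

def run_loop(state):
    while True:
        # Evaluate whenever the next token boundary has been reached.
        if state.tokens_trained >= state.next_eval_tokens:
            state.eval_iters.append(state.iter_num)   # run_validation_step(...)
            state.last_eval_iter = state.iter_num
            state.next_eval_tokens += state.eval_interval_tokens
        # Stop once the token limit is reached.
        if state.tokens_trained >= state.max_tokens:
            break
        state.tokens_trained += state.tokens_per_iter  # one training step
        state.iter_num += 1
    # Final eval is only needed if the last eval did not land on the final iteration.
    final_eval_needed = state.last_eval_iter != state.iter_num
    return state.eval_iters, final_eval_needed
```

With a large eval interval the loop terminates without ever evaluating, and the final-eval guard correctly reports that one last validation is still owed.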

Configuration example:

  • Added a new explorations/training_limits.yaml file with example configurations demonstrating various combinations of token/epoch limits and evaluation intervals.
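The contents of explorations/training_limits.yaml are not shown in this conversation; the following is a hypothetical illustration of the kinds of combinations the bullet describes, with keys mirroring the new CLI flags (the actual file's structure may differ):

```yaml
# Hypothetical examples; real keys and layout may differ from the PR's file.

# Stop after 10M tokens, evaluating every 1M tokens:
- max_tokens: 10000000
  eval_interval_tokens: 1000000
  final_eval: true

# Stop after 3 epochs, evaluating every half epoch, saving a final checkpoint:
- max_epochs: 3
  eval_interval_epochs: 0.5
  final_eval: true
  final_eval_save_checkpoint: true
```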

Initialization and setup:

  • Ensured that training limits are configured during setup, before loading the tokenizer and starting training.

Copilot AI left a comment

Pull request overview

This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, replacing or supplementing the traditional iteration-based training control. The changes allow users to specify when to stop training and when to evaluate based on token counts or epoch fractions rather than iteration counts. A final evaluation feature ensures that models are evaluated one last time after training completes.

Changes:

  • Added command-line arguments for token-based and epoch-based training limits (--max_tokens, --max_epochs) and evaluation intervals (--eval_interval_tokens, --eval_interval_epochs)
  • Implemented internal tracking and logic to compute when to evaluate and stop training based on these new limits
  • Added optional final evaluation at the end of training with checkpoint saving capability

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File descriptions:

  • train_args.py: Adds four new command-line arguments for flexible evaluation intervals (--eval_interval_tokens, --eval_interval_epochs) and training limits (--max_tokens, --max_epochs), plus flags for final evaluation (--final_eval, --final_eval_save_checkpoint)
  • train.py: Implements the core logic for token/epoch-based training control, including new instance variables, helper methods to compute dataset info and check training limits, and modifications to the training loop to respect these limits and perform final evaluation
  • explorations/training_limits.yaml: Provides example configurations demonstrating various combinations of token/epoch limits and evaluation intervals for different training scenarios
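Based on the descriptions above, the new train_args.py flags might be registered roughly as follows; flag names are taken from the PR, but defaults, types, and help strings are assumptions:

```python
import argparse

def add_training_limit_args(parser: argparse.ArgumentParser) -> None:
    """Sketch of the six new flags described above (defaults/help assumed)."""
    parser.add_argument('--max_tokens', default=None, type=int,
                        help='Stop training after this many tokens.')
    parser.add_argument('--max_epochs', default=None, type=float,
                        help='Stop training after this many epochs of the main dataset.')
    parser.add_argument('--eval_interval_tokens', default=None, type=int,
                        help='Evaluate every N tokens. Overrides eval_interval.')
    parser.add_argument('--eval_interval_epochs', default=None, type=float,
                        help='Evaluate every N epochs (float accepted, converted to a '
                             'token count then cast to int). Overrides eval_interval.')
    parser.add_argument('--final_eval', action='store_true',
                        help='Run one final validation at the end of training.')
    parser.add_argument('--final_eval_save_checkpoint', action='store_true',
                        help='Save a checkpoint after the final evaluation.')

parser = argparse.ArgumentParser()
add_training_limit_args(parser)
args = parser.parse_args(['--max_epochs', '1.5', '--final_eval'])
```

Unset limits default to None so the training code can fall back to the existing iteration-based max_iters / eval_interval behavior.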


Comment on lines +2041 to +2049
is_eval_boundary = self._should_run_eval()

if is_eval_boundary and self.master_process:
    losses, num_steps_with_worse_loss = self.run_validation_step(
        running_mfu, current_epoch, current_dataset, num_steps_with_worse_loss, live
    )
    self.last_eval_iter = self.iter_num
    self._update_next_eval_tokens()

Copilot AI Jan 21, 2026

Potential issue with token-based training limits: When using max_tokens or max_epochs, the training can overshoot the limit by one iteration because the limit check happens after tokens are accumulated. For example, if max_tokens = 1000 and an eval runs at exactly 1000 tokens, the training will continue for one more iteration, accumulating additional tokens before the limit is checked. This means the final evaluation (if enabled) could run at a token count exceeding the specified limit. Consider checking the training limit before accumulating tokens in each iteration, or documenting this behavior if it's intentional.
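The pre-check this comment suggests could look like the sketch below: the step is refused once one more iteration would exceed the limit, so the run never overshoots. Field and function names are illustrative, not the PR's actual attributes:

```python
from dataclasses import dataclass

@dataclass
class TrainState:
    """Illustrative stand-in for the trainer's bookkeeping fields."""
    tokens_per_iter: int
    max_tokens: int
    tokens_trained: int = 0
    iter_num: int = 0

def train_step_with_pre_check(state: TrainState) -> bool:
    """Return False (and do nothing) once one more step would exceed max_tokens."""
    if state.tokens_trained + state.tokens_per_iter > state.max_tokens:
        return False
    # ... forward/backward pass would go here ...
    state.tokens_trained += state.tokens_per_iter
    state.iter_num += 1
    return True
```

The trade-off: pre-checking guarantees tokens_trained never exceeds max_tokens, but may stop slightly short of it when the limit is not a multiple of tokens_per_iter; post-checking (the PR's current behavior, per this comment) does the opposite.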

'--eval_interval_epochs',
default=None,
type=float,
help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',

Copilot AI Jan 21, 2026

The help text for --eval_interval_epochs says "cast to int tokens" which could be clearer. Consider rephrasing to "converted to token count then cast to int" to better explain that the epoch count is first multiplied by the dataset size to get a token count, then converted to int.

Suggested change
help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',
help='Evaluate every N epochs on the main dataset (float accepted, converted to token count then cast to int). Overrides eval_interval.',

    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(

Copilot AI Jan 21, 2026

Variable losses is not used.

Suggested change
losses, num_steps_with_worse_loss = self.run_validation_step(
_, num_steps_with_worse_loss = self.run_validation_step(

    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(

Copilot AI Jan 21, 2026

Variable num_steps_with_worse_loss is not used.

Suggested change
losses, num_steps_with_worse_loss = self.run_validation_step(
losses, _ = self.run_validation_step(
