Add flexible training parameters and final eval 2 #730
base: master
Conversation
Pull request overview
This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, replacing or supplementing the traditional iteration-based training control. The changes allow users to specify when to stop training and when to evaluate based on token counts or epoch fractions rather than iteration counts. A final evaluation feature ensures that models are evaluated one last time after training completes.
Changes:
- Added command-line arguments for token-based and epoch-based training limits (`--max_tokens`, `--max_epochs`) and evaluation intervals (`--eval_interval_tokens`, `--eval_interval_epochs`)
- Implemented internal tracking and logic to compute when to evaluate and stop training based on these new limits
- Added optional final evaluation at the end of training with checkpoint saving capability
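As a rough sketch, the new flags might be declared as below. Only the flag names come from this PR; the defaults, types, and help strings here are assumptions for illustration, not the PR's actual `train_args.py` code.

```python
import argparse

# Hypothetical declaration of the new arguments; defaults and help text are guesses.
parser = argparse.ArgumentParser()
parser.add_argument('--max_tokens', type=int, default=None,
                    help='Stop training after this many tokens (overrides max_iters).')
parser.add_argument('--max_epochs', type=float, default=None,
                    help='Stop training after this many epochs (float accepted).')
parser.add_argument('--eval_interval_tokens', type=int, default=None,
                    help='Evaluate every N tokens (overrides eval_interval).')
parser.add_argument('--eval_interval_epochs', type=float, default=None,
                    help='Evaluate every N epochs (overrides eval_interval).')
parser.add_argument('--final_eval', action='store_true',
                    help='Run one final evaluation after training completes.')
parser.add_argument('--final_eval_save_checkpoint', action='store_true',
                    help='Save a checkpoint after the final evaluation.')

args = parser.parse_args(['--max_tokens', '1000000', '--final_eval'])
```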
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| train_args.py | Adds four new command-line arguments for flexible evaluation intervals (--eval_interval_tokens, --eval_interval_epochs) and training limits (--max_tokens, --max_epochs), plus flags for final evaluation (--final_eval, --final_eval_save_checkpoint) |
| train.py | Implements the core logic for token/epoch-based training control, including new instance variables, helper methods to compute dataset info and check training limits, and modifications to the training loop to respect these limits and perform final evaluation |
| explorations/training_limits.yaml | Provides example configurations demonstrating various combinations of token/epoch limits and evaluation intervals for different training scenarios |
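A hedged sketch of what `explorations/training_limits.yaml` might contain, assuming its keys mirror the new CLI flags; the file's actual structure is not shown in this review.

```yaml
# Hypothetical example configurations; key names assumed to match the new flags.
token_limited_run:
  max_tokens: 10000000
  eval_interval_tokens: 500000
  final_eval: true

epoch_limited_run:
  max_epochs: 2.5
  eval_interval_epochs: 0.25
  final_eval: true
  final_eval_save_checkpoint: true
```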
```python
is_eval_boundary = self._should_run_eval()

if is_eval_boundary and self.master_process:
    losses, num_steps_with_worse_loss = self.run_validation_step(
        running_mfu, current_epoch, current_dataset, num_steps_with_worse_loss, live
    )
    self.last_eval_iter = self.iter_num
    self._update_next_eval_tokens()
```
Copilot AI commented on Jan 21, 2026:
Potential issue with token-based training limits: When using max_tokens or max_epochs, the training can overshoot the limit by one iteration because the limit check happens after tokens are accumulated. For example, if max_tokens = 1000 and an eval runs at exactly 1000 tokens, the training will continue for one more iteration, accumulating additional tokens before the limit is checked. This means the final evaluation (if enabled) could run at a token count exceeding the specified limit. Consider checking the training limit before accumulating tokens in each iteration, or documenting this behavior if it's intentional.
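A toy model of the suggested fix, checking the limit before accumulating tokens each iteration; the function below is hypothetical, not code from the PR.

```python
def train_loop(max_tokens: int, tokens_per_iter: int):
    """Toy loop: check the limit BEFORE accumulating, so we never overshoot."""
    tokens_trained = 0
    iters = 0
    while True:
        # Stop if the next iteration would exceed the token budget.
        if tokens_trained + tokens_per_iter > max_tokens:
            break
        tokens_trained += tokens_per_iter  # stands in for one training iteration
        iters += 1
    return tokens_trained, iters

# With max_tokens=1000 and 300 tokens/iter, training stops at 900 tokens
# after 3 iterations rather than overshooting to 1200.
```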
```python
'--eval_interval_epochs',
default=None,
type=float,
help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',
```
Copilot AI commented on Jan 21, 2026:
The help text for --eval_interval_epochs says "cast to int tokens" which could be clearer. Consider rephrasing to "converted to token count then cast to int" to better explain that the epoch count is first multiplied by the dataset size to get a token count, then converted to int.
Suggested change:
```diff
-help='Evaluate every N epochs on the main dataset (float accepted, cast to int tokens). Overrides eval_interval.',
+help='Evaluate every N epochs on the main dataset (float accepted, converted to token count then cast to int). Overrides eval_interval.',
```
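The conversion the revised help text describes (epoch fraction times dataset size in tokens, then cast to int) can be illustrated with a small hypothetical helper; this is not the PR's actual code.

```python
def epochs_to_token_interval(eval_interval_epochs: float, dataset_size_tokens: int) -> int:
    """Multiply the epoch fraction by the dataset size, then cast to int tokens."""
    return int(eval_interval_epochs * dataset_size_tokens)

# e.g. evaluating every 0.25 epochs on a 10M-token dataset means
# evaluating every 2,500,000 tokens.
interval = epochs_to_token_interval(0.25, 10_000_000)
```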
```python
    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(
```
Copilot AI commented on Jan 21, 2026:
The variable `losses` is not used.
Suggested change:
```diff
-losses, num_steps_with_worse_loss = self.run_validation_step(
+_, num_steps_with_worse_loss = self.run_validation_step(
```
```python
    break

if self.args.final_eval and self.master_process and self.last_eval_iter != self.iter_num:
    losses, num_steps_with_worse_loss = self.run_validation_step(
```
Copilot AI commented on Jan 21, 2026:
The variable `num_steps_with_worse_loss` is not used.
Suggested change:
```diff
-losses, num_steps_with_worse_loss = self.run_validation_step(
+losses, _ = self.run_validation_step(
```
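The condition guarding the final evaluation in the snippet above can be read as a predicate: run the extra eval only on the master process, only if the flag is set, and only if the regular loop did not already evaluate on the very last iteration. A hypothetical standalone version (not the PR's code):

```python
def should_run_final_eval(final_eval: bool, master_process: bool,
                          last_eval_iter: int, iter_num: int) -> bool:
    """Skip the extra eval when the loop already evaluated at the final iteration."""
    return final_eval and master_process and last_eval_iter != iter_num
```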
This pull request introduces flexible training limits and evaluation intervals based on either tokens or epochs, enhancing the configurability of the training process. It adds new command-line arguments, updates the training logic to respect these new limits, and provides a YAML example for setting up various training scenarios. The changes allow users to specify when to evaluate and stop training based on tokens or epochs, and to perform a final evaluation with optional checkpoint saving.
Training limits and evaluation interval enhancements:
- Added the ability to limit training by tokens (`--max_tokens`) or epochs (`--max_epochs`) and evaluation intervals by tokens (`--eval_interval_tokens`) or epochs (`--eval_interval_epochs`), with logic to derive `max_iters` and `eval_interval` accordingly. [1] [2] [3]
- Added helpers in `train.py` to compute and track the main dataset, dataset size, and tokens trained, and to determine when to evaluate or stop training based on the new limits.

Final evaluation and checkpointing:
- Added `--final_eval` and `--final_eval_save_checkpoint` flags to optionally run a final validation and save a checkpoint at the end of training. Logic was added to ensure this happens only if the last evaluation was not at the final iteration. [1] [2]

Training loop and logic updates:
Configuration example:
- Added an `explorations/training_limits.yaml` file with example configurations demonstrating various combinations of token/epoch limits and evaluation intervals.

Initialization and setup: