
[trainer] update time format & fix resume from checkpoint train_speed #8007

Merged
Jintao-Huang merged 2 commits into modelscope:main from Jintao-Huang:update_time_update on Feb 7, 2026

Conversation

@Jintao-Huang
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the accuracy and precision of training progress metrics within the trainer module. It resolves an issue where training speed and estimated remaining time were incorrectly reported when training was resumed from a checkpoint. Additionally, the changes enhance the granularity of time displays in logs, providing a more detailed and accurate representation of training duration.

Highlights

  • Accurate Training Speed and Remaining Time Calculation: The calculation for train_speed and remaining_time has been updated to correctly account for training resumed from a checkpoint. This is achieved by introducing a start_step parameter, ensuring that the speed is calculated based on the iterations completed since the start of the current training run, not the global step count from the very beginning.
  • Change in Training Speed Metric: The train_speed metric has been redefined from 'iterations per second' (iter/s) to 'seconds per iteration' (s/it), so it now reports the average wall-clock time taken by each training step.
  • Enhanced Time Formatting Precision: The format_time utility function now displays the seconds component with two decimal places, offering more precise reporting of elapsed and remaining times in the logs.
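
To illustrate why the start_step offset matters, here is a minimal sketch of the calculation described above. The function name, its signature, and the remaining-time formula (steps left multiplied by train_speed) are assumptions made for illustration; they are not copied from swift/trainers/patcher.py.

```python
def train_speed_and_remaining(elapsed, global_step, max_steps, start_step=0):
    """Return (seconds per iteration, estimated remaining seconds).

    Hypothetical helper; assumes at least one step has completed
    since start_step, so the division is safe.
    """
    steps_done = global_step - start_step  # steps taken in this run only
    train_speed = elapsed / steps_done     # s/it
    remaining = (max_steps - global_step) * train_speed
    return train_speed, remaining


# Hypothetical run: resumed at step 1000, now at step 1100 after 200 s
# of wall-clock time, with 2000 steps in total.
resumed = train_speed_and_remaining(200.0, 1100, 2000, start_step=1000)
naive = train_speed_and_remaining(200.0, 1100, 2000)  # ignores the resume point

print(resumed)  # (2.0, 1800.0)
print(naive)    # far smaller s/it, so the remaining time is underestimated
```

Dividing by the global step count spreads 200 s over 1100 steps, although only 100 of them ran in this session, which is the kind of misreport the start_step parameter is meant to avoid.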


Changelog
  • swift/trainers/patcher.py
    • Modified the add_train_message function to accept a start_step argument, which is used to calculate train_speed and remaining_time more accurately when resuming training.
    • Updated the train_speed calculation to elapsed / (state.global_step - start_step) and changed its log key from 'train_speed(iter/s)' to 'train_speed(s/it)'.
    • Adjusted the remaining_time calculation to utilize the newly computed train_speed.
    • Introduced self.start_step = state.global_step in the on_train_begin methods of ProgressCallbackNew and PrinterCallbackNew to capture the starting global step for accurate speed calculation.
    • Passed self.start_step to add_train_message in the on_log methods of ProgressCallbackNew and PrinterCallbackNew.
  • swift/utils/utils.py
    • Modified the format_time function to round the seconds component to two decimal places (round(seconds % 60, 2)) instead of truncating it to an integer, providing more precise time formatting.
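
The effect of the format_time change can be sketched as follows. Only the expression round(seconds % 60, 2) comes from the changelog; the function name format_time_sketch, the precise flag, and the 'd/h/m/s' output layout are assumptions for illustration and may differ from swift/utils/utils.py.

```python
def format_time_sketch(seconds: float, precise: bool = True) -> str:
    """Format a duration in seconds as days/hours/minutes/seconds.

    Hypothetical stand-in for format_time. With precise=True the seconds
    component is rounded to two decimals; otherwise it is truncated.
    """
    days = int(seconds // 86400)
    hours = int((seconds % 86400) // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = round(seconds % 60, 2) if precise else int(seconds % 60)
    if days > 0:
        return f'{days}d {hours}h {minutes}m {secs}s'
    if hours > 0:
        return f'{hours}h {minutes}m {secs}s'
    if minutes > 0:
        return f'{minutes}m {secs}s'
    return f'{secs}s'


print(format_time_sketch(65.5))                 # 1m 5.5s
print(format_time_sketch(65.5, precise=False))  # 1m 5s
print(format_time_sketch(3725.25))              # 1h 2m 5.25s
```

The extra precision is most visible for short durations, such as a per-step time of a few seconds, where truncating to an integer hides most of the value.
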
Activity
  • No specific activity (comments, reviews, etc.) has been recorded for this pull request yet.

@gemini-code-assist
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request updates the training time format and fixes an issue with calculating training speed when resuming from a checkpoint. The changes in swift/trainers/patcher.py correctly adjust the train_speed and remaining_time calculations by considering the start_step. The change in swift/utils/utils.py increases the precision of the formatted time, which is a good improvement. My review includes one suggestion to prevent a potential ZeroDivisionError in the training speed calculation for added robustness. Overall, the changes are well-implemented and address the intended purpose.

    logs['percentage'] = f'{train_percentage * 100:.2f}%'
    elapsed = time.time() - start_time
    logs['elapsed_time'] = format_time(elapsed)
    train_speed = elapsed / (state.global_step - start_step)

high

There's a potential for a ZeroDivisionError here. If on_log is called when state.global_step is equal to start_step (e.g., immediately after resuming from a checkpoint before the first training step is taken), this will raise an exception. It's safer to guard against this division by zero for robustness.

    steps_done = state.global_step - start_step
    train_speed = elapsed / steps_done if steps_done > 0 else 0.0
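
As a quick check of the suggestion, the guard can be wrapped in a small function and compared with the unguarded division. The name safe_train_speed and the sample values are hypothetical; only the guarded expression comes from the suggestion above.

```python
def safe_train_speed(elapsed, global_step, start_step):
    """Seconds per iteration, or 0.0 if no step has completed in this run."""
    steps_done = global_step - start_step
    return elapsed / steps_done if steps_done > 0 else 0.0


# Logging right after resuming at step 1000, before any new step:
print(safe_train_speed(5.0, 1000, 1000))   # 0.0

# Five steps later, after 10 s of wall-clock time:
print(safe_train_speed(10.0, 1005, 1000))  # 2.0

# The unguarded expression fails in the first case.
try:
    5.0 / (1000 - 1000)
except ZeroDivisionError:
    print('unguarded division raises ZeroDivisionError')
```

One trade-off to keep in mind: if the remaining time is derived by multiplying the steps left by train_speed, a 0.0 fallback would also report zero remaining time for that first log entry, so skipping those two fields until a step has completed may be a reasonable alternative.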

@Jintao-Huang merged commit 4451f2a into modelscope:main on Feb 7, 2026
1 of 3 checks passed
