
Gracefully shutdown when loss is NaN #850

Draft

Arcomano1234 wants to merge 1 commit into main from fix/raise-loss-nan-error

Conversation

@Arcomano1234 (Contributor) commented Feb 18, 2026

We raise an exception when the loss is NaN; this check is done locally on each rank (see _validate_loss in fme/core/optimization.py). However, the loss may be NaN on only a subset of the ranks. When the error is not raised on all ranks, the remaining ranks call torch.backward, which performs a collective all-reduce, and those GPUs hang until a NCCL timeout occurs.

Changes:

  • Add a distributed shutdown to _validate_loss in fme/core/optimization.py when training is distributed (a sketch of one possible approach follows this list)

  • Add tests
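
For illustration, here is a minimal sketch of how the cross-rank check could be coordinated, assuming the NaN flag is shared with an all_reduce so that every rank raises before calling backward. The function body below is illustrative only, not the actual code in fme/core/optimization.py:

```python
import torch
import torch.distributed as dist


def _validate_loss(loss: torch.Tensor) -> None:
    """Raise on every rank if any rank observed a NaN loss."""
    # 1.0 if this rank's loss contains a NaN, else 0.0, kept on loss.device
    # so the collective works with NCCL as well as gloo.
    nan_flag = torch.isnan(loss.detach()).any().float().reshape(1)
    if dist.is_available() and dist.is_initialized():
        # The all_reduce is itself a collective, but every rank reaches it
        # before backward(), so all ranks see the same flag and either all
        # raise or all continue; no rank is left waiting on the others.
        dist.all_reduce(nan_flag, op=dist.ReduceOp.MAX)
    if nan_flag.item() > 0:
        raise ValueError("Loss is NaN on at least one rank; shutting down.")
```

The key design point is that the synchronization happens before the backward call: a rank that detects the NaN only locally cannot otherwise stop its peers from entering the backward all-reduce and hanging until the NCCL timeout.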

Resolves #849 (NaN loss not logged properly)
