
Gracefully shutdown when loss is NaN #850

Draft

Arcomano1234 wants to merge 1 commit into main from fix/raise-loss-nan-error

Conversation

@Arcomano1234 (Contributor) commented Feb 18, 2026

We raise an exception when the loss is NaN; this check is done locally on each rank (see _validate_loss in fme/core/optimization.py). However, the loss may be NaN on only a subset of the ranks. When the error is not raised on all ranks, the remaining ranks call torch.backward, which performs a collective all-reduce, and those GPUs hang until a NCCL timeout occurs.

Changes:

  • Add a distributed shutdown to _validate_loss in fme/core/optimization.py when training is distributed (a sketch of one possible approach follows this list)

  • Add tests
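
For illustration, here is a minimal sketch of how the cross-rank check could be coordinated, assuming the NaN flag is shared with an all_reduce so that every rank raises before calling backward. The function body below is illustrative only, not the actual code in fme/core/optimization.py:

```python
import torch
import torch.distributed as dist


def _validate_loss(loss: torch.Tensor) -> None:
    """Raise on every rank if any rank observed a NaN loss."""
    # 1.0 if this rank's loss contains a NaN, else 0.0, kept on loss.device
    # so the collective works with NCCL as well as gloo.
    nan_flag = torch.isnan(loss.detach()).any().float().reshape(1)
    if dist.is_available() and dist.is_initialized():
        # The all_reduce is itself a collective, but every rank reaches it
        # before backward(), so all ranks see the same flag and either all
        # raise or all continue; no rank is left waiting on the others.
        dist.all_reduce(nan_flag, op=dist.ReduceOp.MAX)
    if nan_flag.item() > 0:
        raise ValueError("Loss is NaN on at least one rank; shutting down.")
```

The key design point is that the synchronization happens before the backward call: a rank that detects the NaN only locally cannot otherwise stop its peers from entering the backward all-reduce and hanging until the NCCL timeout.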

Resolves #849 (NaN loss not logged properly)
