MPI Exit Handling #103

Open

mxkpp wants to merge 5 commits into development from maxkipp-mpi-exit-handling

Conversation


mxkpp commented Feb 18, 2026

This adds exit handling to the MpiConfig class so that the following conditions are all routed through the same cleanup routine: unhandled exceptions, normal exit (including sys.exit()), and MPI aborts. MPI aborts now go through a new wrapper method; the raw abort call has been removed and should no longer be used.

The existing BMI tmp file cleanup steps were moved into the new MPI exit handler's cleanup routine.
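
Conceptually, the wiring looks something like this (a minimal sketch with assumed method and attribute names, not the actual MpiConfig implementation):

```python
# Illustrative sketch only; names are assumptions, not the real MpiConfig code.
import atexit
import sys
from mpi4py import MPI

class MpiConfig:
    def __init__(self):
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        # Run cleanup on normal interpreter exit (including sys.exit()).
        atexit.register(self._cleanup)
        # Route unhandled exceptions through the same cleanup before aborting.
        sys.excepthook = self._on_unhandled_exception

    def _on_unhandled_exception(self, exc_type, exc_value, exc_tb):
        sys.__excepthook__(exc_type, exc_value, exc_tb)  # still print the traceback
        self.abort(1)

    def abort(self, code=1):
        """Wrapped MPI abort: clean up first, then abort the whole communicator."""
        self._cleanup()
        self.comm.Abort(code)

    def _cleanup(self):
        # e.g. remove BMI tmp files and other scratch artifacts
        pass
```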

Additions

  • MPI exit handling

Removals

  • Direct MPI abort (use the wrapped method going forward)

Changes

  • BMI tmp file cleanup steps moved out of BMI class and into MpiConfig class
  • BMI tmp file cleanup occurs on all ranks, not only rank 0
  • Uniqueness added to mesh file names

Testing

  1. Added __test_exit() and _test_exit() methods to the MpiConfig class to check various exit and crash conditions, occurring on rank 0 as well as non-zero ranks (see the sketch after this list).
  2. Ran calibrations and forecasts.
  3. Observed log debug messages.
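
A hypothetical sketch of what such a test hook might do (names and behavior are assumptions for illustration, not the actual implementation):

```python
# Deliberately trigger one of the exit paths on a chosen rank to exercise the handlers.
import sys
from mpi4py import MPI

class MpiConfigTestHooks:
    def __init__(self):
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()

    def _test_exit(self, fail_rank=0, mode="exception"):
        if self.rank != fail_rank:
            return
        if mode == "exception":
            raise RuntimeError(f"test crash on rank {self.rank}")  # unhandled-exception path
        if mode == "sys_exit":
            sys.exit(1)                                            # normal-exit path
        if mode == "abort":
            self.comm.Abort(1)                                     # MPI-abort path
```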

Screenshots

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Linux

mxkpp requested a review from kyle-larkin on February 18, 2026 03:45

mxkpp commented Feb 18, 2026

To keep concurrent jobs from conflicting during cleanup, this should not be merged until either the geogrid file name is made unique or _cleanup_geogrid is disabled.

Similarly, if the operating environment causes the scratch directory to be shared between concurrent jobs, this should not be merged until uniqueness is added there as well, since the current code deletes the contents of the scratch dir indiscriminately.

@kyle-larkin

I added the code that makes the filename unique (in config.py). I'm not sure whether the scratch dir cleaning will cause a problem, though it is shared. At this point I think it's mostly used for the temporary .nc files during regridding, and those files are highly ephemeral; I believe other usages have been removed.
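
Roughly along these lines (illustrative only, not the exact config.py change; the helper name is made up):

```python
# Hypothetical sketch of making the geogrid file name unique per job.
import os
import uuid

def unique_geogrid_path(scratch_dir, base_name="geogrid.nc"):
    prefix = uuid.uuid4().hex[:8]  # random per-job prefix
    return os.path.join(scratch_dir, f"{prefix}_{base_name}")
```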


mxkpp commented Feb 18, 2026

The new randomized prefix on the geogrid file name needs to be shared between the ranks rather than each rank generating its own random string. There is an existing shared (broadcast) random string within MpiConfig that can be used for this. But MpiConfig is instantiated after ConfigOptions, so this will require moving a few things around.
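
Something along these lines would share one prefix across all ranks (illustrative sketch; the actual attribute names in MpiConfig differ):

```python
# Generate the random prefix once on rank 0 and broadcast it to every rank.
from mpi4py import MPI
import uuid

comm = MPI.COMM_WORLD
prefix = uuid.uuid4().hex[:8] if comm.Get_rank() == 0 else None
prefix = comm.bcast(prefix, root=0)   # every rank now uses the same prefix
geogrid_name = f"{prefix}_geogrid.nc"
```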

mxkpp marked this pull request as ready for review on February 19, 2026 03:35

mxkpp commented Feb 19, 2026

Kyle, I just noticed that you had an atexit.unregister in the original introduction of the cleanup logic in PR #94, so I added that to this one too.
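
For context, the pattern looks roughly like this (illustrative sketch, not the actual code from #94):

```python
import atexit

def cleanup():
    print("removing tmp files")  # placeholder for the real cleanup steps

atexit.register(cleanup)    # normal-exit path runs cleanup automatically
# Later, if cleanup is performed explicitly (e.g. just before an abort):
cleanup()
atexit.unregister(cleanup)  # prevent the atexit hook from running the same cleanup again
```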

This is ready for review.

mxkpp requested review from kyle-larkin and removed the request for kyle-larkin on February 19, 2026 03:44

mxkpp commented Feb 19, 2026

If the scratch dir is a source of conflict for concurrent jobs using the same fileserver / disk, I think the manager of the jobs (ngenCERF, or otherwise) could provide a unique scratch dir name to each job. Regardless, this PR does not change the fundamental behavior of how the scratch dir is cleaned up; it adds more conditions that trigger that cleanup and allows any rank to perform it.

Eventually I would like to revisit the cleanup concept in the context of the more complex calibration modes (objective functions & optimization algorithms) that involve launching concurrent workers within one job, since that seems like a use case that could require a more precise cleanup.
