MPI Exit Handling #103

Open

mxkpp wants to merge 5 commits into development from maxkipp-mpi-exit-handling

Conversation


mxkpp commented Feb 18, 2026

This adds exit handling to the MpiConfig class so that the following conditions are all routed through the same cleanup routine: unhandled exceptions, normal exit (including sys.exit()), and MPI aborts. MPI aborts now go through a new wrapper method; the raw abort call has been removed and should no longer be used.

The existing BMI tmp file cleanup steps were moved into the new MPI exit handler's cleanup routine.
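
Conceptually, the wiring looks something like this (a minimal sketch with assumed method and attribute names, not the actual MpiConfig implementation):

```python
# Illustrative sketch only; names are assumptions, not the real MpiConfig code.
import atexit
import sys
from mpi4py import MPI

class MpiConfig:
    def __init__(self):
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()
        # Run cleanup on normal interpreter exit (including sys.exit()).
        atexit.register(self._cleanup)
        # Route unhandled exceptions through the same cleanup before aborting.
        sys.excepthook = self._on_unhandled_exception

    def _on_unhandled_exception(self, exc_type, exc_value, exc_tb):
        sys.__excepthook__(exc_type, exc_value, exc_tb)  # still print the traceback
        self.abort(1)

    def abort(self, code=1):
        """Wrapped MPI abort: clean up first, then abort the whole communicator."""
        self._cleanup()
        self.comm.Abort(code)

    def _cleanup(self):
        # e.g. remove BMI tmp files and other scratch artifacts
        pass
```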

Additions

  • MPI exit handling

Removals

  • Direct MPI abort (use the wrapped method going forward)

Changes

  • BMI tmp file cleanup steps moved out of BMI class and into MpiConfig class
  • BMI tmp file cleanup occurs on all ranks, not only rank 0
  • Uniqueness added to mesh file names

Testing

  1. Added __test_exit() and _test_exit() methods to the MpiConfig class to check various exit and crash conditions, occurring on rank 0 as well as non-zero ranks (see the sketch after this list).
  2. Ran calibrations and forecasts.
  3. Observed log debug messages.
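
A hypothetical sketch of what such a test hook might do (names and behavior are assumptions for illustration, not the actual implementation):

```python
# Deliberately trigger one of the exit paths on a chosen rank to exercise the handlers.
import sys
from mpi4py import MPI

class MpiConfigTestHooks:
    def __init__(self):
        self.comm = MPI.COMM_WORLD
        self.rank = self.comm.Get_rank()

    def _test_exit(self, fail_rank=0, mode="exception"):
        if self.rank != fail_rank:
            return
        if mode == "exception":
            raise RuntimeError(f"test crash on rank {self.rank}")  # unhandled-exception path
        if mode == "sys_exit":
            sys.exit(1)                                            # normal-exit path
        if mode == "abort":
            self.comm.Abort(1)                                     # MPI-abort path
```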

Screenshots

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Visually tested in supported browsers and devices (see checklist below 👇)
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Target Environment support

  • Linux

mxkpp requested a review from kyle-larkin on February 18, 2026 03:45

mxkpp commented Feb 18, 2026

To keep concurrent jobs from conflicting during cleanup, this should not be merged until either the geogrid file name is made unique or _cleanup_geogrid is disabled.

Similarly, if the operating environment causes the scratch directory to be shared between concurrent jobs, this should not be merged until uniqueness is added there as well, since the current code deletes the contents of the scratch dir indiscriminately.

@kyle-larkin

I added the code that makes the filename unique (in config.py). I'm not sure whether the scratch dir cleaning will cause a problem, though it is shared. At this point I think it's mostly used for the temporary .nc files during regridding, and those files are highly ephemeral; I believe other usages have been removed.
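
Roughly along these lines (illustrative only, not the exact config.py change; the helper name is made up):

```python
# Hypothetical sketch of making the geogrid file name unique per job.
import os
import uuid

def unique_geogrid_path(scratch_dir, base_name="geogrid.nc"):
    prefix = uuid.uuid4().hex[:8]  # random per-job prefix
    return os.path.join(scratch_dir, f"{prefix}_{base_name}")
```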


mxkpp commented Feb 18, 2026

The new randomized prefix on the geogrid file name needs to be shared between the ranks rather than each rank generating its own random string. There is an existing shared (broadcast) random string within MpiConfig that can be used for this. But MpiConfig is instantiated after ConfigOptions, so this will require moving a few things around.
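
Something along these lines would share one prefix across all ranks (illustrative sketch; the actual attribute names in MpiConfig differ):

```python
# Generate the random prefix once on rank 0 and broadcast it to every rank.
from mpi4py import MPI
import uuid

comm = MPI.COMM_WORLD
prefix = uuid.uuid4().hex[:8] if comm.Get_rank() == 0 else None
prefix = comm.bcast(prefix, root=0)   # every rank now uses the same prefix
geogrid_name = f"{prefix}_geogrid.nc"
```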

mxkpp marked this pull request as ready for review on February 19, 2026 03:35

mxkpp commented Feb 19, 2026

Kyle, I just noticed that you had an atexit.unregister in the original introduction of the cleanup logic in PR #94, so I added that to this one too.
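
For context, the pattern looks roughly like this (illustrative sketch, not the actual code from #94):

```python
import atexit

def cleanup():
    print("removing tmp files")  # placeholder for the real cleanup steps

atexit.register(cleanup)    # normal-exit path runs cleanup automatically
# Later, if cleanup is performed explicitly (e.g. just before an abort):
cleanup()
atexit.unregister(cleanup)  # prevent the atexit hook from running the same cleanup again
```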

This is ready for review.

mxkpp requested review from kyle-larkin and removed the request for kyle-larkin on February 19, 2026 03:44

mxkpp commented Feb 19, 2026

If the scratch dir is a source of conflict for concurrent jobs using the same fileserver / disk, I think the manager of the jobs (ngenCERF, or otherwise) could provide a unique scratch dir name to each job. Regardless, this PR does not change the fundamental behavior of how the scratch dir is cleaned up; it adds more conditions that trigger that cleanup and allows any rank to perform it.

Eventually I would like to revisit the cleanup concept in the context of the more complex calibration modes (objective functions & optimization algorithms) that involve launching concurrent workers within one job, since that seems like a use case that could require a more precise cleanup.
