Conversation
|
To keep concurrent jobs from conflicting during cleanup, this should not be merged until either the geogrid file name is made unique, or until … Similarly, if the operating environment causes the scratch directory to be shared between concurrent jobs, this should not be merged until uniqueness is added there as well, since the current code deletes the contents of the scratch dir indiscriminately. |
|
I added the code that makes the filename unique (in config.py). I'm not sure whether the scratch dir cleaning is going to cause a problem, even though the directory is shared. I think at this point it's mostly used for the temporary .nc files during regridding, and those files are highly ephemeral. I think other usages have been removed. |
|
The new randomized prefix on the geogrid file name needs to be shared between the ranks rather than each rank receiving a separate random string. There is an existing shared (broadcast) random string within MpiConfig which can be used for this. However, MpiConfig is instantiated after ConfigOptions, so this will require moving a few things around. |
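For illustration, here is a minimal sketch of the "one random string shared by all ranks" idea. It is not the code in MpiConfig; `shared_prefix`, `geogrid_name`, `get_comm`, and `SingleProcessComm` are hypothetical names, and the base file name is a placeholder. The only mpi4py calls assumed are `Get_rank()` and `bcast()` on a communicator.

```python
import uuid


class SingleProcessComm:
    """Hypothetical stand-in for an mpi4py communicator when running without MPI."""

    def Get_rank(self):
        return 0

    def bcast(self, obj, root=0):
        # With a single process there is nobody to send to; just hand the value back.
        return obj


def get_comm():
    """Return MPI.COMM_WORLD if mpi4py is installed, else the single-process stand-in."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD
    except ImportError:
        return SingleProcessComm()


def shared_prefix(comm):
    """Draw the random string on rank 0 only, then broadcast it to every rank."""
    prefix = uuid.uuid4().hex[:8] if comm.Get_rank() == 0 else None
    return comm.bcast(prefix, root=0)


def geogrid_name(comm, base="geogrid.nc"):
    """Build a file name that is unique per job but identical on all ranks of that job."""
    return f"{shared_prefix(comm)}_{base}"
```

Because only rank 0 generates the string, every rank in the job ends up with the same file name, while two concurrent jobs would get different ones.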
… apply to geogrid file name.
|
Kyle, I just noticed that you had an atexit.unregister in the original introduction of the cleanup logic in PR #94, so I added that to this one too. This is ready for review. |
|
If the scratch dir is a conflict for concurrent jobs using the same fileserver / disk, I think the manager of the jobs (ngenCERF, or otherwise) could provide unique scratch dir names to each job. Regardless, this PR does not change the fundamental behavior of how the scratch dir is cleaned up, but it does add more conditions that trigger that event, and allows any rank to do the cleanup. Eventually I would like to revisit the cleanup concept in the context of the more complex calibration modes (objective functions & optimization algorithms) that involve launching concurrent workers within one job, since that seems like a use case that could require a more precise cleanup. |
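As a rough sketch of what "unique scratch dir names per job" could look like on the job manager side (an assumption for illustration, not something this PR implements; `make_job_scratch_dir`, `base_dir`, and `job_id` are hypothetical):

```python
import os
import tempfile


def make_job_scratch_dir(base_dir, job_id):
    """Create and return a scratch directory that no other job will share."""
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp appends a random suffix and creates the directory atomically,
    # so two jobs started with the same job_id still get separate directories.
    return tempfile.mkdtemp(prefix=f"job{job_id}_", dir=base_dir)
```

Each job would then be pointed at its own directory, so a cleanup that empties the scratch dir only touches that job's files.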
This adds exit handling to the MpiConfig class for treating the following conditions the same way via a cleanup routine: unhandled exceptions, normal exit (and sys.exit()), and MPI aborts. MPI aborts have a new wrapper method -- raw abort is removed and should no longer be used. The existing BMI tmp file cleanup steps were moved into the new MPI exit handler cleanup steps.
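The general pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual MpiConfig code: `ExitHandler`, `run_cleanup`, and the `abort` callable are hypothetical names. Under mpi4py the `abort` argument would presumably be something like `MPI.COMM_WORLD.Abort`.

```python
import atexit
import sys


class ExitHandler:
    """Hypothetical sketch: route every exit path through one cleanup routine."""

    def __init__(self, cleanup, abort=None):
        self._cleanup = cleanup
        self._abort = abort            # assumed to be e.g. MPI.COMM_WORLD.Abort
        self._done = False
        self._prev_hook = sys.excepthook
        atexit.register(self.run_cleanup)     # normal exit and sys.exit()
        sys.excepthook = self._on_exception   # unhandled exceptions

    def run_cleanup(self):
        # Guard so the cleanup runs at most once, whichever path fires first.
        if self._done:
            return
        self._done = True
        self._cleanup()

    def _on_exception(self, exc_type, exc, tb):
        # Print the traceback as usual, then take the abort path.
        self._prev_hook(exc_type, exc, tb)
        self.abort(1)

    def abort(self, code=1):
        """Wrapper to call instead of a raw abort: clean up first, then abort."""
        self.run_cleanup()
        # The cleanup already ran, so drop the atexit registration.
        atexit.unregister(self.run_cleanup)
        if self._abort is not None:
            self._abort(code)
```

Any rank can call `abort()`, and the cleanup runs on that rank before the abort is issued, which is the reason for wrapping the abort rather than calling it directly.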
Additions
Removals
Changes
MpiConfig class
Testing
__test_exit() and _test_exit() methods to the MpiConfig class for checking various exit and crash conditions -- occurring on rank 0 as well as non-0 ranks.
Screenshots
Notes
Todos
Checklist
Testing checklist
Target Environment support