Execute GEOS via calling gcm_run.j -- part 1 #677
Conversation
I really would like to get this PR in soon, and I'm aware other PRs are piling up. For it to function and pass code tests etc. it will require a part 2, but I will work on that once I address potential reviewer comments here. This particular PR aims to achieve a few main things, so if you are reviewing please focus mainly on those aspects:
At the risk of repeating myself (for more details, read the description of this PR): Ricardo, Maryam, and I had a long discussion last week on executing GEOS in SWELL. It boils down to these two ways:
I am exclusively switching to method 1; @mer-a-o voiced a desire to implement some type of method 2 (of which she has a Skylab application). @rtodling is not fully convinced either way, but is leaning towards 1. I'm leaving the second option intact for now to help @mer-a-o choose between 1 & 2. Unfortunately, I'm not able to run GEOS v12rc** using method 2 on Milan nodes, so approach 2 won't work for GEOSv12** for now (most likely due to using

Anyways, happy to have a dedicated meeting on this, maybe even include @tclune. There are just so many different modes of running `gcm_run.j` that IMO method 2 makes it hard for SWELL to support multiple versions of GEOSgcm.
@Dooruk Happy to meet when convenient, but let's try to involve Matt and Shayon as they have much more experience with the details here.
@Dooruk I tried running
Sorry, can you try with this branch:
Sure, I will bring this up in our dev meeting. |
mer-a-o
left a comment
Sorry the review took so long. I tried my best to go over most of the changes and also used Claude Code to help me review this PR.
src/swell/tasks/geos_marine/move_erase_da_restart-geos_marine.py
else:
    logger.info(' Copying file: ' + src)
    shutil.copy(src, dst_dir)
if not os.path.isfile(src):
Why were the `try` and `Exception` removed?
I tried addressing @shiklomanov-an's concern here:
Not sure if this is what he had in mind.
I think this is better! If any of these operations fails, it will throw a precisely typed exception with a precise error message.
Previously, if this operation failed, we would immediately throw an error anyway, except that we would drop all useful error information and replace it with a generic "Copying failed..." message.
`try: ... except: ...` blocks only make sense if we actually do something to handle the exception (e.g., if copying fails, try looking for the file in a different place, or create a blank file, or do some additional analysis for common problems and then raise a precise exception with the results of that analysis).
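To illustrate the point above with a minimal sketch (the function names and paths are hypothetical, not SWELL's actual task code): a bare operation already fails with a precisely typed exception, and a `try`/`except` earns its keep only when it adds context or handling rather than swallowing the original error.

```python
import os
import shutil


def copy_restart(src: str, dst_dir: str) -> None:
    # No try/except needed: a failure raises a precisely typed error
    # (FileNotFoundError, PermissionError, ...) naming the offending path.
    if not os.path.isfile(src):
        raise FileNotFoundError(f'Restart file not found: {src}')
    shutil.copy(src, dst_dir)


def copy_restart_with_context(src: str, dst_dir: str) -> None:
    # A try/except is justified only when it adds handling or context;
    # here we chain the original exception instead of replacing it with
    # a generic "Copying failed..." message.
    try:
        shutil.copy(src, dst_dir)
    except OSError as err:
        raise RuntimeError(f'Copying {src} -> {dst_dir} failed') from err
```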
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 19 comments.
Comments suppressed due to low confidence (3)
src/swell/tasks/move_erase_forecast_restart.py:56
`forecast_dir()` only accepts a single `paths` argument (string or list). This call passes two positional arguments (`'scratch'`, `'*_checkpoint'`) and will raise a `TypeError`. Pass a list (e.g., `['scratch', '*_checkpoint']`) or a single joined string.
src/swell/tasks/move_erase_forecast_restart.py:56 - Call to method `taskBase.forecast_dir` with too many arguments; should be no more than 1.
src/swell/tasks/move_erase_da_restart.py:14 - Import of 'shutil' is not used.
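A minimal sketch of the fix the review comment suggests. The `forecast_dir` stand-in below is hypothetical (the real method lives on `taskBase` and its forecast root comes from the experiment configuration); it only shows the single-argument calling convention.

```python
import os


def forecast_dir(paths):
    # Hypothetical stand-in for taskBase.forecast_dir: accepts ONE
    # argument, either a string or a list of path components.
    root = '/path/to/swell_exp_dir/GEOSgcm/forecast'  # placeholder root
    if isinstance(paths, str):
        paths = [paths]
    return os.path.join(root, *paths)


# Buggy call from the review comment (two positional arguments):
#   forecast_dir('scratch', '*_checkpoint')  ->  TypeError
# Suggested fix: pass a single list instead.
pattern = forecast_dir(['scratch', '*_checkpoint'])
```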
I addressed all of the concerns here; part 2 (#703) has more changes that make this particular PR merge-ready. I am still working on that; in the meantime, happy to address any additional concerns. For part 2 I need @mer-a-o's input. As we discussed before, I'm currently proposing two ways to execute GEOS. I don't want to put you on the spot, but I need to know which of the 3 options below you are leaning towards for the forecast. Sounds like I'm convincing @rtodling towards option 1, but no pressure 😄.
This is not an end-all be-all PR; there is still the ensemble forecast part of the puzzle. @mathomp4 @sshakoor1 my proposed approach in this PR is to call `swell/src/swell/suites/3dfgat_coupled_cycle/flow.cylc` (lines 163 to 171 in 23bc8a5). The reason I'm tagging you is that if
@Dooruk, tough question! :) For CF, I'd start from scratch since the setup is simpler compared to weather. If that goes well, I'd look to implement something similar for weather later. I did something similar in Skylab: I started with running the forecast via `gcm_run.j`, and later moved all the tasks under it into Skylab. So this feels like a natural path. I'm open to suggestions if you or others see it differently.
Sure, sounds like the best option! I will work on part 2 accordingly.
Following the comments on my previous PR, I created a more "lightweight" PR. I realize this is not an easy PR to review for others who have never executed GEOS within SWELL. Changes to R2D2 and the GEOS model version haven't helped simplify this either. This is still a work in progress, and I don't even think it will work in this form, but if #626 goes in first I could simplify this a little bit more.
I added files to ignore at the end. I realize I should be more selective about what goes into my commits. Also, better docstrings are required (I see @ashiklom's comments in other PRs). With those in mind:
There are two main things happening here (read the section about `gcm_run.j` at the end of the description for the model execution task): executing `gcm_run.j` directly. To facilitate this, a `forecast` directory was created under `{swell_exp_dir}/GEOSgcm/forecast`. This `forecast` folder is a replication of a GEOS experiment folder, with only a few changes regarding where HOMDIR and EXPDIR are defined. Model execution happens under `{swell_exp_dir}/GEOSgcm/forecast/scratch`, similar to typical GEOS model runs.

Why was this change necessary:
- (`/RC` files) in the forecast directory. This creates incompatibility while running/testing different GEOSgcm versions.
- The `forecast` dir can't be updated in the final `flow.cylc` if it is templated in a time-dependent way.
- `subprocess` simply didn't run with GEOSv12 on Milan nodes; I tried many combinations and didn't get past initialization.
- `gcm_run.j`: if users make a mistake in requesting sufficient SLURM nodes, GEOS tries submitting hundreds of instances to compensate for the lack of compute resources, and then NCCS will yell at you.
- The `gcm_run.j` and `gcm_setup.j` scripts are being or will be modernized. This work is underway but might take a long time (especially `gcm_run.j`).
- To use `gcm_run.j` in SWELL, some parts should be erased or commented out. Or, my idea is that there could be conditional sections in `gcm_run.j`, say `SWELL_active`; then `gcm_run.j` can skip those sections, which are mainly postprocessing anyway.
- A new suite, `3dfgat_coupled_cycle` (not the best name), is added.

Ignore these changes (no impact, or related to grid change):
src/swell/configuration/jedi/interfaces/geos_marine/model/background_error.yaml
src/swell/tasks/generate_b_climatology.py
Finally, a little primer on `gcm_run.j`. Let's consider `gcm_run.j` in 4 stages:

In the current implementation, SWELL handles 2 & 3 via Python and `subprocess`, and 1 is assumed to be set properly by the user, which caused trouble with the NCCS. For DA purposes, 4 (postprocessing) is explicitly handled by SWELL, but that is not the focus of this PR.
In this proposed implementation, the main difference is that we rely on `gcm_run.j` for 2 and 3 by conducting surgical edits via `PrepCoupledGeosRundir` at a few locations and running `gcm_run.j` directly from Cylc (which doesn't capture a failed exit status). I created the `3dfgat_coupled_cycle` suite for testing; it should work by default if anyone has time to check it out.
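For illustration only, the two mechanisms just described can be sketched in Python. The real edits live in `PrepCoupledGeosRundir` and the real script is csh, so both function names below and the exact `setenv` lines are assumptions: one function performs a "surgical edit" pointing `gcm_run.j`'s HOMDIR/EXPDIR at the SWELL forecast directory, and the other wraps execution so a failed exit status is surfaced (which launching the script directly from Cylc reportedly does not capture).

```python
import subprocess
from pathlib import Path


def point_gcm_run_at(forecast_dir: str, gcm_run: Path) -> None:
    # Surgical edit (sketch): rewrite the setenv HOMDIR/EXPDIR lines so
    # the script executes under the SWELL-managed forecast directory.
    lines = []
    for line in gcm_run.read_text().splitlines():
        stripped = line.strip()
        if stripped.startswith(('setenv HOMDIR', 'setenv EXPDIR')):
            var = stripped.split()[1]
            line = f'setenv {var} {forecast_dir}'
        lines.append(line)
    gcm_run.write_text('\n'.join(lines) + '\n')


def run_checked(cmd: list) -> None:
    # Run a command and raise on a non-zero exit status, so the failure
    # is visible to the calling workflow instead of passing silently.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f'{" ".join(cmd)} failed with exit status {result.returncode}\n'
            f'{result.stderr}'
        )
```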