
Execute GEOS via calling gcm_run.j -- part 1 #677

Open

Dooruk wants to merge 35 commits into develop from feature/exec_geos_direct_part1

Conversation


Dooruk commented Dec 15, 2025

Following the comments on my previous PR, I created a more "lightweight" PR. I realize this is not an easy PR to review for others who have never executed GEOS within SWELL. Changes to R2D2 and the GEOS model version haven't helped to simplify this either. This is still a work in progress, and I don't even think it will work in this form, but if #626 goes in first I could simplify this a little bit more.

I added files to ignore at the end. I realize I should be more selective about what goes into my commits. Better docstrings are also required (I see @ashiklom's comments in other PRs). With those in mind...

There are two main things happening here (for the model execution task, read the primer on gcm_run.j at the end of this description):

  1. Cylc (flow.cylc) calls gcm_run.j directly. To facilitate this, a forecast directory was created under {swell_exp_dir}/GEOSgcm/forecast. This forecast folder replicates a GEOS experiment folder, with only a few changes to where HOMDIR and EXPDIR are defined. Model execution happens under {swell_exp_dir}/GEOSgcm/forecast/scratch, similar to typical GEOS model runs.

Why was this change necessary:

  • We want SWELL to be less involved in the GEOS model execution task(s). The previous method required lots of file manipulation (in particular due to the boundary condition files, AKA /RC files) in the forecast directory. This creates incompatibilities when running/testing different GEOSgcm versions.
  • With Cylc templating, the forecast directory can't be updated in the final flow.cylc if it is templated in a time-dependent way.
  • subprocess simply didn't run GEOSv12 on Milan nodes; I tried many combinations and never got past initialization.
  • Defining sufficient nodes, MPI layouts, etc. is handled by gcm_run.j. If users make a mistake when requesting SLURM nodes, GEOS tries submitting hundreds of instances to compensate for the lack of compute resources, and then NCCS will yell at you.
  • (long-term relevance) The gcm_run.j and gcm_setup.j scripts are being, or will be, modernized. This work is underway but might take a long time (especially for gcm_run.j).

⚠️ This means that to use gcm_run.j in SWELL, some parts need to be erased or commented out. Alternatively, my idea is that gcm_run.j could contain conditional sections, guarded by something like a SWELL_active flag, so that gcm_run.j can skip those sections, which are mainly postprocessing anyway. A rough sketch of what the SWELL-side edits could look like is shown below.
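To make this concrete, here is a minimal sketch of such "surgical" edits on the copied gcm_run.j, assuming a Python helper along the lines of PrepCoupledGeosRunDir; the SWELL_active flag, the regular expressions, and the shebang handling are illustrative assumptions, not the actual implementation:

    import re
    from pathlib import Path

    def patch_gcm_run(gcm_run_path, homdir, expdir):
        """Illustrative only: repoint the copied gcm_run.j at the SWELL forecast
        directory and set a flag that conditional sections could test."""
        script = Path(gcm_run_path)
        text = script.read_text()

        # Repoint HOMDIR/EXPDIR to the SWELL experiment's forecast directory.
        text = re.sub(r'setenv\s+HOMDIR\s+\S+', f'setenv HOMDIR {homdir}', text)
        text = re.sub(r'setenv\s+EXPDIR\s+\S+', f'setenv EXPDIR {expdir}', text)

        # Hypothetical flag a gcm_run.j with conditional sections could check
        # (assumes the script starts with a '#!/bin/csh -f' shebang).
        text = text.replace('#!/bin/csh -f', '#!/bin/csh -f\nsetenv SWELL_active 1', 1)

        script.write_text(text)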

  2. Model-differentiated tasks were created in this PR. There are new tasks that are only executed in coupled mode, and a new suite called 3dfgat_coupled_cycle (not the best name) is added.

⚠️ @mranst already introduced this in some capacity in #626, so if we merge the model-differentiated tasks PR first, I can make the necessary changes to my PR and it might be less confusing. It was my mistake to include this change in this PR.

Ignore these changes (they have no impact or are related to a grid change):

src/swell/configuration/jedi/interfaces/geos_marine/model/background_error.yaml
src/swell/tasks/generate_b_climatology.py

Finally, a little primer on gcm_run.j

Let's consider gcm_run.j in 4 stages:

  1. SLURM & node assignment
  2. Preprocessing
  3. Execution
  4. Postprocessing

In the current implementation, SWELL handles 2 & 3 via Python and subprocess, and 1 is assumed to be set properly by the user, which has caused trouble with NCCS. For DA purposes, 4 (postprocessing) is explicitly handled by SWELL, but that is not the focus of this PR.
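For context, the subprocess-based execution in the current approach is roughly along these lines; this is a simplified sketch, not the actual RunGeosExecutable code, and the command line, process count, and scratch path are illustrative assumptions:

    import subprocess

    # Simplified sketch of the current (subprocess) execution path: SWELL prepares
    # the run directory itself and then launches the model executable directly.
    # The command, -np value, and scratch path below are illustrative only.
    cmd = ['mpirun', '-np', '96', './GEOSgcm.x']
    result = subprocess.run(cmd, cwd='/path/to/swell_exp_dir/GEOSgcm/forecast/scratch',
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'GEOSgcm.x failed with exit code {result.returncode}:\n'
                           f'{result.stderr}')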

In this proposed implementation, the main difference is that we rely on gcm_run.j for 2 and 3, by conducting surgical edits via PrepCoupledGeosRunDir at a few locations and running gcm_run.j directly from Cylc (which, note, doesn't capture a failed exit status):

    [[RunGeos]]
        script = "{{experiment_path}}/forecast/gcm_run.j"
        platform = {{platform}}
        [[[job]]]
            shell = /bin/csh
        [[[directives]]]
        {%- for key, value in scheduling["RunGeos"]["directives"]["all"].items() %}
            --{{key}} = {{value}}
        {%- endfor %}

I created the 3dfgat_coupled_cycle suite for testing; it should work by default if anyone has time to check it out.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Dooruk commented Feb 5, 2026

I would really like to get this PR in soon, and I'm aware other PRs are piling up. For it to function and pass code tests, etc., it will require a part 2, but I will work on that once I address potential reviewer comments here.

This particular PR aims to achieve a few main things, so if you are reviewing, please focus mainly on these aspects:

  • Introduce RunGeos, which might be summarized as a "transient" task that will only live under suites that run GEOS. This task calls gcm_run.j directly from Cylc.
  • Make GEOS-running tasks more application specific, e.g. PrepCoupledGeosRunDir would only be used by coupled suites.
  • Add GetCoupledRestarts, a new task that can use Discover restarts or R2D2 (the latter is a placeholder option for now); see the sketch after this list.
  • Switch the default to GEOSv12rc**; the rumor mill says an official release is imminent. SWELL-based ADAS and ODAS will use GEOSv12+. I know this is not necessarily the case for GEOS-CF.
  • Start using model-specific tasks.
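As a rough illustration of that restart-source switch, here is a minimal sketch; the 'discover'/'r2d2' option names, the function name, and the file glob are assumptions and may not match the actual GetCoupledRestarts task:

    import glob
    import shutil

    def fetch_restarts(source, restart_dir, cycle_dir):
        """Hypothetical dispatch between restart sources; the real
        GetCoupledRestarts task may be structured differently."""
        if source == 'discover':
            # Copy restart files from an existing directory on Discover.
            for restart in glob.glob(f'{restart_dir}/*_rst*'):
                shutil.copy(restart, cycle_dir)
        elif source == 'r2d2':
            # Placeholder for now: fetching restarts via R2D2 is not wired up yet.
            raise NotImplementedError('R2D2 restart fetching is a placeholder option')
        else:
            raise ValueError(f'Unknown restart source: {source}')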

At the risk of repeating myself (for more details, read the description of this PR), Ricardo, Maryam, and I had a long discussion last week about executing GEOS in SWELL. It boils down to these two ways:

  1. (This PR) Copy an existing GEOS experiment folder, make minimal changes, and call gcm_run.j directly
  2. (Current approach) Point to an external GEOS experiment folder, copy everything, break gcm_run.j down into Python methods (as is done in prep_geos_run_dir.py), and execute GEOSgcm.x using Python subprocess in RunGeosExecutable.

I am switching exclusively to method 1; @mer-a-o voiced a desire to implement some type of method 2 (of which she has a Skylab application). @rtodling is not fully convinced either way, but is leaning towards 1.

I'm leaving the second option intact for now to help @mer-a-o choose between 1 & 2. Unfortunately, I'm not able to run GEOS v12rc** using method 2 on Milan nodes, so approach 2 won't work for GEOSv12** for now (most likely due to using mpirun in the subprocess approach). FWIW, GEOS-CF currently uses GEOSv10; I don't know what the target version is for a JEDI-based release. @mer-a-o is looking into gcm_setup and such at the moment, so she can provide her input.

Anyway, I'm happy to have a dedicated meeting on this, maybe even including @tclune. There are just so many different modes of running gcm_run.j that, IMO, method 2 makes it hard for SWELL to support multiple versions of GEOSgcm.

Dooruk marked this pull request as ready for review on February 5, 2026 18:31

tclune commented Feb 5, 2026

@Dooruk Happy to meet when convenient, but let's try to involve Matt and Shayon as they have much more experience with the details here.


mer-a-o commented Feb 5, 2026

@Dooruk I tried running 3dvar_cycle experiment and got this error:

PrepGeosRunDir:   Swell called ABORT: In config class, trying to get variable 'geos_experiment_directory' but this variable was not created. Ensure that the variable is in the experiment configuration and specified in the question list for the task (task_questions.py).


Dooruk commented Feb 5, 2026

@Dooruk I tried running 3dvar_cycle experiment and got this error:

PrepGeosRunDir:   Swell called ABORT: In config class, trying to get variable 'geos_experiment_directory' but this variable was not created. Ensure that the variable is in the experiment configuration and specified in the question list for the task (task_questions.py).

Sorry, can you try with this branch: feature/exec_geos_direct_part2 (which will fail during initialization :)

@Dooruk Happy to meet when convenient, but lets try to involve Matt and Shayon as they have much more experience with the details here.

Sure, I will bring this up in our dev meeting.


mer-a-o left a comment


Sorry the review took so long. I tried my best to go over most of the changes and also used Claude Code to help me review this PR.

    else:
        logger.info(' Copying file: '+src)
        shutil.copy(src, dst_dir)
    if not os.path.isfile(src):

Why were try and Exception removed?


Dooruk commented Feb 10, 2026


I tried addressing @shiklomanov-an's concern here:

#677 (comment)

Not sure if this is what he had in mind.


I think this is better! If any of these operations fails, it will throw a precisely typed exception with a precise error message.

Previously, if this operation failed, we would immediately throw an error anyway, except that we would drop all useful error information and replace it with a generic "Copying failed..." message.

try: ... except: ... blocks only make sense if we actually do something to handle the exception (e.g., if copying fails, try looking for the file in a different place, or create a blank file, or do some additional analysis for common problems and then raise a precise exception with the results of that analysis).
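As a generic illustration of that point (not code from this PR; the fallback-path argument is an assumption), a handler only earns its keep when it adds information or recovery:

    import shutil

    # Anti-pattern: catches everything, drops the real error, raises a vaguer one.
    #     try:
    #         shutil.copy(src, dst_dir)
    #     except Exception:
    #         raise Exception('Copying failed...')

    # A handler is worthwhile when it actually does something with the failure,
    # e.g. trying an alternative location before giving up.
    def copy_with_fallback(src, fallback_src, dst_dir):
        try:
            shutil.copy(src, dst_dir)
        except FileNotFoundError:
            shutil.copy(fallback_src, dst_dir)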


Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 19 comments.

Comments suppressed due to low confidence (3)

  • src/swell/tasks/move_erase_forecast_restart.py:56 - forecast_dir() only accepts a single paths argument (string or list). This call passes two positional arguments ('scratch', '*_checkpoint') and will raise a TypeError. Pass a list (e.g., ['scratch', '*_checkpoint']) or a single joined string.
  • src/swell/tasks/move_erase_forecast_restart.py:56 - Call to method taskBase.forecast_dir with too many arguments; should be no more than 1.
  • src/swell/tasks/move_erase_da_restart.py:14 - Import of 'shutil' is not used.


Dooruk changed the title from "Execute GEOS via calling gcm_run.j and create new model specific tasks" to "Execute GEOS via calling gcm_run.j -- part 1" on Feb 12, 2026

Dooruk commented Feb 12, 2026

I addressed all of the concerns here; part 2 (#703) has more changes that make this particular PR merge-ready. I am still working on that. In the meantime, I'm happy to address any additional concerns.

For part 2 I need @mer-a-o's input. As we discussed before, I'm currently proposing two ways to execute GEOS. I don't want to put you on the spot, but I need to know which way you are leaning for the forecast among the options below. It sounds like I'm convincing @rtodling towards option 1, but no pressure 😄:

  1. Adapt to executing gcm_run.j directly. I'm happy to work with the ADAS or CF group on adaptation/explanation. This is reasonable given the way gcm_setup is structured.

  2. Use the existing approach of executing GEOSgcm.x directly (this will require some modifications on my end and yours; I can address them as best I can in part 2, then hand it to you). As I mentioned before, the subprocess part didn't work for me on Milan nodes, but I can walk you through the execution part.

  3. Start from scratch and recreate your Skylab approach, which is somewhat similar to option 2 but uses a shell script to execute GEOSgcm.x.

  4. ?? Do you have any other suggestions?

This is not an end-all-be-all PR; there is still the ensemble forecast part of the puzzle.

@mathomp4 @sshakoor1 my proposed approach in this PR is to call gcm_run.j directly from Cylc. The relevant part for you is here:

    [[RunGeos]]
        script = "{{experiment_path}}/GEOSgcm/forecast/gcm_run.j"
        platform = {{platform}}
        [[[job]]]
            shell = /bin/csh
        [[[directives]]]
        {%- for key, value in scheduling["RunGeos"]["directives"]["all"].items() %}
            --{{key}} = {{value}}
        {%- endfor %}

The reason I'm tagging you is that if gcm_run.j is executed in Cylc/SWELL, some post-processing parts of gcm_run.j need to be deactivated. My csh/GCM-template suggestion is to make certain parts of gcm_run.j conditional on whether SWELL is being used or not. It could be a really dumb suggestion depending on the complexity of the gcm_run.j.tmpl structure. We can discuss this during an upcoming SWELL dev meeting.


mer-a-o commented Feb 12, 2026

@Dooruk, tough question! :)
For weather, I'm leaning toward option 1 with gcm_setup, since there are already many pieces in place for non-cycling experiments and also marine is using it, which will help me understand all the details and caveats before going further.

For CF, I'd start from scratch since the setup is simpler compared to weather. If that goes well, I'd look to implement something similar for weather later.

I did something similar in Skylab, started with running the forecast via gcm_run.j, and later moved all tasks under it into Skylab. So this feels like a natural path.

I'm open to suggestions if you or others see it differently.


Dooruk commented Feb 13, 2026

@Dooruk, tough question! :) For weather, I'm leaning toward option 1 with gcm_setup, since there are already many pieces in place for non-cycling experiments and also marine is using it, which will help me understand all the details and caveats before going further.

For CF, I'd start from scratch since the setup is simpler compared to weather. If that goes well, I'd look to implement something similar for weather later.

Sure, sounds like the best option! I will work on part2 accordingly.
