
add ModelTorchDistributed backend for spatial parallelism along with tests #842

Draft
mahf708 wants to merge 2 commits into ai2cm:main from E3SM-Project:e3sm/oscar/sp-distributed-class

Conversation


@mahf708 mahf708 commented Feb 16, 2026

Add a ModelTorchDistributed backend to begin enabling training and inference in a spatially parallelized context.

Changes:

  • In fme.core.distributed:

    • ModelTorchDistributed (a DistributedBackend) in the model_torch_distributed module
    • model_torch_distributed_comm contains comm utilities for ModelTorchDistributed
    • model_torch_distributed_utils contains tensor utilities for ModelTorchDistributed
  • Expanded fme.core.distributed.parallel_tests and verified with NPROC=1,2,3,4 (on 4xA100 pod)

  • Updated signatures of base classes as needed

  • Tests added

  • Optional (for now unpinned) dependency on physicsnemo

Closes #749


mahf708 commented Feb 16, 2026

This builds on top of @odiazib's PR (#749) and previous draft work (#719). I decided to significantly increase test coverage; arguably overkill, but I thought these tests would help with the planned upcoming development (data loading/writing, training operations, layering, etc.). There are some tricky (and not very pretty) parts in how we are borrowing comm utilities from makani, but they are hidden well enough for now. In particular, some nontrivial hacks were needed to make testing efficient and easy. Some of the tests are fairly trivial in nature (e.g., barrier), and they balloon the size of this PR by a fair bit. Hopefully we will keep future PRs to <1000 lines. Copying @odiazib, @elynnwu, and @mcgibbon for awareness. Feedback welcome.

Tested locally with make parallel_tests NPROC=X for X=1,2,3,4. Also ran the full test suite on A100 GPUs as in the CI files. All passing as far as I could tell (except for the timed tests running overtime).

run: |
python -m pip install uv
uv pip install --system -c constraints.txt -e .[dev]
uv pip install --system --no-build-isolation -c constraints.txt -e .[dev,spatial-parallelism]
Contributor

Is it possible to run the very-fast tests without spatial parallelism? It's nice to have a test that doesn't use them to make sure the "optional" part of optional dependency is done correctly.
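
A minimal sketch of how such a guard could look, assuming physicsnemo is the optional dependency and that the backend lives at fme.core.distributed.model_torch_distributed as the PR summary describes (the marker name and test are hypothetical):

import importlib.util

import pytest

HAS_PHYSICSNEMO = importlib.util.find_spec("physicsnemo") is not None

requires_spatial_parallelism = pytest.mark.skipif(
    not HAS_PHYSICSNEMO,
    reason="physicsnemo (optional spatial-parallelism extra) is not installed",
)


@requires_spatial_parallelism
def test_spatial_backend_is_importable():
    # Only runs when the optional extra is installed; the very-fast suite
    # still passes on a base install because this test is skipped there.
    from fme.core.distributed.model_torch_distributed import ModelTorchDistributed

    assert hasattr(ModelTorchDistributed, "is_available")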



@requires_parallel
def test_gather_raises_not_implemented(monkeypatch):
Contributor

Suggestion: Make this a unit test of the ModelTorchDistributed backend, rather than an integration test involving Distributed.
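
A rough sketch of the suggested unit test; constructing ModelTorchDistributed directly and the gather call signature are assumptions here, since the actual backend may need a fixture to set it up:

import pytest
import torch

from fme.core.distributed.model_torch_distributed import ModelTorchDistributed


def test_gather_raises_not_implemented_unit():
    # Exercise the backend directly rather than going through Distributed.
    backend = ModelTorchDistributed()  # assumption: constructible in a test context
    with pytest.raises(NotImplementedError):
        backend.gather(torch.zeros(1))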



@requires_parallel
def test_gather_irregular_raises_not_implemented(monkeypatch):
Contributor

Question: Do we need tests for these NotImplementedError's? We do plan to implement them.

else:
monkeypatch.delenv("W_PARALLEL_SIZE", raising=False)

result = ModelTorchDistributed.is_available()
Contributor

Suggestion: If the logic can be refactored to a helper in some way, or you have is_available take input arguments that default to the environment variables, you can unit test this without environment monkeypatching.
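
A sketch of that refactor, assuming for illustration that availability is keyed on the W_PARALLEL_SIZE environment variable used in the test above (the PR's actual availability logic may differ):

import os


class ModelTorchDistributed:
    @classmethod
    def is_available(cls, w_parallel_size: str | None = None) -> bool:
        # Default to the environment, but let callers (and tests) pass the
        # value explicitly so no monkeypatching is needed.
        if w_parallel_size is None:
            w_parallel_size = os.environ.get("W_PARALLEL_SIZE")
        return w_parallel_size is not None and int(w_parallel_size) > 1


def test_is_available_with_explicit_value():
    assert ModelTorchDistributed.is_available(w_parallel_size="2")
    assert not ModelTorchDistributed.is_available(w_parallel_size="1")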

return singleton

@classmethod
def reset(cls) -> None:
Contributor

I am nervous about something like this, because distributed backends generally use global state which doesn't get properly reset when we do something like this.

"""
return self._distributed.total_ranks

def get_local_rank(self) -> int:
Contributor

Question: What is a "local" rank?

def get_local_rank(self) -> int:
return self._distributed.get_local_rank()

def get_sampler(
Contributor

It would be great to add a test of this; I think you're right about the changes.

return self._distributed.local_batch_size(batch_size)

def reduce_mean(self, tensor: torch.Tensor) -> torch.Tensor:
def reduce_mean(self, tensor: torch.Tensor, group=None) -> torch.Tensor:
Contributor

Suggestion: We could abstract away concerns like specific group names to make it easier on users. For example, can we use a data_parallel_only: bool = True argument instead? or a kind: Literal["data_parallel", "model", "all"] = "data_parallel"?
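
A sketch of what that interface could look like; the _comm_group_for helper that maps the user-facing kind onto a process group is hypothetical:

from typing import Literal

import torch
import torch.distributed as dist


class ModelTorchDistributed:
    def reduce_mean(
        self,
        tensor: torch.Tensor,
        kind: Literal["data_parallel", "model", "all"] = "data_parallel",
    ) -> torch.Tensor:
        # Resolve the high-level kind to a process group internally so group
        # names never leak into the public interface.
        group = self._comm_group_for(kind)  # hypothetical helper
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        return tensor / dist.get_world_size(group=group)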

@@ -169,26 +191,28 @@ def get_local_slices(

def reduce_sum(self, tensor: torch.Tensor) -> torch.Tensor:
Contributor

Generally speaking we need reductions along the data-parallel dimension to get global maps of mean outputs in our aggregators. We probably also need to do global area-weighted means for certain operations in the correctors.

Author

We also need to reduce over other orthogonal dimensions (say, for zonal means), but that can wait a bit.

assert dist.local_batch_size(global_batch) == expected


def test_local_batch_size_not_divisible():
Contributor

Does this test need to be run in a parallel context or would it work in serial?

def get_local_slices(
self,
tensor_shape,
rank: int | None = None,
Contributor

Issue: Both the rank and data_parallel_dim arguments here are ignored.

Suggestion: I have this PR to let the rank argument get removed: #839

I don't know what needs to be done yet about data_parallel_dim.

Comment on lines +338 to +345
def comm_get_size(self, key: str):
return self._distributed.comm_get_size(key)

def comm_get_group(self, key: str):
return self._distributed.comm_get_group(key)

def comm_get_rank(self, key: str):
return self._distributed.comm_get_rank(key)
Contributor

Issue: These are low-level operations. Can we hide them inside backend methods? Other backends don't have a comm.

Comment on lines +43 to +44
from physicsnemo.distributed.manager import DistributedManager
from physicsnemo.distributed.config import ProcessGroupNode, ProcessGroupConfig
Contributor

@mcgibbon mcgibbon Feb 17, 2026

If this is all we're using from physicsnemo, we could consider vendorizing/forking the code, i.e. copy-pasting it into a subdirectory here. The last time I checked these seemed pretty isolated / didn't depend on other infrastructure in physicsnemo, and it would avoid a dependency.

return 1

def comm_get_group(self, key: str):
return None
Contributor

Question: Is this a valid return value for comm_get_group?

...

@abstractmethod
def comm_get_size(self, key: str): ...
Contributor

Issue: These need return value type hints.
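
For example (the ProcessGroup | None return on comm_get_group is an assumption, prompted by the question above about the serial backend returning None):

from abc import ABC, abstractmethod

from torch.distributed import ProcessGroup


class DistributedBackend(ABC):
    @abstractmethod
    def comm_get_size(self, key: str) -> int: ...

    @abstractmethod
    def comm_get_rank(self, key: str) -> int: ...

    @abstractmethod
    def comm_get_group(self, key: str) -> ProcessGroup | None: ...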

def shutdown(self):
return self._distributed.shutdown()

def comm_get_size(self, key: str):
Contributor

These comm_ methods are only used in tests; do we need them, and do we need them to be public on Distributed?

return batch_size

def reduce_mean(self, tensor: torch.Tensor) -> torch.Tensor:
def reduce_mean(self, tensor: torch.Tensor, group=None) -> torch.Tensor:
Contributor

I would like to avoid adding low-level information like the group names above this PR, but we should talk about it. For an initial PR we should avoid features that need this.

from fme.core.cuhpx.sht import SHT as CuHpxSHT
from fme.core.cuhpx.sht import iSHT as CuHpxiSHT
from fme.core.device import get_device
from fme.core.distributed import Distributed
Contributor

These changes to gridded_ops and their associated tests in parallel_tests can/probably should be split into their own PR.

@@ -0,0 +1,81 @@
import logging
Contributor

This test should probably be split off with the gridded_ops changes into its own PR.

@@ -0,0 +1,82 @@
import pytest
Contributor

What coverage does this file add that isn't already covered by test_local_slices.py?

Contributor

It would be nice if we could test this code in a unit test without the full process of setting up Distributed. But it's probably a test of DistributedManager.


# Create test data
data_tensor_host = torch.randn(1, 2, nx, ny, device="cpu")
area_weights_host = torch.ones(nx, ny).to("cpu") * 5
Contributor

Issue: Please update area weights to not use ones (maybe 1 plus a random uniform?) to cover how area is applied.
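
One way to follow that suggestion (array sizes here are illustrative): strictly positive but non-uniform weights, so the area weighting cannot silently cancel out the way a constant field does.

import torch

nx, ny = 8, 16  # illustrative; keep whatever sizes the test already uses
data_tensor_host = torch.randn(1, 2, nx, ny, device="cpu")
# 1 + uniform noise: non-constant, strictly positive area weights.
area_weights_host = 1.0 + torch.rand(nx, ny, device="cpu")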

# depending on the batch/data parallel index/rank.
x_global_ranked = x_global_base + dist.data_parallel_rank
x_local_ranked = x_global_ranked[dist.get_local_slices(global_shape, dist.rank)]
x_local_reduced = dist.reduce_mean(x_local_ranked)
Contributor

I don't understand how this test passes under spatial parallelism, given this isn't a reduction over the data-parallel group, and the test is supposed to require that it is.

Author

This never passes under spatial parallelism :) because it is slyly being skipped by the pytest hackery 😺 (which we will get rid of in the next edit).

Author

For the record, I am not surprised at all that the CPU tests bombed. I didn't run a single one of these tests, and I totally forgot about that until after I stopped working on this. The revision will address those too.

@mahf708 mahf708 marked this pull request as draft February 18, 2026 19:41