
add ModelTorchDistributed with tests #847

Open
mahf708 wants to merge 3 commits into ai2cm:main from mahf708:modtordis

Conversation


mahf708 commented Feb 18, 2026

add ModelTorchDistributed with tests

Changes:

  • fme.core.distributed has a new ModelDistributedBackend that allows for parallelism over spatial dimensions as well as batch/data (see the sketch below).

  • torch is pinned to a minimum of 2.4.0 to use newer distributed facilities.

  • Tests added

  • If dependencies changed, "deps only" image rebuilt and "latest_deps_only_image.txt" file updated

Closes #749
Closes #842
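
The thread does not show the backend's internals, so the following is only a rough sketch of what combined data and spatial parallelism can look like with the torch >= 2.4 distributed facilities; the function name, mesh axes, and dimension names below are illustrative and are not the PR's actual API.

# Illustrative sketch only; not the PR's ModelDistributedBackend.
# Assumes torch.distributed can be initialized for
# data_parallel * spatial_parallel ranks (e.g. via torchrun env vars).
import torch
from torch.distributed.device_mesh import init_device_mesh


def build_mesh(data_parallel: int, spatial_parallel: int):
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    # 2-D process mesh: one axis for batch/data parallelism,
    # one axis for spatial (model) parallelism.
    mesh = init_device_mesh(
        device_type,
        (data_parallel, spatial_parallel),
        mesh_dim_names=("data", "spatial"),
    )
    # Per-axis sub-meshes provide separate process groups, e.g. gradient
    # all-reduce over "data" and exchanges of spatial shards over "spatial".
    return mesh["data"], mesh["spatial"]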

Comment on lines 61 to 67
mahf708 (Author)

I'm not thrilled with these names or this general method of forcing; if we have to go this route, I'd prefer to call the new one "model" (since it can parallelize over all sorts of dims/tags).

mcgibbon (Contributor) commented Feb 18, 2026

For now, can you keep the behavior Oscar was going to use? That is, if (let's call it) FME_DISTRIBUTED_H or _W is set > 1, then the spatial backend is used. If you need a way to force it, perhaps use the spatial backend if one or both are set, and use the torch backend only if both are unset. I don't think we currently need a way to force non-distributed from the CLI, so we shouldn't add that feature.
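
A minimal sketch of the selection logic being suggested here; the environment variable names come from the comment above, while the returned backend labels and the function name are placeholders, not identifiers from the PR.

import os


def _choose_backend() -> str:
    # Suggested default: if either spatial extent is set above 1, use the
    # spatial ("model") backend; otherwise fall back to the plain torch
    # data-parallel backend.
    h = int(os.environ.get("FME_DISTRIBUTED_H", "1"))
    w = int(os.environ.get("FME_DISTRIBUTED_W", "1"))
    if h > 1 or w > 1:
        return "model"  # spatial + data parallelism
    return "torch"  # data parallelism only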

mahf708 (Author)

Mostly addressed, but I'd rather force people to pick one explicitly; we can streamline defaults later.

logger.debug("Barrier on rank %d", self._rank)
torch.distributed.barrier(device_ids=self._device_ids)

def shutdown(self):
mcgibbon (Contributor) commented Feb 18, 2026

Not for this PR, and not necessarily related to spatial parallelism, but we should think about defining a context manager for parallelism that makes sure cleanup happens when the context exits, kind of like we do with GlobalTimer. I don't think we currently call it properly in the unit tests, for example.
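
A minimal sketch of what such a context manager could look like, using the Distributed singleton and the shutdown() method visible in the diff above; the helper name and the import path are assumptions, and this is not code from the PR.

from contextlib import contextmanager

# Import path assumed from the PR description; adjust to the actual module.
from fme.core.distributed import Distributed


@contextmanager
def distributed_context():
    # Hypothetical helper, analogous to how GlobalTimer is managed: guarantee
    # that shutdown() runs even if the enclosed code raises.
    dist = Distributed.get_instance()
    try:
        yield dist
    finally:
        dist.shutdown()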

mahf708 (Author)

FWIW, the torch backend is the one with problematic teardown for some reason; the new one is slightly cleaner. I can look into why...

"""
dist = Distributed.get_instance()
global_shape = (2, 4, 4)
n_dp = dist.total_data_parallel_ranks
mcgibbon (Contributor)

Is this desired? I feel like we should set a constant shape like 4 that covers the cases we plan to run with, so that special values like 3 data parallel ranks are more interesting (which my test may need to be refactored to properly manage, idk).

mahf708 (Author)

For now, I think it is an easy cop-out to get things going and running with arbitrary tests. If we hardcode the batch dimension, we will need to instrument pytest skipping and such. I like the idea that these tests can run successfully with 5,000,000 ranks ;)
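
A sketch of the trade-off being described: scaling the test's leading (batch) dimension with the number of data-parallel ranks means every rank gets a non-empty shard at any world size, so no pytest skips are needed. The actual test's shape logic is not shown in full above; the scaling factor below is arbitrary and purely illustrative.

# Illustrative only; not the exact logic of the test in the diff.
dist = Distributed.get_instance()
n_dp = dist.total_data_parallel_ranks
# Leading dimension grows with the number of data-parallel ranks,
# so the test runs unchanged for any world size.
global_shape = (2 * n_dp, 4, 4)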

mcgibbon (Contributor)

Good tests!

"tensorly-torch",
"torch-harmonics==0.8.0",
"torch>=2",
"torch>=2.4.0",
mcgibbon (Contributor)

No action needed here; I think our dependency-only image already uses 2.7.1, and I don't believe there's anything in this PR that warrants rebuilding it again.

mahf708 (Author)

thanks!
