Description
I'm trying to run MegatronApp on PyTorch 2.8.0.
Below are the solutions to the issues I encountered, which I hope will be helpful to others.
Environment
- PyTorch version: 2.8.0
- CUDA version: 12.8
1. ModuleNotFoundError: No module named 'torch'
When installing shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc with pip3 install -e ., I encountered a ModuleNotFoundError: No module named 'torch'.
Fix: Use pip3 install -e . --no-build-isolation. The setup.py file imports torch at build time; without --no-build-isolation, pip runs setup.py in an isolated build environment where the already-installed torch is not visible.
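For context, a PyTorch C++/CUDA extension's setup.py typically has the following shape, which is why it needs torch importable at build time. This is only an illustrative sketch; the extension name and source list here are placeholders, not the actual contents of the repo's setup.py:

```python
# Typical layout of a PyTorch C++/CUDA extension setup.py (placeholder names).
# The top-level import of torch's build helpers is what raises
# ModuleNotFoundError inside pip's isolated build environment.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension  # requires torch

setup(
    name="shm_tensor_new_rdma",  # placeholder
    ext_modules=[
        CUDAExtension(
            name="shm_tensor_new_rdma",
            sources=["shm_tensor.cpp"],  # placeholder source list
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```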
2. runtime error: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
In libc10.so compiled with the CXX11 ABI, this symbol has been changed to _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE.
I am using PyTorch 2.8.0+cu128, which uses the CXX11 ABI. Therefore, I suggest dynamically modifying the flag -D_GLIBCXX_USE_CXX11_ABI in setup.py of shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc based on the value of torch._C._GLIBCXX_USE_CXX11_ABI.
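A minimal sketch of that change, assuming the usual torch.utils.cpp_extension build setup (the extension name and source list are placeholders; apply the same idea to both packages):

```python
import torch
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Take the ABI setting from the installed torch instead of hard-coding it,
# so the extension links against libc10.so with matching symbol mangling.
abi_flag = f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}"

setup(
    name="shm_tensor_new_rdma",  # placeholder; same for shm_tensor_new_rdma_pre_alloc
    ext_modules=[
        CUDAExtension(
            name="shm_tensor_new_rdma",
            sources=["shm_tensor.cpp"],  # placeholder source list
            extra_compile_args={"cxx": [abi_flag], "nvcc": [abi_flag]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

With the flag matched to the installed PyTorch, the undefined torchCheckFail symbol resolves, since both sides mangle std::string the same way.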
3. API changes in PyTorch
In newer versions of torch, _write_item requires a new parameter serialization_format. Referring to the latest version of Megatron-LM, I modified megatron/core/dist_checkpointing/strategies/filesystem_async.py as follows:
Within the write_preloaded_data function, inside the try block, I added the following and then pass extra_kwargs when calling _write_item:

```python
import inspect

# Only newer torch versions accept serialization_format; detect it at runtime.
extra_kwargs = {}
if "serialization_format" in inspect.signature(_write_item).parameters:
    from torch.distributed.checkpoint.filesystem import SerializationFormat
    extra_kwargs["serialization_format"] = SerializationFormat.TORCH_SAVE

_write_item(*transform_list, stream, data, write_item, storage_key, **extra_kwargs)
```