[BUG] Run MegatronApp on PyTorch 2.8.0 #34

@zbceric

Description

I'm trying to run MegatronApp on PyTorch 2.8.0.
Below are the solutions to the issues I encountered, which I hope will be helpful to others.

Environment

  • PyTorch version: 2.8.0
  • CUDA version: 12.8

1. ModuleNotFoundError: No module named 'torch'
When installing shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc with pip3 install -e ., the build fails with ModuleNotFoundError: No module named 'torch'.
Fix: Use pip3 install -e . --no-build-isolation. The setup.py file imports torch, and without the --no-build-isolation option pip runs setup.py in an isolated build environment that does not see the torch package installed in the current environment, so the import fails.

2. runtime error: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
The missing symbol ends in RKSs, which is the mangled form of a const std::string& parameter under the pre-CXX11 ABI. In a libc10.so compiled with the CXX11 ABI, the same function is exported as _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE instead, so the old name cannot be resolved at load time.
I am using PyTorch 2.8.0+cu128, which is built with the CXX11 ABI. I therefore suggest setting the -D_GLIBCXX_USE_CXX11_ABI flag in the setup.py of shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc dynamically, based on the value of torch._C._GLIBCXX_USE_CXX11_ABI.
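A minimal sketch of how setup.py could derive the flag from the installed torch build. The helper names cxx11_abi_flag and detect_torch_abi are hypothetical, and the fallback used when torch cannot be imported is an assumption added only so the sketch runs on its own; the real setup.py needs torch to be importable anyway.

```python
def cxx11_abi_flag(use_cxx11_abi: bool) -> str:
    """Build the compiler define that matches the given ABI setting."""
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_cxx11_abi)}"


def detect_torch_abi(default: bool = True) -> bool:
    """Return the ABI setting of the installed torch build.

    The `default` fallback is an assumption for illustration only; it is
    used when torch is not importable in the current environment.
    """
    try:
        import torch
    except ImportError:
        return default
    return bool(torch._C._GLIBCXX_USE_CXX11_ABI)


# In setup.py, this list would replace a hard-coded ABI define in the
# compile arguments passed to the extension.
extra_compile_args = [cxx11_abi_flag(detect_torch_abi())]
print(extra_compile_args)
```

With this approach the extension is compiled with the same ABI setting as the torch libraries it links against, whichever value that is.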

3. API changes in PyTorch
In newer versions of torch, _write_item requires a new parameter serialization_format. Referring to the latest version of Megatron-LM, I modified megatron/core/dist_checkpointing/strategies/filesystem_async.py as follows:

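Within the write_preloaded_data function, inside the try block, I added:

import inspect
# Collect optional keyword arguments; stays empty on older torch versions.
extra_kwargs = {}
# Only pass serialization_format if this torch version's _write_item accepts it.
if "serialization_format" in inspect.signature(_write_item).parameters:
    from torch.distributed.checkpoint.filesystem import SerializationFormat
    extra_kwargs["serialization_format"] = SerializationFormat.TORCH_SAVE

Then pass extra_kwargs when calling _write_item: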

_write_item(*transform_list, stream, data, write_item, storage_key, **extra_kwargs)
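The signature check above can be tried in isolation, without torch installed. The sketch below uses two stand-in functions, write_item_old and write_item_new, which are hypothetical and only mimic an older and a newer _write_item signature; the string "torch_save" is a placeholder for SerializationFormat.TORCH_SAVE.

```python
import inspect


def write_item_old(stream, data):
    """Stand-in for a _write_item without serialization_format."""
    return {"data": data}


def write_item_new(stream, data, serialization_format):
    """Stand-in for a _write_item that requires serialization_format."""
    return {"data": data, "serialization_format": serialization_format}


def call_write_item(write_item_fn, stream, data):
    """Call write_item_fn, adding serialization_format only if it is accepted."""
    extra_kwargs = {}
    if "serialization_format" in inspect.signature(write_item_fn).parameters:
        # Placeholder value; the real code uses SerializationFormat.TORCH_SAVE.
        extra_kwargs["serialization_format"] = "torch_save"
    return write_item_fn(stream, data, **extra_kwargs)


old_result = call_write_item(write_item_old, None, b"x")
new_result = call_write_item(write_item_new, None, b"x")
```

Because the decision is made from the callee's signature rather than from a version string, the same call site works against both variants.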
