Description
I'm trying to run MegatronApp on PyTorch 2.8.0.
Below are the solutions to the issues I encountered, which I hope will be helpful to others.
Environment
- PyTorch version: 2.8.0
- CUDA version: 12.8
1. ModuleNotFoundError: No module named 'torch'
When installing shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc with pip3 install -e ., I encountered a ModuleNotFoundError: No module named 'torch'.
Fix: Use pip3 install -e . --no-build-isolation. The setup.py file imports torch at build time; without --no-build-isolation, pip runs setup.py in an isolated build environment where the already-installed torch is not visible.
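For context, a PyTorch C++/CUDA extension's setup.py typically has the following shape, which is why it needs torch importable at build time. This is only an illustrative sketch; the extension name and source list here are placeholders, not the actual contents of the repo's setup.py:

```python
# Typical layout of a PyTorch C++/CUDA extension setup.py (placeholder names).
# The top-level import of torch's build helpers is what raises
# ModuleNotFoundError inside pip's isolated build environment.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension  # requires torch

setup(
    name="shm_tensor_new_rdma",  # placeholder
    ext_modules=[
        CUDAExtension(
            name="shm_tensor_new_rdma",
            sources=["shm_tensor.cpp"],  # placeholder source list
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```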
2. runtime error: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
In libc10.so compiled with the CXX11 ABI, this symbol has been changed to _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE.
I am using PyTorch 2.8.0+cu128, which uses the CXX11 ABI. Therefore, I suggest dynamically modifying the flag -D_GLIBCXX_USE_CXX11_ABI in setup.py of shm_tensor_new_rdma and shm_tensor_new_rdma_pre_alloc based on the value of torch._C._GLIBCXX_USE_CXX11_ABI.
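A minimal sketch of that change, assuming the usual torch.utils.cpp_extension build setup (the extension name and source list are placeholders; apply the same idea to both packages):

```python
import torch
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# Take the ABI setting from the installed torch instead of hard-coding it,
# so the extension links against libc10.so with matching symbol mangling.
abi_flag = f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}"

setup(
    name="shm_tensor_new_rdma",  # placeholder; same for shm_tensor_new_rdma_pre_alloc
    ext_modules=[
        CUDAExtension(
            name="shm_tensor_new_rdma",
            sources=["shm_tensor.cpp"],  # placeholder source list
            extra_compile_args={"cxx": [abi_flag], "nvcc": [abi_flag]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

With the flag matched to the installed PyTorch, the undefined torchCheckFail symbol resolves, since both sides mangle std::string the same way.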
3. API changes in PyTorch
In newer versions of torch, _write_item requires a new parameter serialization_format. Referring to the latest version of Megatron-LM, I modified megatron/core/dist_checkpointing/strategies/filesystem_async.py as follows:
Within the write_preloaded_data function, inside the try block, I added the following and then pass extra_kwargs when calling _write_item:

```python
import inspect

# Only newer torch versions accept serialization_format; detect it at runtime.
extra_kwargs = {}
if "serialization_format" in inspect.signature(_write_item).parameters:
    from torch.distributed.checkpoint.filesystem import SerializationFormat
    extra_kwargs["serialization_format"] = SerializationFormat.TORCH_SAVE

_write_item(*transform_list, stream, data, write_item, storage_key, **extra_kwargs)
```