This repository was archived by the owner on Feb 6, 2025. It is now read-only.
The important part of the failure is as follows (run on 3 nodes, each node has 2 GPUs):
```
Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
```
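For context, the failing check can be sketched roughly like this (the `bitmap` body and `world_size` below are assumptions reconstructed from the traceback, not the actual Aceso code): with 3 nodes × 2 GPUs the world size is 6, so valid rank ids are 0–5, yet a process group containing rank 6 is being built — which suggests the generated parallelism plan expects more ranks than the launch configuration provides.

```python
# Hypothetical reconstruction of the check in mpu/initialize.py that raises.
# bitmap() allocates one bit per rank in the world, then rejects any rank id
# that falls outside the bitmap, which is exactly the error in the traceback.
def bitmap(ranks, world_size=6):  # 3 nodes x 2 GPUs -> ranks 0..5
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

print(bitmap([0, 1, 2]))  # valid group: all ids are within 0..5
try:
    bitmap([4, 5, 6])     # rank 6 does not fit in a 6-rank world
except ValueError as e:
    print(e)              # rank 6 out of range (6)
```

If this reading is right, the fix is on the configuration side: the plan passed to `initialize_model_parallel_flexpipe()` must be regenerated for a world size of 6 (or the job launched with the world size the plan was searched for).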