
ncclSystemError with multi-node multi-gpu training #6

@syncdoth

Description

Problem Description

I am trying to run multi-node distributed training with PyTorch. More specifically, I am using torchrun as the distributed launcher, together with DeepSpeed. The code works fine in the single-node, multi-GPU setting, but an NCCL error occurs as soon as multiple nodes are used.

Launch Script

The launch script looks like this:

# tcloud_run.sh
GPU_PER_NODE=$1
torchrun --nproc_per_node $GPU_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
    -m <my python script> <my args>

and in tuxiv.conf, the entrypoint is sh tcloud_run.sh 2 (i.e. 2 GPUs per node).
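
For reference, the script assumes MASTER_ADDR, MASTER_PORT, and the SLURM_* variables are already exported by the batch wrapper. A minimal sketch of how they could be derived (illustrative only; this is not the actual tcloud wrapper):

# sketch of an sbatch wrapper (assumed setup, not the actual tcloud one)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13819              # any free port, identical on all nodes
srun sh tcloud_run.sh 2               # 1 task per node, so SLURM_PROCID == node rank

With one task per node (Ntasks per node:= 1 in the log below), SLURM_PROCID ranges over 0..nnodes-1, which is exactly what torchrun expects for --node_rank.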

Environment Setup

# cuda
export CUDA_HOME=/mnt/data/cuda/cuda-11.3.1
export LD_LIBRARY_PATH=/mnt/data/cuda/cuda-11.3.1/lib64:$LD_LIBRARY_PATH
export PATH=/mnt/data/cuda/cuda-11.3.1/bin:$PATH
# nccl
export NCCL_SOCKET_IFNAME=eth0
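
Since NCCL_SOCKET_IFNAME pins NCCL's bootstrap (and socket-transport) interface, it is worth checking what eth0 actually carries on each node before launching. A quick sanity check with standard tools (nothing cluster-specific assumed):

# confirm which subnet eth0 is on and which RDMA devices are visible
ifconfig eth0                          # on these nodes: 172.17.0.x, a Docker-bridge-style subnet
ibv_devinfo | grep -E 'hca_id|state'   # lists the mlx5_* devices NCCL probes for NET/IB

In the log below, eth0 sits on 172.17.0.0/16 while the nodes address each other as 10.0.4.x.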

Debug Log

Attached below is the Slurm log with NCCL_DEBUG=INFO set. Note that output from the two nodes (and, in the traceback at the end, from the two local ranks) is interleaved.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Nodelist:=  10-0-4-[10-11]
Number of nodes:=  2
Ntasks per node:=  1
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
MASTER_PORT=13819
WORLD_SIZE=2
MASTER_ADDR=10-0-4-10
NCCL_SOCKET_IFNAME=eth0
0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.4.10  netmask 255.255.255.0  broadcast 10.0.4.255
        ether 6e:7c:11:bc:44:4c  txqueuelen 1000  (Ethernet)
        RX packets 131917859378  bytes 195829243616837 (195.8 TB)
        RX errors 0  dropped 11298050  overruns 0  frame 0
        TX packets 133110220747  bytes 197254884153813 (197.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.11  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:0b  txqueuelen 0  (Ethernet)
        RX packets 3850508  bytes 3285467306 (3.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2899989  bytes 5647254656 (5.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 4775558139  bytes 188213232702545 (188.2 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4775558139  bytes 188213232702545 (188.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.4.11  netmask 255.255.255.0  broadcast 10.0.4.255
        ether 56:7f:10:19:19:af  txqueuelen 1000  (Ethernet)
        RX packets 132366442963  bytes 195861266566055 (195.8 TB)
        RX errors 0  dropped 7418680  overruns 0  frame 0
        TX packets 131532889938  bytes 194952543072346 (194.9 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.12  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:ac:11:00:0c  txqueuelen 0  (Ethernet)
        RX packets 792708  bytes 1213789120 (1.2 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 889655  bytes 278028450 (278.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 4980505670  bytes 186223504814541 (186.2 TB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4980505670  bytes 186223504814541 (186.2 TB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Run started at:-
Fri Mar 24 03:18:49 UTC 2023
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-03-24 03:20:15,284] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
10-0-4-10:74872:74872 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
10-0-4-10:74872:74872 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-10:74872:74872 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.11<0>
10-0-4-10:74872:74872 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
10-0-4-10:74873:74873 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.11<0>
10-0-4-11:62452:62452 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.12<0>
10-0-4-11:62453:62453 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.12<0>
10-0-4-10:74873:74873 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62452:62452 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62453:62453 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
10-0-4-11:62452:62452 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.12<0>
10-0-4-11:62452:62452 [0] NCCL INFO Using network IB
10-0-4-10:74873:74873 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.11<0>
10-0-4-10:74873:74873 [1] NCCL INFO Using network IB
10-0-4-11:62453:62453 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_49:1/RoCE [3]mlx5_39:1/RoCE [4]mlx5_67:1/RoCE [5]mlx5_29:1/RoCE [6]mlx5_57:1/RoCE [7]mlx5_19:1/RoCE [8]mlx5_47:1/RoCE [9]mlx5_37:1/RoCE [10]mlx5_65:1/RoCE [11]mlx5_12:1/RoCE [12]mlx5_27:1/RoCE [13]mlx5_55:1/RoCE [14]mlx5_17:1/RoCE [15]mlx5_113:1/RoCE ; OOB eth0:172.17.0.12<0>
10-0-4-11:62453:62453 [1] NCCL INFO Using network IB
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74873:74971 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
10-0-4-10:74873:74971 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00/02 :    0   1   2   3
10-0-4-11:62452:62548 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01/02 :    0   1   2   3
10-0-4-10:74872:74954 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/-1/-1->0->2
10-0-4-11:62453:62550 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
10-0-4-10:74872:74954 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00 : 3[1e000] -> 0[1b000] [receive] via NET/IB/1
10-0-4-11:62452:62548 [0] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] [receive] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01 : 3[1e000] -> 0[1b000] [receive] via NET/IB/1
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 00 : 0[1b000] -> 1[1c000] via direct shared memory
10-0-4-10:74872:74954 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 1(=1c000)
10-0-4-10:74872:74954 [0] NCCL INFO Channel 01 : 0[1b000] -> 1[1c000] via direct shared memory
10-0-4-11:62452:62548 [0] NCCL INFO Channel 01 : 1[1c000] -> 2[1d000] [receive] via NET/IB/1
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via direct shared memory
10-0-4-11:62452:62548 [0] NCCL INFO Could not enable P2P between dev 0(=1d000) and dev 1(=1e000)
10-0-4-11:62452:62548 [0] NCCL INFO Channel 01 : 2[1d000] -> 3[1e000] via direct shared memory
10-0-4-10:74873:74971 [1] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] [send] via NET/IB/1
10-0-4-11:62453:62550 [1] NCCL INFO Channel 00 : 3[1e000] -> 0[1b000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Channel 01 : 1[1c000] -> 2[1d000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Connected all rings
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 01 : 3[1e000] -> 0[1b000] [send] via NET/IB/1
10-0-4-10:74873:74971 [1] NCCL INFO Channel 00 : 1[1c000] -> 0[1b000] via direct shared memory
10-0-4-10:74873:74971 [1] NCCL INFO Could not enable P2P between dev 1(=1c000) and dev 0(=1b000)
10-0-4-10:74873:74971 [1] NCCL INFO Channel 01 : 1[1c000] -> 0[1b000] via direct shared memory
10-0-4-11:62453:62550 [1] NCCL INFO Connected all rings
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 00 : 3[1e000] -> 2[1d000] via direct shared memory
10-0-4-11:62453:62550 [1] NCCL INFO Could not enable P2P between dev 1(=1e000) and dev 0(=1d000)
10-0-4-11:62453:62550 [1] NCCL INFO Channel 01 : 3[1e000] -> 2[1d000] via direct shared memory

10-0-4-11:62452:62548 [0] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device
10-0-4-11:62452:62548 [0] NCCL INFO transport/net_ib.cc:415 -> 2

10-0-4-10:74872:74954 [0] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device
10-0-4-10:74872:74954 [0] NCCL INFO transport/net_ib.cc:415 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport/net_ib.cc:528 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO include/net.h:22 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport/net_ib.cc:528 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO include/net.h:22 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport/net.cc:234 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO transport.cc:119 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport/net.cc:234 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO transport.cc:119 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO init.cc:778 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO init.cc:778 -> 2
10-0-4-11:62452:62548 [0] NCCL INFO init.cc:904 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO init.cc:904 -> 2
10-0-4-10:74872:74954 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
10-0-4-11:62452:62548 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
[2023-03-24 03:20:35,665] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.13B parameters
Traceback (most recent call last):
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 102, in <module>
    main()
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 57, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
Traceback (most recent call last):
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 456, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 754, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 102, in <module>
    main()
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 228, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/mnt/home/schoiaj/WORKDIR/chain-of-hindsight-torch/coh/coh_train.py", line 57, in main
    model = AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 78, in broadcast
    return torch.distributed.broadcast(tensor=tensor,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2493, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 652, in __init__
    self.model = LlamaModel(config)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 456, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 754, in _post_init_method
    dist.broadcast(param, 0, self.ds_process_group)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 228, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 78, in broadcast
    return torch.distributed.broadcast(tensor=tensor,
  File "/mnt/home/schoiaj/.Miniconda3/envs/coh_torch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
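
Reading the log: NCCL bootstraps over eth0 (172.17.0.x) but selects the mlx5_* RoCE devices for data transport ("Using network IB"), and the first failure is the ibv_modify_qp ... No such device warning, which then surfaces as ncclSystemError in both ranks' interleaved tracebacks. As a diagnostic only, not a confirmed fix, one can take the RoCE path out of the picture and see whether plain TCP works:

# diagnostic: force socket transport to rule out the failing RoCE path
export NCCL_IB_DISABLE=1
# and/or point NCCL at the interface holding the 10.0.4.x addresses
# (its name is truncated to "0:" in the ifconfig output above)

If that runs, the problem is in RoCE device selection/visibility (e.g. devices listed under NET/IB that are not actually usable from inside the container), not in the training code.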
