Training with multiple nodes #17

@TianheWu

Thanks for the great work.

I’m trying to train this model in a multi-node, multi-GPU setup. Single-node multi-GPU training works fine, but multi-node training consistently crashes with an NCCL error during the first loss computation.

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5

This happens right when the training step hits:

loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)

The training script is:

#!/bin/bash

cd src/t2i-r1/src
RUN_NAME="t2i-r1_wo-cot_pickscore_Janus-Pro-7B_multinode"

export DEBUG_MODE="true"
export LOG_PATH="./outputs/debug_t2i-r1_wo-cot_pickscore_Janus-Pro-7B.txt"
# export NCCL_DEBUG=INFO

QWEN_PATH="deepseek-ai/Janus-Pro-7B"
HF_DATASET="/home/notebook/data/group/wth/T2I-R1/data/geneval_and_t2i_data_final.json" 
OUTPUT_DIR="janus/outputs/${RUN_NAME}" 

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
torchrun --nproc_per_node="8" \
--nnodes=$WORLD_SIZE \
--node_rank=$RANK \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
/home/notebook/data/group/wth/T2I-R1/src/t2i-r1/src/open_r1/grpo.py --use_vllm False \
--deepspeed "/home/notebook/data/group/wth/T2I-R1/src/t2i-r1/configs/zero2.json" \
--output_dir $OUTPUT_DIR \
--model_name_or_path $QWEN_PATH \
--dataset_name $HF_DATASET \
--max_prompt_length 512 \
--max_completion_length 1024 \
--temperature 1.0 \
--num_generations 8 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--logging_steps 1 \
--bf16 \
--torch_dtype bfloat16 \
--report_to wandb \
--gradient_checkpointing false \
--attn_implementation flash_attention_2 \
--max_steps 800 \
--run_name $RUN_NAME \
--save_steps 200 \
--new_generations_image 1 \
--image_token_num_per_image 576 \
--cfg_weight 5 \
--reasoning_prompt_path /home/notebook/data/group/wth/T2I-R1/data/prompt/reasoning_prompt.txt \
--reward_funcs pickscore \
--beta 0.01 \
--tf32 true \
--learning_rate 1e-6 \
--pickscore_processor_path /home/notebook/data/group/wth/T2I-R1/src/t2i-r1/reward_weight/CLIP-ViT-H-14-laion2B-s32B-b79K \
--pickscore_model_path /home/notebook/data/group/wth/T2I-R1/src/t2i-r1/reward_weight/PickScore_v1
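
The error message suggests rerunning with NCCL_DEBUG=INFO, which is commented out in the script above. A minimal sketch of the extra logging that could be exported on every node before the torchrun call follows. Only NCCL_DEBUG=INFO comes from the error message; the subsystem filter, the PyTorch debug level, and the log file location under /tmp are assumptions for illustration, not settings from the original run.

```shell
# Sketch only: verbose distributed logging, exported before launching torchrun.
export NCCL_DEBUG=INFO                  # suggested by the error message itself
export NCCL_DEBUG_SUBSYS=INIT,NET       # assumption: limit output to init and network subsystems
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # assumption: extra PyTorch distributed consistency checks
# Assumption: hypothetical log location; NCCL expands %h to the hostname and %p to the PID.
export NCCL_DEBUG_FILE="/tmp/nccl_%h_%p.log"
```

With these set, each rank should write its own NCCL log, which makes it easier to see which node or network interface fails when the first collective runs.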
