
solve the RuntimeError: Tensors must be CUDA and dense #33

Open: 13416157913 wants to merge 3 commits into alibaba:main from 13416157913:main

@13416157913

1. Update Megatron-LLaMA/megatron/core/parallel_state.py
2. Update Megatron-LLaMA/megatron/optimizer/overlapped_dist_optimizer.py
3. Update Megatron-LLaMA/megatron/optimizer/distrib_optimizer.py

Add world_size to the initialize_model_parallel function to decide between the gloo and NCCL backends.
Add world_size to the save_parameter_state function to decide whether tensors are placed on the CPU or on CUDA.
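The error in the title comes from a backend constraint: NCCL collectives only accept CUDA tensors, while gloo works with CPU tensors. The sketch below shows that decision in isolation. `collective_device` and `checkpoint_gather_devices` are hypothetical helpers, not code from this PR. They key on the backend name rather than on world_size as the PR does, and they assume the caller knows which backend the gather group in save_parameter_state uses.

```python
from typing import Tuple


def collective_device(backend: str) -> str:
    """Hypothetical helper: the device a tensor should live on before it is
    passed to a torch.distributed collective on a group using `backend`."""
    name = str(backend).lower()
    if name == "nccl":
        # NCCL operates on GPU memory only; passing it a CPU tensor is what
        # raises "RuntimeError: Tensors must be CUDA and dense".
        return "cuda"
    if name == "gloo":
        # Gloo handles CPU tensors, so no host-to-device copy is needed.
        return "cpu"
    raise ValueError(f"unsupported backend: {backend!r}")


def checkpoint_gather_devices(backend: str) -> Tuple[str, str]:
    """Return (device for the gather collective, device for torch.save).

    Gathered shards are moved back to CPU before saving, so the saved
    tensors live in host memory regardless of which backend gathered them.
    """
    return collective_device(backend), "cpu"


if __name__ == "__main__":
    for backend in ("nccl", "gloo"):
        print(backend, checkpoint_gather_devices(backend))
```

In PyTorch code this might be used roughly as `device = collective_device(torch.distributed.get_backend(group))`: move each shard with `.to(device)` before calling `torch.distributed.gather(...)`, then call `.cpu()` on the gathered tensors before `torch.save`. Querying the group's actual backend is one alternative to inferring it from world_size.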
@li-yi-dong (Collaborator)

What problem is this PR trying to solve?

@13416157913 (Author)

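@li-yi-dong It fixes the error raised when saving a checkpoint after training finishes, in multi-node distributed training with the NCCL backend. It corresponds to the following issue:
#32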
