accelerate single-machine multi-GPU training fails with a port error

A walkthrough of how the problem appeared and how it was resolved:

First, generate default_config.yaml in the default way:

accelerate config

The resulting config file lives at /root/.cache/huggingface/accelerate/default_config.yaml. Answering the prompts one by one, the session looks like this:

In which compute environment are you running?
This machine                                                                                                                                
--------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?                                                                                                        
multi-GPU                                                                                                                                   
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1                                                  
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no           
Do you wish to optimize your script with torch dynamo?[yes/NO]:no                                                                           
Do you want to use DeepSpeed? [yes/NO]: no                                                                                                  
Do you want to use FullyShardedDataParallel? [yes/NO]: no                                                                                   
Do you want to use Megatron-LM ? [yes/NO]: no                                                                                               
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5
--------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no   

This produces the following configuration:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 4,5
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
train.sh contains the launch command (if --config_file is omitted, accelerate falls back to the default config under .cache):

accelerate launch --config_file <path to your .yaml> ../train.py
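For reference, a concrete train.sh under the setup above (config at the default path, training entry point at ../train.py) might look like this:

#!/bin/bash
# Launch ../train.py on the two GPUs (ids 4,5) selected in default_config.yaml
accelerate launch \
  --config_file /root/.cache/huggingface/accelerate/default_config.yaml \
  ../train.py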

Running train.sh produces the following error:

ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
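Before changing anything, it can help to confirm that port 29500 really is occupied. Either of these standard Linux commands (assuming ss or lsof is available) shows the process holding it:

# show the process listening on port 29500, if any
ss -ltnp | grep 29500
# or
lsof -i :29500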

Then set main_process_port in default_config.yaml:

main_process_port: 0

With this change the error goes away, but the program hangs before the model weights start loading.
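As a side note, the same port setting can be supplied on the command line instead of editing the config file; the error message itself suggests the --main_process_port flag, and 0 tells accelerate to pick the next open port automatically (single node only):

accelerate launch \
  --config_file /root/.cache/huggingface/accelerate/default_config.yaml \
  --main_process_port 0 \
  ../train.py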

Solution: pass the machine's IP address in train.sh (replace **.***.***.*** with the machine's actual IP; as before, --config_file falls back to the default file under .cache if omitted):

accelerate launch --config_file <path to your .yaml> --machine_rank=0 --main_process_ip=**.***.***.*** ../train.py
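If you are not sure which address to use, the machine's IP addresses can be listed with standard tools, for example:

# print the host's IP addresses; pick the one on your main network interface
hostname -I
# or
ip addr show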

Problem solved!
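As a sketch of an alternative (not what was tested here), the same two settings can also be written directly into default_config.yaml, since main_process_ip and main_process_port are accepted keys in accelerate config files; the IP is left as a placeholder:

main_process_ip: <your machine IP>   # same address as --main_process_ip above
main_process_port: 0                 # 0 = automatically pick the next open port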
