Here is a reproduction of how the problem occurred and how it was solved.
First, generate default_config.yaml the default way:
accelerate config
The resulting config file is written to /root/.cache/huggingface/accelerate/default_config.yaml and is built up step by step from the interactive prompts, which went as follows:
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5
--------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
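(An optional aside, not part of the original run: if you would rather keep the file somewhere other than the ~/.cache default, accelerate lets you choose the output path at config time, and you then pass the same path when launching:

accelerate config --config_file ./my_config.yaml
)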
The resulting configuration looks like this:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 4,5
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
accelerate launch --config_file <path to your .yaml (if omitted, the default one under .cache is used)> ../train.py
Running the train.sh file produced the following error:
ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
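Port 29500 is PyTorch's default master port for distributed rendezvous, so a stale or concurrent training job is the usual culprit. As a quick check (a general diagnostic on a Linux machine with ss or lsof installed; not part of the original post), you can see which process is holding the port:

# Either command shows the process bound to 29500:
ss -ltnp | grep ':29500'
lsof -i :29500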
I then set main_process_port in default_config.yaml:
main_process_port: 0
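Equivalently, the port can be passed on the command line instead of editing the config file; the flag name comes straight from the error message above (a sketch, same placeholder path as before):

accelerate launch --config_file <path to your .yaml> --main_process_port 0 ../train.py

A fixed free port (e.g. 29501) also works if you want the port to be predictable across runs.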
With the port set to 0, the original error disappeared, but the program hung before it started loading the model weights.
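For hangs like this (a general debugging tip, not from the original post), turning on NCCL's logging usually shows where the rendezvous stalls:

# NCCL_DEBUG=INFO makes each rank print its init and communication steps:
NCCL_DEBUG=INFO accelerate launch --config_file <path to your .yaml> ../train.py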
Solution: add the machine's IP address in train.sh:
accelerate launch --config_file <path to your .yaml (if omitted, the default one under .cache is used)> --machine_rank=0 --main_process_ip=**.***.***.*** (the machine's IP here) ../train.py
That solved it!
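For reference, a minimal train.sh putting the pieces together might look like this (a sketch under the assumptions above: default config path, single node; hostname -I is Linux-specific, and its first entry is taken as the machine's IP):

#!/bin/bash
# Use the machine's first reported IP as the main process address.
MAIN_IP=$(hostname -I | awk '{print $1}')
accelerate launch \
  --config_file /root/.cache/huggingface/accelerate/default_config.yaml \
  --machine_rank=0 \
  --main_process_ip="${MAIN_IP}" \
  ../train.py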