Here is a reproduction of how the problem occurred and how it was solved.
First, generate default_config.yaml the default way:
accelerate config
The resulting config file is written to /root/.cache/huggingface/accelerate/default_config.yaml and is built up step by step from the interactive prompts, which went as follows:
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:4,5
--------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
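(An optional aside, not part of the original run: if you would rather keep the file somewhere other than the ~/.cache default, accelerate lets you choose the output path at config time, and you then pass the same path when launching:

accelerate config --config_file ./my_config.yaml
)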
The resulting configuration looks like this:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 4,5
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
accelerate launch --config_file <path to your .yaml (if omitted, the default one under .cache is used)> ../train.py
Running the train.sh file produced the following error:
ConnectionError: Tried to launch distributed communication on port `29500`, but another process is utilizing it. Please specify a different port (such as using the `--main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
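Port 29500 is PyTorch's default master port for distributed rendezvous, so a stale or concurrent training job is the usual culprit. As a quick check (a general diagnostic on a Linux machine with ss or lsof installed; not part of the original post), you can see which process is holding the port:

# Either command shows the process bound to 29500:
ss -ltnp | grep ':29500'
lsof -i :29500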
I then set main_process_port in default_config.yaml:
main_process_port: 0
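Equivalently, the port can be passed on the command line instead of editing the config file; the flag name comes straight from the error message above (a sketch, same placeholder path as before):

accelerate launch --config_file <path to your .yaml> --main_process_port 0 ../train.py

A fixed free port (e.g. 29501) also works if you want the port to be predictable across runs.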
With the port set to 0, the original error disappeared, but the program hung before it started loading the model weights.
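For hangs like this (a general debugging tip, not from the original post), turning on NCCL's logging usually shows where the rendezvous stalls:

# NCCL_DEBUG=INFO makes each rank print its init and communication steps:
NCCL_DEBUG=INFO accelerate launch --config_file <path to your .yaml> ../train.py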
Solution: add the machine's IP address in train.sh:
accelerate launch --config_file <path to your .yaml (if omitted, the default one under .cache is used)> --machine_rank=0 --main_process_ip=**.***.***.*** (the machine's IP here) ../train.py
That solved it!
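For reference, a minimal train.sh putting the pieces together might look like this (a sketch under the assumptions above: default config path, single node; hostname -I is Linux-specific, and its first entry is taken as the machine's IP):

#!/bin/bash
# Use the machine's first reported IP as the main process address.
MAIN_IP=$(hostname -I | awk '{print $1}')
accelerate launch \
  --config_file /root/.cache/huggingface/accelerate/default_config.yaml \
  --machine_rank=0 \
  --main_process_ip="${MAIN_IP}" \
  ../train.py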