Problem 1
Launch command:
CUDA_VISIBLE_DEVICES=2 accelerate launch \
    --multi_gpu \
    --mixed_precision 'fp16' \
    --machine_rank 0 \
    --main_process_ip 125.216.241.108 \
    --main_process_port 7236 \
    --num_machines 2 --num_processes 2 \
    ./diff_prompter/ppo_prompter_ft.py \
    --data ./data \
    --gpt_path ./gpt2 \
    --trl_config ./diff_prompter/configs/ppo_config_ft.yml \
    --checkpoint_dir ./ckpt_dir_ft
Error output:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Analysis
num_machines is the number of machines, and num_processes is the number of processes (i.e. GPUs). I want single-machine multi-GPU training, so it should be num_machines=1 and num_processes=n, where n is the number of GPUs needed.
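As a quick sanity check on these flags, the number of GPUs actually visible to the launcher can be read off CUDA_VISIBLE_DEVICES; in the usual one-process-per-GPU setup, num_processes is set to at most that count. A minimal pure-Python sketch (the helper name is hypothetical, not part of accelerate):

```python
import os

def visible_gpu_count(env=os.environ):
    """Hypothetical helper: count the GPUs exposed through
    CUDA_VISIBLE_DEVICES. An unset variable means every GPU is
    visible; the count then depends on the machine, so return None."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None                # unset: all devices visible
    ids = [d for d in value.split(",") if d.strip()]
    return len(ids)                # empty string hides every GPU -> 0

# CUDA_VISIBLE_DEVICES=2 exposes exactly one physical device (index 2);
# --num_processes is normally no larger than this count.
assert visible_gpu_count({"CUDA_VISIBLE_DEVICES": "2"}) == 1
assert visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}) == 2
assert visible_gpu_count({"CUDA_VISIBLE_DEVICES": ""}) == 0
```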
Solution
Change the launch command to:
CUDA_VISIBLE_DEVICES=2 accelerate launch \
    --multi_gpu \
    --mixed_precision 'fp16' \
    --machine_rank 0 \
    --main_process_ip 125.216.241.108 \
    --main_process_port 7236 \
    --num_machines 1 --num_processes 2 \
    ./diff_prompter/ppo_prompter_ft.py \
    --data ./data \
    --gpt_path ./gpt2 \
    --trl_config ./diff_prompter/configs/ppo_config_ft.yml \
    --checkpoint_dir ./ckpt_dir_ft
Problem 2
The program now runs, but hangs at some point with only this output:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-04-10 13:37:01,030] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-10 13:37:01,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Analysis
Debug the program to find which statement it hangs on. It turns out to be:
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)
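Why would this call hang? init_process_group blocks until world_size ranks have rendezvoused at MASTER_ADDR:MASTER_PORT, and under accelerate launch every spawned process executes this line with its own hard-coded rank and world_size, so the rendezvous state no longer matches what the launcher set up. A toy model of that blocking behavior (illustrative only, not the real TCPStore):

```python
import threading

class TinyRendezvous:
    """Toy model of the rendezvous behind init_process_group: the call
    only returns once `world_size` ranks have checked in at the same
    address/port. If the participants' expectations disagree, some
    caller waits forever -- i.e. the script appears to hang."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.arrived = 0
        self.cond = threading.Condition()

    def join(self, timeout=None):
        # Register this rank, then block until everyone has arrived.
        with self.cond:
            self.arrived += 1
            self.cond.notify_all()
            return self.cond.wait_for(
                lambda: self.arrived >= self.world_size, timeout
            )

# A lone caller whose settings don't match the other side is never
# released -- the analogue of hanging inside init_process_group.
alone = TinyRendezvous(world_size=2)
assert alone.join(timeout=0.2) is False

# When the expected number of ranks shows up, everyone returns.
ok = TinyRendezvous(world_size=2)
t = threading.Thread(target=ok.join)
t.start()
assert ok.join(timeout=2.0) is True
t.join()
```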
Solution
Comment out that statement and the related lines, i.e. the following:
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12385'
torch.distributed.init_process_group(backend='gloo', rank=0, world_size=1)
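If deleting these lines outright is undesirable (for example, the script should still run standalone without accelerate launch), an alternative is to apply the environment defaults only when a launcher has not already exported its own values. A sketch under that assumption (the helper name is hypothetical):

```python
import os

def setdefault_dist_env(env=os.environ):
    """Hypothetical helper: fill in single-process defaults only when a
    launcher (accelerate / torchrun) has not already exported them,
    instead of unconditionally overwriting the launcher's values as the
    original script did."""
    env.setdefault("LOCAL_RANK", "0")
    env.setdefault("MASTER_ADDR", "localhost")
    env.setdefault("MASTER_PORT", "12385")
    return env

# Standalone run: nothing is set, so the defaults apply.
standalone = setdefault_dist_env({})
assert standalone["MASTER_PORT"] == "12385"

# Under accelerate launch, the values it exported are left untouched.
launched = setdefault_dist_env({"LOCAL_RANK": "1",
                                "MASTER_ADDR": "125.216.241.108",
                                "MASTER_PORT": "7236"})
assert launched["MASTER_PORT"] == "7236"
```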