[Linux Multi-Machine Multi-GPU Training, Step 4] Errors During Training

Error 1

If the following error appears during training:
[2023/06/05 10:56:42] ppcls INFO: [Train][Epoch 1/200][Iter: 0/151]lr(CosineAnnealingDecay): 0.00100000, top1: 0.03125, top5: 1.00000, CELoss: 1.79536, loss: 1.79536, batch_cost: 35.73629s, reader_cost: 0.38432, ips: 0.89545 samples/s, eta: 12 days, 11:47:15
[2023/06/05 11:31:35] ppcls INFO: [Train][Epoch 1/200][Iter: 100/151]lr(CosineAnnealingDecay): 0.00099997, top1: 0.69431, top5: 1.00000, CELoss: 1.07162, loss: 1.07162, batch_cost: 20.96136s, reader_cost: 0.00091, ips: 1.52662 samples/s, eta: 7 days, 7:15:36
[2023/06/05 11:48:19] ppcls INFO: [Train][Epoch 1/200][Avg]top1: 0.71164, top5: 1.00000, CELoss: 1.02763, loss: 1.02763
Traceback (most recent call last):
  File "tools/train.py", line 35, in <module>
    engine.train()
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/engine.py", line 372, in train
    acc = self.eval(epoch_id)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/paddle/fluid/dygraph/base.py", line 375, in _decorate_function
    return func(*args, **kwargs)
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/engine.py", line 451, in eval
    eval_result = self.eval_func(self, epoch_id)
  File "/media/uvtec/WORK/PaddleClas/ppcls/engine/evaluation/classification.py", line 99, in classification_eval
    paddle.distributed.all_gather(pred_list, out)
  File "/home/uvtec/miniconda3/envs/paddle_clas_muti/lib/python3.8/site-packages/paddle/distributed/collective.py", line 953, in all_gather
    task = group.process_group.all_gather(tensor, out)
OSError: (External) NCCL error(5), invalid usage. Detail: Resource temporarily unavailable
Please try one of the following solutions:

  1. export NCCL_SHM_DISABLE=1;
  2. export NCCL_P2P_LEVEL=SYS;
  3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.

[Hint: Please search for the error code(5) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /paddle/paddle/fluid/platform/device/gpu/nccl_helper.h:117)
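For reference, the third suggestion in the hint only applies when training runs inside a Docker container: the container's /dev/shm is too small for NCCL's shared-memory transport, so the container has to be restarted with a larger shared-memory size. A minimal sketch, assuming the container image name is a placeholder for whatever Paddle image you actually use (note that the Docker flag is spelled --shm-size):

docker run --gpus all --shm-size=2g -it <your_paddle_image> /bin/bash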

Solution 1

In that case, apply one of the solutions suggested in the hint.
On both the master and the worker machines, enter the following directly in the command line:

export NCCL_SHM_DISABLE=1
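Note that export only affects the current shell session, so the variable has to be set in the same terminal that launches training on every node, before the launch command is run. A minimal sketch of how this fits into the multi-node launch (the IP list, GPU IDs, and config path are placeholders for your own setup):

# Run on every node (master and workers), in the shell that starts training
export NCCL_SHM_DISABLE=1

# Then launch training as usual, for example:
python -m paddle.distributed.launch \
    --ips="192.168.1.10,192.168.1.11" \
    --gpus="0,1" \
    tools/train.py -c ppcls/configs/your_config.yaml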