Fixing the Kaldi training error CUDA error: 'out of memory' when the GPU runs short (tested and working)

关彼得

Last modified 2022-04-25 09:58:16


First published 2021-07-28 16:50:25

Article link: https://blog.csdn.net/qq_43744723/article/details/119183173

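Training the Kaldi mobvoihotwords recipe failed with an error:
[Screenshot: terminal error output]
The terminal message said the details were recorded in exp/chain/tdnn_1a/log/train.1.1.log, so I opened that log file and found the error below. It says the GPU does not have enough memory, and it also suggests how to fix the problem.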

ERROR (nnet3-chain-train[5.5]:AllocateNewRegion():cu-allocator.cc:491)
Failed to allocate a memory region of 8388608 bytes. Possibly this is
due to sharing the GPU. Try switching the GPUs to exclusive mode
(nvidia-smi -c 3) and using the option --use-gpu=wait to scripts like
steps/nnet3/chain/train.py. Memory info: free:14M, used:1985M,
total:2000M, free/total:0.00740509 CUDA error: 'out of memory'
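
To pull just the memory figures out of a long training log, a small filter like the one below can help. It is a minimal sketch: the function name extract_mem_info is made up for illustration, and the pattern assumes the "Memory info" wording shown in the error above.

```shell
#!/usr/bin/env bash
# Print only the "Memory info" part of Kaldi allocator errors read from stdin.
# extract_mem_info is a hypothetical helper name, not part of Kaldi.
extract_mem_info() {
  grep -o 'Memory info: free:[0-9]*M, used:[0-9]*M, total:[0-9]*M'
}

# Sample line copied from the error message above.
sample='steps/nnet3/chain/train.py. Memory info: free:14M, used:1985M, total:2000M, free/total:0.00740509'
printf '%s\n' "$sample" | extract_mem_info
```

Against the real log it would be run as: extract_mem_info < exp/chain/tdnn_1a/log/train.1.1.log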

The fix is as follows (tested and working):

The error appeared while running stage 13 of run.sh (see the screenshot).
[Screenshot: stage 13 of run.sh]

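1. Change the GPU compute mode:

sudo nvidia-smi -c 3

sudo is required; without it the command fails.
[Screenshot: the command succeeding]
Note! The setting does not survive a shutdown, so the command has to be run again after every boot.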
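
For reference, the number passed to -c selects one of the compute modes listed in the nvidia-smi help text. The sketch below maps the numbers to those names and shows one way to check the current mode; the function names compute_mode_name and show_compute_mode are made up for illustration, and the query only works where the NVIDIA driver is installed.

```shell
#!/usr/bin/env bash
# Map the numeric argument of `nvidia-smi -c` to the compute mode name.
# compute_mode_name and show_compute_mode are hypothetical helper names.
compute_mode_name() {
  case "$1" in
    0) echo "DEFAULT" ;;
    1) echo "EXCLUSIVE_THREAD" ;;   # deprecated in current drivers
    2) echo "PROHIBITED" ;;
    3) echo "EXCLUSIVE_PROCESS" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Print the current compute mode of every GPU, if nvidia-smi is available.
show_compute_mode() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv
  else
    echo "nvidia-smi not found" >&2
    return 1
  fi
}

compute_mode_name 3
```

To go back to the shared mode after training, sudo nvidia-smi -c 0 restores DEFAULT.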

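2. Edit local/chain/run_tdnn.sh

Change --use-gpu=true \
to --use-gpu=wait \
Note: do not add a comment when making this change! --use-gpu=wait is an argument being passed to the training script, and adding a comment there triggers other errors that stop the run (the option sits inside a multi-line command joined by backslashes, so a comment most likely breaks the line continuation).
Then run run.sh again.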
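
The same edit can be made from the command line. This is a minimal sketch, assuming the option appears in the script exactly as --use-gpu=true; the function name set_use_gpu_wait is made up for illustration. It rewrites the value in place, so no comment is added and the trailing backslash stays untouched.

```shell
#!/usr/bin/env bash
# Switch --use-gpu=true to --use-gpu=wait in the given script.
# A backup of the original is kept next to it with a .bak suffix.
# set_use_gpu_wait is a hypothetical helper name.
set_use_gpu_wait() {
  script="$1"
  sed -i.bak 's/--use-gpu=true/--use-gpu=wait/' "$script"
  # Show the resulting line(s) so the change can be checked by eye.
  grep -n -- '--use-gpu' "$script"
}

# Demonstration on a throwaway file that imitates a multi-line command.
demo="$(mktemp)"
printf 'steps/nnet3/chain/train.py \\\n  --use-gpu=true \\\n  --dir=exp/chain/tdnn_1a\n' > "$demo"
set_use_gpu_wait "$demo"
rm -f "$demo" "$demo.bak"
```

For the recipe itself the call would be: set_use_gpu_wait local/chain/run_tdnn.sh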

Problem solved!
[Screenshot]

Addendum: to watch GPU usage in real time on Ubuntu, use the commands below. The number after -n is the refresh interval in seconds.

watch -n 1 nvidia-smi

watch -n 0.1 nvidia-smi
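
If only the memory numbers matter, nvidia-smi can also print them as CSV, which is easier to read in a script than the full table. The sketch below formats such rows; the function name gpu_free_report is made up for illustration, and the sample row reuses the figures from the error message above.

```shell
#!/usr/bin/env bash
# Read "index, free, total" CSV rows (values in MiB) from stdin and
# print the free memory per GPU together with the free percentage.
# gpu_free_report is a hypothetical helper name.
gpu_free_report() {
  awk -F', *' '{ printf "GPU %s: %d MiB free of %d MiB (%.1f%%)\n", $1, $2, $3, 100 * $2 / $3 }'
}

# Live usage (needs the NVIDIA driver):
#   nvidia-smi --query-gpu=index,memory.free,memory.total \
#              --format=csv,noheader,nounits | gpu_free_report
printf '0, 14, 2000\n' | gpu_free_report
```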

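Addendum: what happens with --use-gpu=wait? How the values true and wait of use-gpu differ

With 2 GPUs and 3 concurrent jobs, only one of the GPUs ends up being used: while the first job holds that GPU, the other two jobs wait. The second GPU is occupied by another compute process, and because the compute mode is set to Exclusive Process, no other compute process is allowed to use it.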
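
The waiting behavior described above can be pictured with a toy simulation. This is not Kaldi's code: the lock directory standing in for a single GPU in Exclusive Process mode, the run_job function, and the sleep times are all made up for illustration. Three jobs start together, but only one holds the "GPU" at a time while the others keep retrying.

```shell
#!/usr/bin/env bash
# Toy simulation only: one "GPU" is modeled as a lock directory, and
# three jobs compete for it. Nothing here calls Kaldi or CUDA.
lock="$(mktemp -d)/gpu0.lock"
events="$(mktemp)"

run_job() {
  # Retry until the lock is acquired, the way a waiting job keeps
  # retrying until the device becomes free.
  until mkdir "$lock" 2>/dev/null; do sleep 0.05; done
  echo "start $1" >> "$events"
  sleep 0.2                      # stand-in for the training work
  echo "end $1" >> "$events"
  rmdir "$lock"                  # release the "GPU" for the next job
}

for j in 1 2 3; do run_job "$j" & done
wait
cat "$events"
```

The event log always shows each job finishing before the next one starts, although the order in which the jobs get their turn can vary between runs.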

Thanks to the following links for their help:
https://www.pianshen.com/article/84721873090/
https://blog.csdn.net/boyStray/article/details/89046837
