TensorFlow-GPU 2.3 + CUDA 10.2: Pitfalls I Hit with Multi-GPU Training


杰弗瑞同学


Article link: https://blog.csdn.net/weixin_37651557/article/details/109630478


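My humble lab machine has 4 x 2080 Ti GPUs and runs Windows Server 2019. The environment is Anaconda + TensorFlow 2.3 + Python 3.8. Single-GPU training ran fine, but once I tried a deeper network I got this error: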

Resource exhausted: OOM when allocating tensor with shape [ , ]

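I forget the exact dimensions, but the error means the GPU ran out of memory. So I started trying multi-GPU training. I tried three approaches: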

1. os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'. Failed: only one GPU was actually used.
2. model = keras.utils.training_utils.multi_gpu_model(M, gpus=4). Failed: the four GPUs were not detected. The exact error was:
we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/cpu:0', '/xla_gpu:0', '/xla_gpu:1', '/xla_cpu:0']. Try reducing gpus.
3. strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"]). Failed, for the same reason as in 2.
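
The mismatch reported in the error message above can be reproduced with a few lines of plain Python. This is only a minimal sketch for illustration, not the Keras implementation: `expected_devices` and `missing_devices` are hypothetical helper names, and the device list is copied from the error message. The last function, `print_tf_devices`, assumes TensorFlow 2.x is installed and uses `tf.config.list_logical_devices()` to print what the runtime itself reports; it is never called automatically.

```python
def expected_devices(gpus):
    """Short device names a setup with `gpus` GPUs would need (hypothetical helper)."""
    return ["/cpu:0"] + [f"/gpu:{i}" for i in range(gpus)]


def missing_devices(expected, available):
    """Return the expected device names that are absent from `available`."""
    available_set = {name.lower() for name in available}
    return [name for name in expected if name.lower() not in available_set]


# Device list copied from the error message above.
reported = ["/cpu:0", "/xla_gpu:0", "/xla_gpu:1", "/xla_cpu:0"]

# None of the plain '/gpu:N' names appear; only XLA devices are listed.
missing = missing_devices(expected_devices(4), reported)
print(missing)


def print_tf_devices():
    """Print the logical devices TensorFlow reports (assumes TensorFlow 2.x)."""
    import tensorflow as tf  # imported here so the rest runs without TensorFlow

    for dev in tf.config.list_logical_devices():
        print(dev.name, dev.device_type)
```

With the list from the error message, all four `/gpu:N` names come back as missing, which matches what approaches 2 and 3 complained about. Calling `print_tf_devices()` on the affected machine would show whether the runtime lists any devices of type `GPU` at all, or only `XLA_GPU` ones.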

However, when TensorFlow and cuDNN were loaded earlier, all four GPUs were in fact detected:

[screenshot]
Later I found an explanation on GitHub:
