Docker [Deployment 07] Installing tensorflow-gpu in an Image and Handling Multiple GPU Issues: Could not find cuda drivers + unable to find libcuda

```
  =========================================================
  The "tensorflow-gpu" package has been removed!

  Please install "tensorflow" instead.

  Other than the name, the two packages have been identical
  since TensorFlow 2.1, or roughly since Sep 2019. For more
  information, see: pypi.org/project/tensorflow-gpu
  =========================================================

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tensorflow-gpu
Failed to build tensorflow-gpu
```


As the message says, other than the name the two packages have been identical since TensorFlow 2.1, which means any 2.1+ installation already ships with GPU support; install `tensorflow` instead of `tensorflow-gpu`.
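A quick way to verify this (a minimal check; run it in the environment where TensorFlow will actually execute):

```bash
# Install tensorflow instead of the removed tensorflow-gpu package
pip install tensorflow

# An empty list here means this build cannot see any GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```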


## 2. Using the GPU from Docker


GPU models and driver versions vary from machine to machine; the driver and CUDA versions in this environment are as follows:



```bash
[root@localhost ~]# nvidia-smi
```

Output (header only):

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
```


### 2.1 Could not find cuda drivers



The error:

```
I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
```


The program inside the Docker container cannot see the CUDA environment. You can try the following steps to resolve this:


1. Check the CUDA version: first confirm that CUDA is correctly installed on the host by running `nvcc --version`.
2. Use an NVIDIA Docker image: NVIDIA provides pre-configured Docker images that already contain CUDA and the other required libraries; they can serve as the base image of a Dockerfile (see the sketch after this list).
3. Set environment variables: in the Dockerfile, use the `ENV` instruction. For example, if CUDA is installed under `/usr/local/cuda`, add this line to the Dockerfile: `ENV PATH /usr/local/cuda/bin:$PATH`.
4. Use nvidia-docker: nvidia-docker is a tool for running GPU-accelerated Docker containers.
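A minimal Dockerfile sketch combining options 2 and 3; the base-image tag and the package setup are assumptions for illustration, not taken from the image used in this article:

```dockerfile
# Option 2: start from an NVIDIA base image that already ships CUDA + cuDNN
# (assumed tag; pick one matching your driver/CUDA combination)
FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04

# Option 3: put the CUDA toolchain on the search paths
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Python is not included in this base image
RUN apt-get update && apt-get install -y python3 python3-pip \
    && pip3 install tensorflow
```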


Checking the CUDA version is necessary in any case. Since the image here came from an exported image file, options 2 and 3 were not applicable, so the environment variables were ultimately set at run time with `-e`:



Add the CUDA environment variables:

```
-e PATH=/usr/local/cuda-11.2/bin:$PATH -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH
```

Launch command:

```bash
nvidia-docker run --name deepface --privileged=true --restart=always --net="host" \
  -e PATH=/usr/local/cuda-11.2/bin:$PATH \
  -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH \
  -v /root/.deepface/weights/:/root/.deepface/weights/ \
  -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ \
  -d deepface_image
```
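To confirm the variables actually reached the container (container name as above; `nvidia-smi` is injected into the container by the NVIDIA runtime):

```bash
# Environment seen by processes inside the container
docker exec deepface env | grep -E 'PATH|LD_LIBRARY_PATH'

# The driver should also be visible from inside
docker exec deepface nvidia-smi
```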


### 2.2 was unable to find libcuda.so DSO



```
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: localhost.localdomain
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 460.27.4
```


On Linux, Docker can mount host directories into a container, and symlinks inside a mounted directory are carried into the container as well. Note, however, that the paths those symlinks point to must be accessible inside the container: if a symlink's target is not itself mounted into the container, following the symlink from inside will fail.
Reference: <https://blog.csdn.net/u013546508/article/details/88637434>. The steps that solved the problem in this environment:



1. Locate the `libcuda.so` files:

```bash
find / -name "libcuda.so*"
```

Results:

```
/usr/lib/libcuda.so
/usr/lib/libcuda.so.1
/usr/lib/libcuda.so.460.27.04
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
/usr/lib64/libcuda.so.460.27.04
```

2. Check `LD_LIBRARY_PATH`:

```bash
echo $LD_LIBRARY_PATH
```

Output:

```
/usr/local/cuda/lib64
```

3. Copy the 64-bit `libcuda.so.460.27.04` into the directory on `LD_LIBRARY_PATH` (`libcuda.so` and `libcuda.so.1` are both symlinks):

```bash
cp /usr/lib64/libcuda.so.460.27.04 /usr/local/cuda-11.2/lib64/
```

4. Create the symlinks (inside `/usr/local/cuda-11.2/lib64/`):

```bash
cd /usr/local/cuda-11.2/lib64/
ln -s libcuda.so.460.27.04 libcuda.so.1
ln -s libcuda.so.1 libcuda.so
```
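A quick sanity check that the copy and links ended up where the mounted `LD_LIBRARY_PATH` directory points:

```bash
# All three names should resolve, with the two symlinks
# chaining to libcuda.so.460.27.04
ls -l /usr/local/cuda-11.2/lib64/libcuda*
```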


### 2.3 Could not find TensorRT && Cannot dlopen some GPU libraries



```
I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
```


This problem is actually caused by the Docker image not having `TensorRT` installed. Add the install command to the Dockerfile and rebuild the image:



```dockerfile
RUN pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
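Then rebuild, assuming the Dockerfile sits in the current directory (the tag matches the image name used in the run commands above):

```bash
docker build -t deepface_image .
```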


The following approach, installing inside the running container, is not recommended, but it also works:



1. Look up the container ID:

```bash
docker ps
```

2. Enter the running container:

```bash
docker exec -it ContainerID /bin/bash
```

3. Install the package:

```bash
pip install tensorrt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

4. Commit a new image (the new image can then be exported for use elsewhere):

```bash
docker commit ContainerID imageName:version
```
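To move the committed image to another machine, a plain save/load round trip works (the archive name is just an example):

```bash
# Export the committed image to a tar archive ...
docker save -o deepface_image.tar imageName:version
# ... and load it on the target host
docker load -i deepface_image.tar
```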


Behavior after the installation:



```
root@localhost:/app# python
Python 3.8.18 (default, Sep 20 2023, 11:41:31)
[GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
```

Importing tensorflow directly still reports the TF-TRT warning:

```
>>> import tensorflow as tf
2023-10-09 10:15:55.482545: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 10:15:56.498608: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
```

Importing tensorrt first and then tensorflow works (importing tensorrt loads the TensorRT shared libraries, so TensorFlow can find them afterwards):

```
>>> import tensorrt as tr
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2023-10-09 10:16:41.452672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 11389 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:2f:00.0, compute capability: 7.5
True
```


As a workaround, add the following code to the `.py` file the container executes at startup; here it was added to `app.py`:



```python
import tensorrt as tr
import tensorflow as tf

if __name__ == "__main__":
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")
```


The file after adding the code:



```python
# 3rd party dependencies
import tensorrt as tr
import tensorflow as tf
from flask import Flask
from routes import blueprint

def create_app():
    available = tf.config.list_physical_devices('GPU')
    print(f"available:{available}")
    app = Flask(__name__)
    app.register_blueprint(blueprint)
    return app
```


Start the container:



```bash
nvidia-docker run --name deepface --privileged=true --restart=always --net="host" \
  -e PATH=/usr/local/cuda-11.2/bin:$PATH \
  -e LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH \
  -v /root/.deepface/weights/:/root/.deepface/weights/ \
  -v /usr/local/cuda-11.2/:/usr/local/cuda-11.2/ \
  -v /opt/xinan-facesearch-service-public/deepface/api/app.py:/app/app.py \
  -d deepface_image
```
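If the GPU is picked up, the `print` in `create_app` shows it in the container logs (device details will differ per machine):

```bash
# Expect a line like:
# available:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
docker logs -f deepface
```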


### 2.4 Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED



```
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:437] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:441] Memory usage: 1100742656 bytes free, 15843721216 bytes total.
E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:451] Possibly insufficient driver version: 460.27.4
W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:770 : UNIMPLEMENTED: DNN library is not found.
```


This is caused by cuDNN not being installed; installing it resolves the problem. Note that the cuDNN version must match the CUDA version (CUDA 11.2 here).
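One possible way to add it, sketched under the assumption that the pip-distributed build is acceptable: the `nvidia-cudnn-cu11` PyPI package targets CUDA 11.x, and its libraries land under `site-packages/nvidia/cudnn/lib`, which may additionally need to be put on `LD_LIBRARY_PATH`:

```bash
# Assumed approach: install a cuDNN build for CUDA 11.x from PyPI
pip install nvidia-cudnn-cu11 -i https://pypi.tuna.tsinghua.edu.cn/simple
```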
