Upgrading CUDA 9.0 to 10.1 and adjusting related software (TensorFlow, TensorRT, OpenCV)

Tosonw

Category: Linux. Tags: CUDA, cuda10, tensorflow, tensorrt, opencv

Article link: https://blog.csdn.net/Tosonw/article/details/93602505

System:
Ubuntu 16.04 LTS
Hardware:
GeForce GTX 1060 (6078 MiB)
Installed GPU driver:
NVIDIA-SMI 418.56 Driver Version: 418.56

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| 24%   38C    P8     5W / 130W |    244MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

CUDA 10.1

I downloaded the runfile installer:
Archived Releases: CUDA Toolkit 10.1 (Feb 2019)
https://developer.nvidia.com/cuda-10.1-download-archive-base (log in first, then open this link.)
Installation:
1. Run sudo sh cuda_10.1.105_418.39_linux.run
2. Follow the command-line prompts

│ Do you accept the above EULA? (accept/decline/quit):
│ accept  
│─────────────────────────────────────────────────────
# Installer options: driver 418.56 is already installed, so I left Driver unselected.
│ CUDA Installer
│ - [ ] Driver
│      [ ] 418.39
│ + [X] CUDA Toolkit 10.1
│   [X] CUDA Samples 10.1
│   [X] CUDA Demo Suite 10.1
│   [X] CUDA Documentation 10.1
│   Install 
│   Options
# Hit an error; according to the message, the X server has to be shut down
[INFO]: ERROR: You appear to be running an X server; please exit X before
[INFO]:        installing.  For further details, please see the section INSTALLING
[INFO]:        THE NVIDIA DRIVER in the README available on the Linux driver
[INFO]:        download page at www.nvidia.com.
# Shut down the X server
#   First log out of the Ubuntu desktop session, then press Ctrl+Alt+F1 to switch to a text console
$ sudo service lightdm stop
$ sudo init 3

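# The messages mention nouveau; installing CUDA apparently requires disabling it.
#   The CUDA installer can also install the NVIDIA driver, and the nouveau driver conflicts with the NVIDIA driver, so nouveau must be disabled before the installation can continue. Steps:
# 1) Blacklist the nouveau driver by appending the following line to /etc/modprobe.d/blacklist.conf:
sudo vim /etc/modprobe.d/blacklist.conf
blacklist nouveau
# 2) Back up the initramfs file
#    (steps 2 and 3 use the RHEL/CentOS file layout and the dracut tool; on Ubuntu the image is /boot/initrd.img-$(uname -r) and is normally regenerated with `sudo update-initramfs -u`)
sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# 3) Rebuild the initramfs file
sudo dracut -v /boot/initramfs-$(uname -r).img $(uname -r)
# 4) Check that the nouveau driver is not loaded (no output means it is not)
lsmod | grep nouveau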
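
On Ubuntu 16.04 the same steps might look like the sketch below. The dedicated file name blacklist-nouveau.conf, the extra `options nouveau modeset=0` line, and the helper names `write_nouveau_blacklist` and `nouveau_loaded` are assumptions made for illustration, not part of the steps above; `update-initramfs -u` is the stock Ubuntu tool for regenerating the initramfs.

```shell
#!/bin/sh
# Sketch: Ubuntu-style nouveau blacklist. The helper names and the
# blacklist-nouveau.conf file name are assumptions for illustration.

# Write the blacklist entries to the given file
# (defaults to a dedicated file under /etc/modprobe.d, which needs root).
write_nouveau_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-nouveau.conf}"
    printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$conf"
}

# Read `lsmod` output on stdin; succeed only if a module named nouveau
# appears in the first column (i.e. the module itself is loaded).
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Typical use, as root:
#   write_nouveau_blacklist
#   update-initramfs -u      # regenerate the initramfs on Ubuntu
#   reboot
#   lsmod | nouveau_loaded && echo "nouveau is still loaded"
```

Matching only the first column avoids a false positive that a plain `grep nouveau` can give when another module merely lists nouveau as a dependent.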

# After the installation finishes, check the CUDA version
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

cuDNN 7.5.0

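Download the matching cuDNN from the official site:
https://developer.nvidia.com/cudnn
When our project ran, it printed the warning [W] [TRT] TensorRT was compiled against cuDNN 7.5.0 but is linked against cuDNN 7.4.2
so I downloaded cuDNN v7.5.0 (Feb 25, 2019), for CUDA 10.1

# Extract directly into the target directory (the archive's top-level directory is cuda/, so the files land under /usr/local/cuda)
$ sudo tar -zxvf cudnn-10.1-linux-x64-v7.5.0.56.tgz -C /usr/local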
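
To confirm which cuDNN version actually ended up installed, the version macros in the header can be read back. The following is a minimal sketch: the helper name `cudnn_version` is hypothetical, and the default header path assumes the archive was extracted as above so that cudnn.h sits under /usr/local/cuda/include.

```shell
#!/bin/sh
# Sketch: print the cuDNN version recorded in cudnn.h as MAJOR.MINOR.PATCH.
# The function name and the default path are assumptions for illustration.
cudnn_version() {
    header="${1:-/usr/local/cuda/include/cudnn.h}"
    # cuDNN 7 headers define CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL
    # as plain integer macros; pick the third field of each #define line.
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}

# Usage:
#   cudnn_version                      # reads the default header
#   cudnn_version /path/to/cudnn.h     # or an explicit header
```

If the printed version still differs from the one TensorRT was compiled against, an older copy of the library is probably being picked up from another directory.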

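Other software

1. Upgrading OpenCV 3.4 to 4.0

OpenCV 3.4 has to be uninstalled and its leftover headers and libraries removed:

# Enter the OpenCV 3.4 build directory and uninstall
$ sudo make uninstall
# Remove leftover header files
$ sudo rm -rf /usr/local/include/opencv*
# Remove leftover library files
$ sudo rm -rf /usr/local/lib/libopencv_*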

Then install OpenCV 4.0:

# Enter the openCV4.0/build directory
$ cmake -D CMAKE_BUILD_TYPE=Release ..
# Build
$ make -j12
# Install
$ sudo make install

Linking the project then failed with undefined references:

libemotion_sdk.so: undefined reference to `cv::imread(std::string const&, int)'
libemotion_sdk.so: undefined reference to `cv::VideoCapture::VideoCapture(std::string const&, int)'

https://github.com/opencv/opencv/issues/13000
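Hint: before running cmake, add the following to CMakeLists.txt:
add_definitions(-D_GLIBCXX_USE_CXX11_ABI=0)
Note: "-D_GLIBCXX_USE_CXX11_ABI=0" selects the old (pre-C++11) libstdc++ std::string ABI. The unresolved symbols above take a plain std::string, which means libemotion_sdk.so was built against that old ABI. The flag is needed because protobuf was built with GCC 4, among other reasons.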
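
One way to see which ABI a library was built with is to look at its demangled symbols: with the new ABI, std::string shows up as `std::__cxx11::basic_string<...>`. The helper below is only a sketch of that check; the name `string_abi_of_symbols` is hypothetical, and a library that exposes no std::string in its interface also falls into the second case.

```shell
#!/bin/sh
# Sketch: classify demangled symbol names read from stdin.
# Prints "cxx11" if any symbol mentions the new-ABI std::__cxx11 namespace,
# otherwise "pre-cxx11-or-none". The function name is an assumption.
string_abi_of_symbols() {
    if grep -q 'std::__cxx11::'; then
        echo "cxx11"
    else
        echo "pre-cxx11-or-none"
    fi
}

# Usage (nm -D lists dynamic symbols, -C demangles them):
#   nm -DC libemotion_sdk.so | string_abi_of_symbols
#   nm -DC /usr/local/lib/libopencv_imgcodecs.so | string_abi_of_symbols
```

If the two libraries report different results, a link error like the one above is the expected symptom.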

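Problem:

// After updating OpenCV to 4.1, the following warning appeared:
//   [swscaler @ 0x7fe4e457a8c0] deprecated pixel format used, make sure you did set range correctly
//   The cause is that the pixel format in use has been deprecated.
// This function works around it by mapping each deprecated YUVJ format to its
// non-deprecated equivalent. The YUVJ formats imply full (JPEG) range, so the
// color range may need to be set separately after the conversion.
extern "C" {
#include <libavutil/pixfmt.h>  // AVPixelFormat and the AV_PIX_FMT_* constants
}

AVPixelFormat ConvertDeprecatedFormat(enum AVPixelFormat format)
{
	switch (format)
	{
		case AV_PIX_FMT_YUVJ420P:
			return AV_PIX_FMT_YUV420P;
		case AV_PIX_FMT_YUVJ422P:
			return AV_PIX_FMT_YUV422P;
		case AV_PIX_FMT_YUVJ444P:
			return AV_PIX_FMT_YUV444P;
		case AV_PIX_FMT_YUVJ440P:
			return AV_PIX_FMT_YUV440P;
		default:
			// Any other format is passed through unchanged.
			return format;
	}
}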

2. TensorRT

My previous TensorRT was: TensorRT-5.1.5.0.Ubuntu-16.04.5.x86_64-gnu.cuda-9.0.cudnn7.5
A TensorRT build that supports CUDA 10.1 has to be downloaded instead:
Download (login required): https://developer.nvidia.com/tensorrt
TensorRT 5.1.5.0 GA for Ubuntu 16.04 and CUDA 10.1 tar package
Extract it to a directory of your choice, then reference it from CMakeLists.txt.

3. TensorFlow

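After switching to CUDA 10.1, building the project against TensorFlow fails, because the existing libtensorflow_cc.so still tries to link the CUDA 9 libraries:

/usr/bin/ld: warning: libcublas.so.9.0, needed by /home/toson/tf_include/libtensorflow_cc.so, not found (try using -rpath or -rpath-link)

So TensorFlow has to be rebuilt.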
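
Before rebuilding, it can help to list every shared library the old binary can no longer resolve, not just the first one the linker complains about. A minimal sketch, assuming the usual `name => not found` lines printed by `ldd`; the helper name `missing_libs` is hypothetical:

```shell
#!/bin/sh
# Sketch: read `ldd` output on stdin and print each unresolved library once.
# The function name is an assumption for illustration.
missing_libs() {
    # Unresolved entries look like: "<tab>libcublas.so.9.0 => not found"
    awk '/=> not found/ { print $1 }' | sort -u
}

# Usage:
#   ldd /home/toson/tf_include/libtensorflow_cc.so | missing_libs
```

Any `.so.9.0` names in the output confirm that the binary was built against CUDA 9.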

bazel
Trying to build TensorFlow right away reports:

Cannot find bazel. Please install bazel.
Configuration finished

TensorFlow depends on bazel, so bazel has to be installed first.
I downloaded version 0.15.2: https://github.com/bazelbuild/bazel/releases/download/0.15.2/bazel-0.15.2-installer-linux-x86_64.sh
Building bazel from source also requires Java, which is a hassle, so I used the installer script:
bazel-0.15.2-installer-linux-x86_64.sh

# Install
$ chmod +x bazel-0.15.2-installer-linux-x86_64.sh
$ sudo ./bazel-0.15.2-installer-linux-x86_64.sh #--user

protobuf
protobuf is already installed on my system, but here are the build and install steps anyway:

$ tar zvxf protobuf-all-3.6.1.tar.gz
$ cd protobuf-3.6.1
$ ./configure --prefix=/usr/local/
$ make
$ make check
$ sudo make install

TensorFlow
I downloaded version 1.12: https://github.com/tosonw/tensorflow/archive/v1.12.0.tar.gz
After extracting it, enter the directory on the command line:

$ mkdir build
$ cd build
$ ../configure 
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.15.2 installed.
Please specify the location of python. [Default is /home/toson/anaconda3/bin/python]: 


Found possible Python library paths:
  /home/toson/anaconda3/lib/python3.6/site-packages
  /opt/intel/openvino_2019.1.144/python/python3.6
  /opt/intel/openvino_2019.1.144/deployment_tools/model_optimizer
  /home/toson/compile_libs/caffes/caffe_origin/python
Please input the desired Python library path to use.  Default is [/home/toson/anaconda3/lib/python3.6/site-packages]

Do you wish to build TensorFlow with Apache Ignite support? [Y/n]: n
No Apache Ignite support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: n
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.1


Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify the NCCL version you want to use. If NCCL 2.2 is not installed, then you can use version 1.3 that can be fetched automatically but it may have worse performance with multiple GPUs. [Default is 2.2]: 1.3


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]: 


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
	--config=gdr         	# Build with GDR support.
	--config=verbs       	# Build with libverbs support.
	--config=ngraph      	# Build with Intel nGraph support.
Configuration finished

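Build:

# Note: "-D_GLIBCXX_USE_CXX11_ABI=0" is needed because protobuf was built with GCC 4, among other reasons.
$ bazel build --config=opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow:libtensorflow_cc.so
# This takes quite a while; the build has succeeded when it ends with: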
INFO: Elapsed time: 1365.375s, Critical Path: 185.64s
INFO: 4956 processes: 4956 local.
INFO: Build completed successfully, 4985 total actions

If problems occur:

# A library cannot be found
Cuda Configuration Error: Cannot find cuda library libcublas.so.10.1
# Searching shows that no file has exactly that name; the library exists, only under a longer file name
$ locate libcublas.so.10.1
/usr/lib/x86_64-linux-gnu/libcublas.so.10.1.0.105
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105
# So I copied it manually under the expected name
sudo cp /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1
# Building again reports more missing libraries; copying them in the same way resolves it.
Cuda Configuration Error: Cannot find cuda library libcusolver.so.10.1
Cuda Configuration Error: Cannot find cuda library libcurand.so.10.1
Cuda Configuration Error: Cannot find cuda library libcufft.so.10.1
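
As an alternative to copying each file, symbolic links under the expected names avoid keeping duplicate copies of the libraries. This is only a sketch of that idea, not what was done above: the helper name `link_short_sonames` is hypothetical, and it assumes each library exists in the given directory as `lib<name>.so.<short version>.<something>`.

```shell
#!/bin/sh
# Sketch: for each library name, create lib<name>.so.<short_ver> as a
# symlink to the first lib<name>.so.<short_ver>.* file in the same directory.
# The function name and calling convention are assumptions for illustration.
link_short_sonames() {
    lib_dir="$1"
    short_ver="$2"
    shift 2
    for name in "$@"; do
        target="$lib_dir/lib$name.so.$short_ver"
        # Leave an existing file or link untouched.
        if [ -e "$target" ] || [ -L "$target" ]; then
            continue
        fi
        for full in "$lib_dir/lib$name.so.$short_ver".*; do
            if [ -e "$full" ]; then
                # Relative link, so it stays valid if the directory is moved.
                ln -s "$(basename "$full")" "$target"
                break
            fi
        done
    done
}

# Usage, as root:
#   link_short_sonames /usr/local/cuda-10.1/targets/x86_64-linux/lib 10.1 \
#       cublas cusolver curand cufft
```

Libraries that are not present in the directory are silently skipped, so the output of `ls -l` on the directory is worth checking afterwards.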

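After a successful build, libtensorflow_cc.so appears in the bazel-bin/tensorflow directory of the source tree.

C library: bazel build :libtensorflow.so
C++ library: bazel build :libtensorflow_cc.so

The required headers have to be copied out of the source tree:
bazel-genfiles/..., eigen/..., include/..., tf/...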

4. PyTorch

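The PyTorch C++ library (libtorch) can be downloaded directly from the official site; just extract it, no installation is needed.
A CUDA 10 based build has to be downloaded: https://pytorch.org/get-started/locally/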
https://download.pytorch.org/libtorch/cu100/libtorch-shared-with-deps-latest.zip
After downloading, extract it and reference it from CMakeLists.txt:

set(Torch_DIR /home/toson/download_libs/libtorch/share/cmake/Torch)
find_package(Torch REQUIRED)

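The program then fails to build: the tensorFromBlob() function cannot be found.
This may be because I am using PyTorch 1.1, but I could not find where to download libtorch 1.0 either.
I tried to track the problem down.
On GitHub, the function appears to have been replaced:
https://github.com/pytorch/pytorch/pull/18780
https://github.com/pytorch/pytorch/issues/15426
Note: torch::CPU and torch::CUDA have been deprecated since PyTorch 1.0; write this instead: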
torch::from_blob(img_float.data, {1, 224, 224, 3}).to(torch::kCUDA)

5. Caffe

GitHub地址：https://github.com/BVLC/caffe
Caffe's GPU build depends on CUDA, so it has to be rebuilt.
I built the GPU version:

$ cd caffe
# Dependencies
$ sudo apt-get install 
# Edit the options in Makefile.config; for example, the CPU_ONLY option could be enabled there.
$ cp Makefile.config.example Makefile.config
# I enabled USE_CUDNN and left CPU_ONLY disabled
 USE_CUDNN := 1
 # CPU_ONLY := 1
 USE_OPENCV := 0
 OPENCV_VERSION := 3
 CUDA_DIR := /usr/local/cuda
 BLAS := open
# Build
$ make clean # If the build runs into strange problems, just clean first.
$ make all -j12
# make runtest -j16
# make pycaffe
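
The Makefile.config edits above can also be scripted, which helps when the build is repeated on several machines. The helpers below are a sketch under stated assumptions: the names `set_make_option` and `disable_make_option` are hypothetical, GNU sed (`-i`, `-E`) is assumed, and every line that assigns the given key, commented out or not, is rewritten.

```shell
#!/bin/sh
# Sketch: edit "KEY := value" lines in a Caffe Makefile.config.
# Function names are assumptions for illustration; requires GNU sed.

# Uncomment (if needed) and set KEY to VALUE.
set_make_option() {
    file="$1"; key="$2"; value="$3"
    sed -i -E "s|^#?[[:space:]]*${key}[[:space:]]*:=.*|${key} := ${value}|" "$file"
}

# Comment out an active KEY assignment; already-commented lines are left alone.
disable_make_option() {
    file="$1"; key="$2"
    sed -i -E "s|^[[:space:]]*${key}[[:space:]]*:=|# &|" "$file"
}

# Usage, matching the options listed above:
#   set_make_option Makefile.config USE_CUDNN 1
#   set_make_option Makefile.config BLAS open
#   disable_make_option Makefile.config CPU_ONLY
```

Because the key must be followed directly by `:=`, setting BLAS does not touch related keys such as BLAS_INCLUDE.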