[Step-by-Step Series] Complete GPU Deep Learning Environment Setup (NVIDIA, CUDA, cuDNN, TensorFlow, XGBoost)

Xy-tech

Article link: https://blog.csdn.net/sunnycoder_xy/article/details/84705725

GPU Server Setup

Ubuntu 16.04

GPU model: NVIDIA GTX 1080 Ti

Author: 小项同学

Blog: https://blog.csdn.net/sunnycoder_xy

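Installing the NVIDIA GPU Driver

Ubuntu 16.04 ships with the third-party open-source driver nouveau enabled by default. It must be disabled before installing the NVIDIA driver; otherwise the two conflict and the NVIDIA driver cannot be installed.

Download the driver from the NVIDIA website: http://www.nvidia.cn/Download/index.aspx?lang=cn

Edit blacklist.conf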

sudo vim /etc/modprobe.d/blacklist.conf
# Append the following lines at the end of the file
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb
options nouveau modeset=0
# Save and exit, then rebuild the initramfs
sudo update-initramfs -u
# Reboot the system (the reboot is required!)
sudo reboot

Verify that nouveau is disabled. If the command prints nothing, nouveau is disabled.

lsmod | grep nouveau
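The same check can be scripted, which is handy when provisioning several machines. The following is a minimal sketch: the helper names loaded_modules and nouveau_disabled are made up for this illustration, and it assumes lsmod prints a header line followed by one module per line, with the module name in the first column.

```python
import subprocess


def loaded_modules(lsmod_output):
    """Return the set of module names found in the first column of lsmod output."""
    lines = lsmod_output.strip().splitlines()
    # The first line is the header ("Module  Size  Used by"), so skip it.
    return {line.split()[0] for line in lines[1:] if line.split()}


def nouveau_disabled(lsmod_output):
    """True if no module named nouveau is currently loaded."""
    return "nouveau" not in loaded_modules(lsmod_output)


def check_this_machine():
    """Run lsmod on the current machine and report whether nouveau is disabled."""
    output = subprocess.check_output(["lsmod"], universal_newlines=True)
    return nouveau_disabled(output)
```

Call check_this_machine() after the reboot; it should return True before you continue with the driver installation.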

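Stop the graphical session (the installation fails if you skip this). If you are in the Ubuntu desktop, press Ctrl+Alt+F1 to switch to a text console first.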

sudo service lightdm stop

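Remove any existing driver (only needed if a driver was installed before, in another version or by another method)

sudo apt-get remove 'nvidia-*'

Install the NVIDIA driver

sudo chmod a+x NVIDIA-Linux-x86_64-410.78.run
sudo ./NVIDIA-Linux-x86_64-410.78.run --no-x-check --no-nouveau-check --no-opengl-files # Skipping the OpenGL files is what avoids the login-loop problem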

Prompts during installation

# The distribution-provided pre-install script failed! Are you sure you want to continue? [Continue]
# Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. [Yes]
# Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. [No]
# Install NVIDIA's 32-bit compatibility libraries? [No]

Load the NVIDIA kernel module (optional)

sudo modprobe nvidia

Check that the driver installed successfully

nvidia-smi
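If you want the result in a script rather than as a table on screen, nvidia-smi can also print selected fields as CSV via its --query-gpu and --format options. The sketch below wraps that call; the helper names parse_gpu_query and query_gpus are invented for this example, and the choice of fields is just one reasonable selection.

```python
import subprocess

# Fields requested from nvidia-smi; the parser relies on this order.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_query(csv_text):
    """Turn 'csv,noheader' output from nvidia-smi into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Return the GPUs reported by nvidia-smi, or an empty list if the call fails."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    try:
        output = subprocess.check_output(command, universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or cannot talk to the driver.
        return []
    return parse_gpu_query(output)
```

An empty list from query_gpus() means the driver is not usable yet, which is the same situation as nvidia-smi itself reporting an error.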

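If the installation succeeded, reboot.

Possible problems:

The kernel version and the driver do not match

ERROR:Unable to load the kernel module 'nvidia.ko'......

nouveau was blacklisted but the system was not rebooted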

NVIDIA CUDA Installation Guide for Linux

Official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract

Pre-installation Actions

Verify you have a CUDA-Capable GPU

lspci | grep -i nvidia

Verify you have a Supported Version of Linux

uname -m && cat /etc/*release

Verify the System has gcc installed

gcc --version
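The pre-installation checks above can be gathered into one small report. This is only a sketch: parse_os_release and preinstall_report are hypothetical helper names, and reading /etc/os-release is an assumption (it is one of the files matched by /etc/*release on Ubuntu 16.04).

```python
import platform
import shutil


def parse_os_release(text):
    """Parse KEY=value lines, as found in /etc/os-release, into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value.strip().strip('"')
    return info


def preinstall_report(os_release_path="/etc/os-release"):
    """Collect the facts that the manual pre-installation checks look at."""
    try:
        with open(os_release_path) as handle:
            distro = parse_os_release(handle.read())
    except OSError:
        distro = {}
    return {
        "architecture": platform.machine(),    # same information as `uname -m`
        "distro": distro.get("PRETTY_NAME"),   # None if the file is missing
        "gcc": shutil.which("gcc"),            # None if gcc is not on PATH
        "lspci": shutil.which("lspci"),        # needed for the GPU check
    }
```

A value of None for gcc means the compiler still has to be installed before running the CUDA installer.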

RUNFILE Installation

Run the installer and follow the on-screen prompts:

chmod a+x cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run

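Prompts during installation

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81? [no] # the driver was already installed above
Install the CUDA 9.0 Toolkit? [yes]
Do you want to install a symbolic link at /usr/local/cuda? [yes]
Install the CUDA 9.0 Samples? [yes] # useful for testing later

While installing CUDA you may see messages like the following. They mean that some dependency libraries are missing; installing those libraries resolves the issue: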

Installing the CUDA Toolkit in /usr/local/cuda-8.0 …
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

Run the installer again; the warnings no longer appear

sudo sh cuda_9.0.176_384.81_linux.run

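Configure the environment variables by appending the two export lines below to the end of /etc/profile, then log in again or run source /etc/profile so that they take effect

sudo vim /etc/profile
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH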
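A quick way to confirm that the two variables are really set in the current shell is to inspect the environment from Python. The sketch below uses a made-up helper name, cuda_paths_configured, and assumes the /usr/local/cuda-9.0 prefix used throughout this article.

```python
import os


def cuda_paths_configured(environ, cuda_home="/usr/local/cuda-9.0"):
    """Report whether the CUDA bin and lib64 directories are on the search paths."""
    path_dirs = environ.get("PATH", "").split(":")
    lib_dirs = environ.get("LD_LIBRARY_PATH", "").split(":")
    return {
        "PATH": cuda_home + "/bin" in path_dirs,
        "LD_LIBRARY_PATH": cuda_home + "/lib64" in lib_dirs,
    }


def check_current_shell():
    """Apply the check to the environment of the running process."""
    return cuda_paths_configured(os.environ)
```

If either entry is False in a new terminal, the exports did not take effect; check that they were saved in /etc/profile and that the shell was restarted.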

Test with the CUDA samples

cd /root/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
sudo make
sudo ./deviceQuery

Installing cuDNN on Linux

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux

Installing from a Tar File

Unzip the cuDNN package.

chmod a+x cudnn-9.0-linux-x64-v7.4.1.5.tgz
tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz

Copy the following files into the CUDA Toolkit directory, and change the file permissions.

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Installing from a Debian File (preferred)

Navigate to the directory containing the cuDNN Debian files and install the packages in the order shown below.

Install the runtime library, for example:

sudo dpkg -i libcudnn7_7.4.1.5-1+cuda9.0_amd64.deb

Install the developer library, for example:

sudo dpkg -i libcudnn7-dev_7.4.1.5-1+cuda9.0_amd64.deb

Install the code samples and the cuDNN Library User Guide, for example:

sudo dpkg -i libcudnn7-doc_7.4.1.5-1+cuda9.0_amd64.deb

Verifying

To verify that cuDNN is installed and is running properly, compile the mnistCUDNN sample located in the /usr/src/cudnn_samples_v7 directory in the debian file.

Copy the cuDNN sample to a writable path.

cp -r /usr/src/cudnn_samples_v7/ $HOME

Go to the writable path.

cd $HOME/cudnn_samples_v7/mnistCUDNN

Compile the mnistCUDNN sample.

make clean && make

Run the mnistCUDNN sample.

./mnistCUDNN

If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:

Test passed!
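Besides running the sample, it can be useful to confirm which cuDNN version was actually copied in. The cuDNN 7 header defines the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, so a small parser is enough. The sketch below assumes the tar-file install location used above (/usr/local/cuda/include/cudnn.h); the function names are invented for this example.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None if not found."""
    version = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+" + macro + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        version.append(int(match.group(1)))
    return tuple(version)


def installed_cudnn_version(header_path="/usr/local/cuda/include/cudnn.h"):
    """Read the header from disk; returns None if the file is missing."""
    try:
        with open(header_path) as handle:
            return parse_cudnn_version(handle.read())
    except OSError:
        return None
```

For the packages used in this article the first two components should be 7 and 4; a result of None means the header is not at the assumed path.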

Possible errors

If the following error appears, fix it as shown below:

error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory

sudo cp /usr/local/cuda-9.0/lib64/libcusolver.so.9.0 /usr/local/lib/libcusolver.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcufft.so.9.0 /usr/local/lib/libcufft.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig
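The error above means the dynamic loader cannot find the library by name. To check whether the fix worked without starting a full TensorFlow session, you can ask the loader directly through ctypes. This is a minimal sketch; can_load and missing_cuda_libraries are hypothetical helper names, and the library list simply mirrors the four files copied above.

```python
import ctypes

# The four libraries copied in the commands above.
CUDA_LIBRARIES = [
    "libcusolver.so.9.0",
    "libcudart.so.9.0",
    "libcufft.so.9.0",
    "libcurand.so.9.0",
]


def can_load(library_name):
    """True if the dynamic loader can find and open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


def missing_cuda_libraries():
    """Return the names from CUDA_LIBRARIES that the loader still cannot open."""
    return [name for name in CUDA_LIBRARIES if not can_load(name)]
```

An empty list from missing_cuda_libraries() suggests the loader can now resolve all four libraries.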

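Installing Anaconda

chmod a+x Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh
source ~/.bashrc # Do not skip this; it applies the environment variables the installer added to .bashrc

After installation my Python was version 3.7. Use the command below to switch to Python 3.6, which is compatible with TensorFlow.

conda install python=3.6

Installing TensorFlow

The command below installs the latest release, which was tensorflow-gpu 1.12.0 at the time of writing. If you need a different TensorFlow version, see the official site: https://www.tensorflow.org/install/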

pip install --upgrade tensorflow-gpu
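Once the package is installed, it is worth checking that TensorFlow can actually see the GPU. The sketch below uses tf.test.is_gpu_available(), which exists in the TensorFlow 1.x line installed here (later releases deprecate it in favor of other calls); the wrapper name tensorflow_gpu_status is made up for this example.

```python
def tensorflow_gpu_status():
    """Return TensorFlow's version and GPU visibility, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        # Not installed, or a CUDA/cuDNN shared library failed to load on import.
        return None
    return {
        "version": tf.__version__,
        "gpu_available": bool(tf.test.is_gpu_available()),
    }
```

A result of None usually points to an installation or library path problem, while gpu_available being False means TensorFlow imported but did not find a usable CUDA device.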

Installing XGBoost

An up-to-date version of the CUDA toolkit is required.

Clone the XGBoost repository with git:

git clone --recursive https://github.com/dmlc/xgboost

From the command line on Linux starting from the XGBoost directory:

mkdir build
cd build
cmake .. -DUSE_CUDA=ON
# For multi-GPU support, use the following command instead
# cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/path/to/nccl2
make -j4

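At this point import xgboost still fails with the error below. Run the following commands to fix it

# ImportError: No module named xgboost
# Run from the XGBoost root directory (cd .. first if you are still in build/)
sh build.sh
cd python-package
python setup.py install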
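To confirm that the Python package was built with CUDA support, a tiny training run on random data is enough. The sketch below assumes the tree_method value gpu_hist, which is how XGBoost releases from this period select the GPU histogram algorithm (newer releases spell this differently); the function name and the random data are purely illustrative.

```python
def xgboost_gpu_smoke_test():
    """Train two boosting rounds on random data with the GPU tree method."""
    try:
        import numpy as np
        import xgboost as xgb
    except ImportError:
        return "xgboost (or numpy) is not importable"

    features = np.random.rand(64, 4)
    labels = (features[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(features, label=labels)
    params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
    try:
        xgb.train(params, dtrain, num_boost_round=2)
    except xgb.core.XGBoostError as exc:
        # Raised, for example, when the library was built without CUDA support.
        return "GPU training failed: {}".format(exc)
    return "GPU training OK"
```

If the message reports a failure, rebuild with -DUSE_CUDA=ON and reinstall the Python package before trying again.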

Jupyter Notebook

Generate a notebook configuration file

By default, the configuration file ~/.jupyter/jupyter_notebook_config.py does not exist. Generate it with:

jupyter notebook --generate-config

Running the command above as root produces the following message:

Running as root is not recommended. Use --allow-root to bypass.

When running as root, add the --allow-root option.

jupyter notebook --generate-config --allow-root

On success, the following message appears

Writing default config to: /root/.jupyter/jupyter_notebook_config.py

Generate a password

Open ipython and run the following

In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
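The string returned above has three colon-separated parts: the algorithm, a random salt, and a digest. The following sketch illustrates that layout with hashlib only. It reflects my understanding of how notebook.auth builds the value (digest over the passphrase followed by the salt), which is an assumption, so keep using passwd() for the real configuration; make_hash and check_hash are hypothetical names.

```python
import hashlib
import random


def make_hash(passphrase, algorithm="sha1", salt_len=12):
    """Build an 'algorithm:salt:digest' string in the same layout as the value above."""
    # The salt is salt_len hexadecimal characters, zero-padded on the left.
    salt = "%0*x" % (salt_len, random.getrandbits(4 * salt_len))
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_hash(hashed, passphrase):
    """Recompute the digest from the stored salt and compare it with the stored one."""
    algorithm, salt, digest = hashed.split(":", 2)
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return candidate == digest
```

Because the salt is random, running passwd() twice with the same password gives two different strings, and both are valid.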

Add the password to jupyter_notebook_config.py

c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'

Edit the configuration file

In jupyter_notebook_config.py, find the following lines, uncomment them, and edit them

c.NotebookApp.ip='*'
c.NotebookApp.password = u'sha:ce...' # the hash copied above
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888 # any free port; use this port when connecting

Managing kernels for different environments and versions: https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernel-install

conda install ipykernel # or pip install ipykernel
source activate env1
python -m ipykernel install --user --name env1 --display-name "env1"
source activate env2
python -m ipykernel install --user --name env2 --display-name "env2"
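After registering the kernels, you can list what Jupyter will offer in the notebook UI. The sketch below uses KernelSpecManager from the jupyter_client package, which is installed alongside the notebook; the wrapper name installed_kernels is made up for this example.

```python
def installed_kernels():
    """Return a dict of kernel name to resource directory, or None without jupyter_client."""
    try:
        from jupyter_client.kernelspec import KernelSpecManager
    except ImportError:
        return None
    return KernelSpecManager().find_kernel_specs()
```

With the commands above applied, the returned dict should include the keys env1 and env2 next to the default kernel.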