[Step-by-Step Series] Complete GPU Deep Learning Environment Setup (NVIDIA driver, CUDA, cuDNN, TensorFlow, XGBoost)

GPU Server Setup

OS: Ubuntu 16.04

GPU: Nvidia GTX 1080 Ti

Author: 小项同学

blog: https://blog.csdn.net/sunnycoder_xy

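Installing the NVIDIA GPU driver

Ubuntu 16.04 ships with the third-party open-source driver nouveau enabled by default. nouveau must be disabled before installing the NVIDIA driver; otherwise the two conflict and the NVIDIA driver cannot be installed.

Download the driver from the NVIDIA website: http://www.nvidia.cn/Download/index.aspx?lang=cn

Edit blacklist.conf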

sudo vim /etc/modprobe.d/blacklist.conf
# Append the following lines to the end of the file
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb
options nouveau modeset=0
# Save and exit, then regenerate the initramfs
sudo update-initramfs -u
# Reboot the system (this step is mandatory!)
sudo reboot

Verify that nouveau is disabled. If the command prints nothing, nouveau has been disabled.

lsmod | grep nouveau
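The same check can be scripted so that it prints an explicit result instead of relying on empty output. This is a minimal sketch: the helper name module_loaded is made up for illustration, and it assumes the usual lsmod layout in which the module name is the first column.

```shell
# Succeeds (exit status 0) if the named module appears in the first column
# of lsmod-style output read from stdin. module_loaded is a hypothetical helper.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Only run the live check when lsmod is available on this machine.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_loaded nouveau; then
        echo "nouveau is still loaded: recheck blacklist.conf and reboot"
    else
        echo "nouveau is disabled"
    fi
fi
```

Matching the first column exactly avoids false positives from other modules whose names merely contain the string.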

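Stop the graphical interface (the installer fails if you skip this). If you are in the Ubuntu desktop, first press Ctrl+Alt+F1 to switch to a text console.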

sudo service lightdm stop 

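Remove any existing driver (only needed if another version was installed before, or the driver was installed by another method)

sudo apt-get remove 'nvidia-*'  # quoted so the shell does not expand the pattern

Install the NVIDIA driver

sudo chmod a+x NVIDIA-Linux-x86_64-410.78.run
sudo ./NVIDIA-Linux-x86_64-410.78.run --no-x-check --no-nouveau-check --no-opengl-files # skipping the OpenGL files is what avoids the login-loop problem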

Prompts during installation

# The distribution-provided pre-install script failed! Are you sure you want to continue? [continue]
# Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. [Yes]
# Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. [No]
# Nvidia's 32-bit compatibility libraries? [No]

Load the NVIDIA kernel module (optional)

sudo modprobe nvidia

Check that the driver was installed successfully

nvidia-smi

If the check succeeds, reboot.
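The nvidia-smi check can also be scripted to compare the installed driver version against a minimum. The sketch below makes two assumptions: the helper name version_ge is made up for illustration, and the minimum 384.81 is simply the driver version embedded in the name of the CUDA 9.0 runfile used in this guide (cuda_9.0.176_384.81_linux.run), not a value read from the system.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum, taken from the CUDA 9.0 runfile name.
required="384.81"

# Only query the driver when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$installed" "$required"; then
        echo "driver $installed is new enough for CUDA 9.0"
    else
        echo "driver $installed is older than $required"
    fi
fi
```

With the 410.78 driver installed above, the first branch is the expected one.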

Possible problems:

  • The kernel version and the driver version do not match

  • ERROR:Unable to load the kernel module 'nvidia.ko'......

  • nouveau was blacklisted but the system was not rebooted

NVIDIA CUDA Installation Guide for Linux

Official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract

Pre-installation Actions

Verify you have a CUDA-Capable GPU

lspci | grep -i nvidia

Verify you have a Supported Version of Linux

uname -m && cat /etc/*release

Verify the System has gcc installed

gcc --version

RUNFILE Installation

Run the installer and follow the on-screen prompts:

chmod a+x cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run

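Prompts during installation

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81? [no] # the driver was already installed above
Install the CUDA 9.0 Toolkit? [yes]
Do you want to install a symbolic link at /usr/local/cuda? [yes]
Install the CUDA 9.0 Samples? [yes] # useful for the test later on

The CUDA installer may print messages like the following. They mean that some dependency libraries are missing; installing those libraries resolves the issue: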

Installing the CUDA Toolkit in /usr/local/cuda-8.0 …
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev 

Run the installer again; the warnings no longer appear.

sudo sh cuda_9.0.176_384.81_linux.run

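Configure the environment variables by appending the following paths to the end of the file

sudo vim /etc/profile
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
# Run "source /etc/profile" or log in again for the change to take effect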
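If /etc/profile is sourced more than once, plain exports like the ones above keep adding the same directory, and an empty LD_LIBRARY_PATH leaves a trailing colon in the result. A small helper avoids both. This is only a sketch: prepend_path is a hypothetical name, and it is written without bash-only features so it can live in /etc/profile.

```shell
# Print list $2 with directory $1 prepended, unless $1 is already in the list.
# An empty list yields just the directory, with no trailing colon.
prepend_path() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *) printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

PATH="$(prepend_path /usr/local/cuda-9.0/bin "$PATH")"
LD_LIBRARY_PATH="$(prepend_path /usr/local/cuda-9.0/lib64 "${LD_LIBRARY_PATH:-}")"
export PATH LD_LIBRARY_PATH
```

Sourcing the file repeatedly then leaves both variables unchanged after the first time.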

Test the CUDA samples

cd /root/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
sudo make
sudo ./deviceQuery
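deviceQuery confirms that the CUDA runtime can see the GPU. It can also be worth confirming that the nvcc found on PATH belongs to the 9.0 toolkit, which shows that the environment variables took effect. The sketch below assumes the usual "Cuda compilation tools, release 9.0, V9.0.176" line in the output of nvcc --version; cuda_release is a hypothetical helper name.

```shell
# Extract the "release X.Y" number from `nvcc --version` output on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Only run the live check when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    found="$(nvcc --version | cuda_release)"
    if [ "$found" = "9.0" ]; then
        echo "nvcc reports CUDA 9.0"
    else
        echo "nvcc reports CUDA '$found', expected 9.0: check PATH"
    fi
fi
```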

Installing cuDNN on Linux

https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux

Installing from a Tar File

Unzip the cuDNN package.

chmod a+x cudnn-9.0-linux-x64-v7.4.1.5.tgz
tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz

Copy the following files into the CUDA Toolkit directory, and change the file permissions.

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Installing from a Debian File (preferred)

Navigate to your directory containing cuDNN Debian file. (Install the packages in the order shown.)

Install the runtime library, for example:

sudo dpkg -i libcudnn7_7.4.1.5-1+cuda9.0_amd64.deb

Install the developer library, for example:

sudo dpkg -i libcudnn7-dev_7.4.1.5-1+cuda9.0_amd64.deb

Install the code samples and the cuDNN Library User Guide, for example:

sudo dpkg -i libcudnn7-doc_7.4.1.5-1+cuda9.0_amd64.deb

Verifying

To verify that cuDNN is installed and is running properly, compile the mnistCUDNN sample located in the /usr/src/cudnn_samples_v7 directory in the debian file.

Copy the cuDNN sample to a writable path.

cp -r /usr/src/cudnn_samples_v7/ $HOME

Go to the writable path.

cd  $HOME/cudnn_samples_v7/mnistCUDNN

Compile the mnistCUDNN sample.

make clean && make

Run the mnistCUDNN sample.

./mnistCUDNN

If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:

Test passed!

Possible errors

  • If the following error appears, fix it as shown:
error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory
sudo cp /usr/local/cuda-9.0/lib64/libcusolver.so.9.0 /usr/local/lib/libcusolver.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcufft.so.9.0 /usr/local/lib/libcufft.so.9.0 && sudo ldconfig 
sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig

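Installing Anaconda

chmod a+x Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh
source ~/.bashrc # do not skip this: it makes the environment variables added to .bashrc take effect

After installation my Python was version 3.7. Use the following command to switch to Python 3.6, which is compatible with TensorFlow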

conda install python=3.6

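Installing TensorFlow

The command below installs the latest release, tensorflow-gpu 1.12.0 at the time of writing. If you need another TensorFlow version, see the official site: https://www.tensorflow.org/install/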

pip install --upgrade tensorflow-gpu
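After the install, it is worth checking that TensorFlow can actually see the GPU. The sketch below wraps the check in a hypothetical helper, check_tf_gpu, that takes the Python interpreter as an optional argument. It uses tf.test.is_gpu_available(), which is part of the TensorFlow 1.x API installed here; the return code 2 for a missing TensorFlow is an arbitrary choice.

```shell
# Print the TensorFlow version and whether a GPU is usable.
# $1: Python interpreter to use (defaults to "python").
# Returns 2 when TensorFlow cannot be imported with that interpreter.
check_tf_gpu() {
    py="${1:-python}"
    if ! "$py" -c 'import tensorflow' >/dev/null 2>&1; then
        echo "tensorflow is not importable with $py"
        return 2
    fi
    "$py" -c 'import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())'
}
```

Calling check_tf_gpu with no arguments uses the python of the active conda environment. If the driver, CUDA, and cuDNN steps above all succeeded, the second printed line should be True.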

Installing XGBoost

An up-to-date version of the CUDA toolkit is required.

Clone the XGBoost repository with git:

git clone --recursive https://github.com/dmlc/xgboost

From the command line on Linux starting from the XGBoost directory:

mkdir build
cd build
cmake .. -DUSE_CUDA=ON
# For multiple GPUs, use the following command instead
# cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/path/to/nccl2
make -j4

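At this point, import xgboost still fails with the error below. Run the following commands to fix it.

ImportError: No module named xgboost # error message
cd ..  # back to the XGBoost root directory
sh build.sh
cd python-package
python setup.py install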

Jupyter Notebook

Generate a notebook configuration file

By default, the configuration file ~/.jupyter/jupyter_notebook_config.py does not exist. Generate it with:

jupyter notebook --generate-config

If you run the command above as root, it fails with:

Running as root is not recommended. Use --allow-root to bypass.

When running as root, add the --allow-root option.

jupyter notebook --generate-config --allow-root

On success, the following message appears

Writing default config to: /root/.jupyter/jupyter_notebook_config.py

Generate a password

Open ipython and run the following

In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'

Password line to add to jupyter_notebook_config.py

c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'

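Edit the configuration file

In jupyter_notebook_config.py, find the following lines, uncomment them, and change them as shown

c.NotebookApp.ip = '*'
c.NotebookApp.password = u'sha:ce...'  # the hash copied in the previous step
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888  # any free port of your choice; use this port when connecting

Managing kernels for different environments and Python versions: https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernel-install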

conda install ipykernel # or pip install ipykernel
source activate env1
python -m ipykernel install --user --name env1 --display-name "env1"
source activate env2
python -m ipykernel install --user --name env2 --display-name "env2"

The next article covers how to integrate a remote server environment with PyCharm.
