Setting Up a GPU Server
Ubuntu 16.04
GPU model: NVIDIA GTX 1080 Ti
Author: 小项同学
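Installing the NVIDIA GPU Driver
Ubuntu 16.04 ships with nouveau, the open-source third-party driver, enabled by default. It must be disabled before installing the NVIDIA driver; otherwise the two conflict and the NVIDIA driver installation fails.
Download the driver from the NVIDIA website: http://www.nvidia.cn/Download/index.aspx?lang=cn
Edit blacklist.conf
sudo vim /etc/modprobe.d/blacklist.conf
# Append the following lines to the end of the file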
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist rivatv
blacklist nvidiafb
options nouveau modeset=0
# Save and exit, then regenerate the initramfs
sudo update-initramfs -u
# Reboot the system (this step is required!)
sudo reboot
Verify that nouveau is disabled. If the command prints nothing, nouveau has been disabled.
lsmod | grep nouveau
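The same check can be scripted, which helps when several machines are being set up. The sketch below is a minimal illustration and not part of the original procedure: `module_loaded` and `nouveau_disabled` are hypothetical helper names, and the code assumes the `/proc/modules` format (one module per line, module name in the first column).

```python
from pathlib import Path


def module_loaded(name, modules_text):
    """Return True if a kernel module called `name` appears in modules_text.

    modules_text is expected in /proc/modules format: one module per line,
    with the module name as the first whitespace-separated field.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def nouveau_disabled(proc_modules="/proc/modules"):
    """Read the live module list and report whether nouveau is absent."""
    return not module_loaded("nouveau", Path(proc_modules).read_text())
```

Calling `nouveau_disabled()` after the reboot should return True when the blacklist has taken effect.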
Stop the graphical session (the installer fails otherwise). If you are in the Ubuntu desktop, first press Ctrl+Alt+F1 to switch to a text console.
sudo service lightdm stop
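Remove any existing driver (only needed if another version was installed, or a driver was installed by another method)
sudo apt-get remove 'nvidia-*'
Install the NVIDIA driver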
sudo chmod a+x NVIDIA-Linux-x86_64-410.78.run
sudo ./NVIDIA-Linux-x86_64-410.78.run -no-x-check -no-nouveau-check -no-opengl-files # skipping the OpenGL files is what avoids the login-loop problem
Installer prompts and the answers to give
# The distribution-provided pre-install script failed! Are you sure you want to continue? 【continue】
# Would you like to run the nvidia-xconfig utility to automatically update your X configuration so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. 【Yes】
# Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. 【No】
# Nvidia's 32-bit compatibility libraries? 【No】
Load the NVIDIA kernel module (optional)
sudo modprobe nvidia
Check that the driver was installed successfully
nvidia-smi
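For a scripted version of this check, `nvidia-smi` can print selected fields as CSV. The following is a rough sketch under stated assumptions: `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the query asks only for the GPU name and driver version.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, driver_version' CSV lines into a list of (name, version)."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2:
            gpus.append((parts[0], parts[1]))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU names and driver versions.

    Returns an empty list if nvidia-smi is missing or exits with an error,
    which usually means the driver is not installed or not loaded.
    """
    cmd = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, universal_newlines=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)
```

An empty list from `query_gpus()` suggests going back over the driver installation steps before continuing with CUDA.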
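If the installation succeeded, reboot.
Possible problems:
- The kernel version and the driver version do not match.
- ERROR:Unable to load the kernel module 'nvidia.ko'......
- nouveau was blacklisted but the system was not rebooted afterwards.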
NVIDIA CUDA Installation Guide for Linux
Official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract
Pre-installation Actions
Verify you have a CUDA-Capable GPU
lspci | grep -i nvidia
Verify you have a Supported Version of Linux
uname -m && cat /etc/*release
Verify the System has gcc installed
gcc --version
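The three pre-installation checks can be gathered into one small report. This is only an illustrative sketch: `preinstall_report` is a hypothetical helper, and it checks that the tools are present rather than whether the versions are supported.

```python
import platform
import shutil


def preinstall_report():
    """Collect the basic facts the pre-installation checks look at."""
    return {
        "architecture": platform.machine(),   # same value as `uname -m`
        "gcc_path": shutil.which("gcc"),      # None if gcc is not on PATH
        "lspci_path": shutil.which("lspci"),  # needed for the GPU check
    }
```

A `None` value for `gcc_path` means gcc has to be installed before running the CUDA installer.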
RUNFILE Installation
Run the installer and follow the on-screen prompts:
chmod a+x cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run
Installer prompts and the answers to give
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81? 【no】 # the driver was already installed above
Install the CUDA 9.0 Toolkit? 【yes】
Do you want to install a symbolic link at /usr/local/cuda? 【yes】
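Install the CUDA 9.0 Samples? 【yes】 # useful for testing later
While installing CUDA you may see messages like the following. They mean that some dependency libraries are missing; installing them resolves the warnings: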
Installing the CUDA Toolkit in /usr/local/cuda-8.0 …
Missing recommended library: libGLU.so
Missing recommended library: libX11.so
Missing recommended library: libXi.so
Missing recommended library: libXmu.so
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
Run the installer again; the warnings no longer appear.
sudo sh cuda_9.0.176_384.81_linux.run
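Configure the environment variables by appending the paths below to the end of the file, then run source /etc/profile (or log in again) so that they take effect.
sudo vim /etc/profile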
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH
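To confirm that the new variables are visible in a fresh shell, a short check like the one below can be run. It is a sketch only: `cuda_paths_configured` is a hypothetical helper, and the default `cuda_home` assumes the /usr/local/cuda-9.0 location used in this post.

```python
import os


def cuda_paths_configured(env, cuda_home="/usr/local/cuda-9.0"):
    """Check that PATH and LD_LIBRARY_PATH contain the CUDA directories.

    `env` is any mapping of environment variables, for example os.environ.
    Entries are split on ':' because this targets Linux.
    """
    path_dirs = env.get("PATH", "").split(":")
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(":")
    return {
        "PATH": cuda_home + "/bin" in path_dirs,
        "LD_LIBRARY_PATH": cuda_home + "/lib64" in lib_dirs,
    }
```

Calling `cuda_paths_configured(os.environ)` should report True for both entries once /etc/profile has been reloaded.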
Test the installation with the CUDA samples
cd /root/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
sudo make
sudo ./deviceQuery
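deviceQuery ends its output with a `Result = PASS` line when it can talk to the GPU. If the check needs to be automated, something like the sketch below could inspect captured output; `device_query_passed` is a hypothetical helper, and the sample text in the usage note is abbreviated, not real output.

```python
def device_query_passed(output):
    """Return True if captured deviceQuery output contains 'Result = PASS'."""
    for line in output.splitlines():
        if line.strip() == "Result = PASS":
            return True
    return False
```

For example, `device_query_passed("Detected 1 CUDA Capable device(s)\nResult = PASS\n")` returns True, while output ending in `Result = FAIL` returns False.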
Installing cuDNN on Linux
https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux
Installing from a Tar File
Unzip the cuDNN package.
chmod a+x cudnn-9.0-linux-x64-v7.4.1.5.tgz
tar -xzvf cudnn-9.0-linux-x64-v7.4.1.5.tgz
Copy the following files into the CUDA Toolkit directory, and change the file permissions.
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
Installing from a Debian File (preferred)
Navigate to your directory containing cuDNN Debian file. (Install the packages in the order shown.)
Install the runtime library, for example:
sudo dpkg -i libcudnn7_7.4.1.5-1+cuda9.0_amd64.deb
Install the developer library, for example:
sudo dpkg -i libcudnn7-dev_7.4.1.5-1+cuda9.0_amd64.deb
Install the code samples and the cuDNN Library User Guide, for example:
sudo dpkg -i libcudnn7-doc_7.4.1.5-1+cuda9.0_amd64.deb
Verifying
To verify that cuDNN is installed and is running properly, compile the mnistCUDNN sample located in the /usr/src/cudnn_samples_v7 directory in the debian file.
Copy the cuDNN sample to a writable path.
cp -r /usr/src/cudnn_samples_v7/ $HOME
Go to the writable path.
cd $HOME/cudnn_samples_v7/mnistCUDNN
Compile the mnistCUDNN sample.
make clean && make
Run the mnistCUDNN sample.
./mnistCUDNN
If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:
Test passed!
Possible errors
- If the following error appears, apply the fix below:
error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory
sudo cp /usr/local/cuda-9.0/lib64/libcusolver.so.9.0 /usr/local/lib/libcusolver.so.9.0 && sudo ldconfig
sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig
sudo cp /usr/local/cuda-9.0/lib64/libcufft.so.9.0 /usr/local/lib/libcufft.so.9.0 && sudo ldconfig
sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig
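Before copying, it can be useful to confirm that the four libraries actually exist in the CUDA lib64 directory, since a missing file there points to an incomplete toolkit install rather than a search-path problem. The sketch below is illustrative: `missing_libraries` is a hypothetical helper, and the file names are the ones used in the commands above.

```python
from pathlib import Path

# Library file names taken from the cp commands above.
CUDA_LIBS = [
    "libcusolver.so.9.0",
    "libcudart.so.9.0",
    "libcufft.so.9.0",
    "libcurand.so.9.0",
]


def missing_libraries(lib_dir, names=CUDA_LIBS):
    """Return the entries of `names` that are not present in `lib_dir`."""
    directory = Path(lib_dir)
    return [name for name in names if not (directory / name).exists()]
```

`missing_libraries("/usr/local/cuda-9.0/lib64")` returning an empty list means all four files are available to copy.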
Installing Anaconda
chmod a+x Anaconda3-5.3.0-Linux-x86_64.sh
bash Anaconda3-5.3.0-Linux-x86_64.sh
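source ~/.bashrc # do not skip this; it applies the environment variables the installer added to .bashrc
After installation my Python was version 3.7. The command below switches it to Python 3.6, which is compatible with TensorFlow.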
conda install python=3.6
Installing TensorFlow
The command below installs the latest tensorflow-gpu (1.12.0 at the time of writing). For other TensorFlow versions, see the official site: https://www.tensorflow.org/install/
pip install --upgrade tensorflow-gpu
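After the install, it is worth checking that TensorFlow can see the GPU. The sketch below is one way to do that, assuming the `device_lib` module that ships with TensorFlow; `list_tensorflow_gpus` is a hypothetical helper name.

```python
def list_tensorflow_gpus():
    """Return the GPU device names TensorFlow can see.

    Returns None if TensorFlow cannot be imported, and an empty list if it
    imports but finds no GPU (for example, when CUDA or cuDNN is not found).
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]
```

An empty list usually means the CUDA or cuDNN libraries are not on the loader path, so the earlier environment-variable step is the first thing to recheck.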
Installing XGBoost
An up-to-date version of the CUDA toolkit is required.
Clone the XGBoost repository with git:
git clone --recursive https://github.com/dmlc/xgboost
From the command line on Linux starting from the XGBoost directory:
mkdir build
cd build
cmake .. -DUSE_CUDA=ON
# For a multi-GPU build, use the command below instead
# cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/path/to/nccl2
make -j4
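At this point import xgboost still fails with the error below. Run the following commands from the XGBoost root directory to fix it:
ImportError: No module named xgboost # the error
cd .. # back to the XGBoost root directory
sh build.sh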
cd python-package
python setup.py install
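A quick way to confirm that the package is now importable, without triggering the ImportError, is to ask the import system whether it can locate the module. This is a small sketch; `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None
```

`is_installed("xgboost")` should return True once `python setup.py install` has finished in the active environment.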
Jupyter Notebook
Generate a notebook configuration file
By default the configuration file ~/.jupyter/jupyter_notebook_config.py does not exist. Generate it with:
jupyter notebook --generate-config
Running the command above as root produces this error:
Running as root is not recommended. Use --allow-root to bypass.
As root, add the --allow-root option:
jupyter notebook --generate-config --allow-root
On success, the following message appears:
Writing default config to: /root/.jupyter/jupyter_notebook_config.py
Generate a password
Open IPython and run the following:
In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
Add the password hash to jupyter_notebook_config.py:
c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
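The stored value has three colon-separated parts: the hash algorithm, a salt, and the digest. The sketch below shows how such a value could be split and checked. It is an illustration under an explicit assumption, namely that the digest is computed over the passphrase bytes followed by the salt, which is how I understand the classic notebook's `passwd` to work; `split_password_hash` and `check_password` are hypothetical helpers, not the library's API.

```python
import hashlib
import hmac


def split_password_hash(hashed):
    """Split 'algorithm:salt:digest' into its three parts."""
    algorithm, salt, digest = hashed.split(":", 2)
    return algorithm, salt, digest


def check_password(hashed, passphrase):
    """Return True if `passphrase` matches `hashed`.

    Assumption: digest = H(passphrase_utf8 + salt_ascii), with H named by
    the first field of the stored value.
    """
    algorithm, salt, digest = split_password_hash(hashed)
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return hmac.compare_digest(h.hexdigest(), digest)
```

Because only the salted digest is stored in the configuration file, the plain-text password never has to be written to disk.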
Edit the configuration file
In jupyter_notebook_config.py, find the lines below, uncomment them, and change their values:
c.NotebookApp.ip='*'
c.NotebookApp.password = u'sha:ce...' # the hash copied in the previous step
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888 # any free port; use this port when connecting
Managing kernels for different environments and versions: https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernel-install
conda install ipykernel # or pip install ipykernel
source activate env1
python -m ipykernel install --user --name env1 --display-name "env1"
source activate env2
python -m ipykernel install --user --name env2 --display-name "env2"
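With more than a couple of environments, the registration command can be generated rather than typed each time. The sketch below only builds the argument list shown above; `kernel_install_command` is a hypothetical helper, and it must be run with the Python interpreter of the environment being registered.

```python
import sys


def kernel_install_command(env_name, python=sys.executable):
    """Build the ipykernel registration command for one environment.

    `python` should be the interpreter inside the target environment;
    it defaults to the interpreter running this script.
    """
    return [
        python, "-m", "ipykernel", "install", "--user",
        "--name", env_name,
        "--display-name", env_name,
    ]
```

The returned list can be passed to `subprocess.run` from inside the activated environment, which has the same effect as the two commands per environment above.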
The next post covers how to integrate a remote server environment with PyCharm.