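Ubuntu Tips 21: Configuring a GPU Machine Environment
1 Introduction
The author often has to set up GPU environments of various kinds for work, so this post records the usual configuration process.
Typically, the machine is initialized first (basic parameter configuration, disk RAID configuration, and other base environment settings); next, the GPU driver, CUDA, cuDNN, NCCL, TensorFlow and other base dependencies and software are installed; finally, the GPU service is tested and the machine is handed over to its users. This post focuses on the second part.
The example here uses 2080 GPUs with CUDA 10.0. In practice the same steps apply to machines with 1080, 2080, 3090, P40, V100 and other common series, with driver and CUDA versions chosen to match the card.
2 Initializing the environment
Individual users generally only need the two items below. Larger teams can build a custom initialization script that sets up ulimit, the pip index source, base pip packages, and so on.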
apt-get install -y linux-headers-$(uname -r)
apt-get install -y freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
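3 Installing the NVIDIA driver
Run ./NVIDIA-Linux-x86_64-440.82.run to install the driver. The installer presents several prompts; accepting the default (OK) is usually fine.
For example, dynamic kernel module registration (DKMS) should be set to true, and for prompts such as xconfig you can decline the backup. After installation, run $ nvidia-smi to view the GPU information.
To uninstall the driver: nvidia-uninstall
When reinstalling the driver, first uninstall cleanly with nvidia-uninstall, then reboot the machine, and only then install the new version. Rebooting after the uninstall avoids problems caused by kernel modules that were not unloaded properly.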
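For scripted checks after a driver install or reinstall, the nvidia-smi query mode is easier to parse than the default table. The following is a minimal sketch, not the author's tooling: the helper names parse_gpu_csv and query_gpus are hypothetical, and the sample text is illustrative rather than captured from a real machine.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse `name, driver_version` CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first comma only; the GPU name itself has no comma here.
        name, driver = [field.strip() for field in line.split(",", 1)]
        gpus.append({"name": name, "driver": driver})
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list (needs a working driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_gpu_csv(out)


# Illustrative sample for two cards; real output depends on the machine.
sample = "GeForce RTX 2080 Ti, 440.82\nGeForce RTX 2080 Ti, 440.82\n"
gpus = parse_gpu_csv(sample)
print(len(gpus), gpus[0]["driver"])
```

On a machine with the driver installed, calling query_gpus() should return one entry per card, which makes it easy to confirm that all cards report the same driver version.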
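4 Installing CUDA
Run ./cuda_10.0.130_410.48_linux.run to install, setting the install path as needed, for example /home/cuda/cuda-10.0. It is best to deselect the bundled GPU driver: the bundled version (410.48, per the file name) is fairly old and is generally not recommended.
A silent install is also possible:
./cuda_10.0.130_410.48_linux.run --toolkit --toolkitpath=/home/cuda/cuda-10.0
If you do not use the silent install, simply choose No when asked about the NVIDIA Accelerated Graphics Driver and nvidia-xconfig.
If the installer runs out of space in the tmp directory, add the parameter --tmpdir=/home/new-tmp-dir.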
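When the install is driven from a provisioning script, the command line can be assembled in one place so that the optional --tmpdir is only added when needed. This is a small sketch under that assumption; build_cuda_install_cmd is a hypothetical helper, and it uses only the flags that appear in the text above.

```python
def build_cuda_install_cmd(runfile, toolkit_path, tmpdir=None):
    """Return the runfile command as an argument list.

    Only the flags mentioned in this post are used: --toolkit,
    --toolkitpath and, optionally, --tmpdir.
    """
    cmd = [runfile, "--toolkit", "--toolkitpath=" + toolkit_path]
    if tmpdir:
        # Add --tmpdir only when the default tmp directory is too small.
        cmd.append("--tmpdir=" + tmpdir)
    return cmd


cmd = build_cuda_install_cmd(
    "./cuda_10.0.130_410.48_linux.run",
    "/home/cuda/cuda-10.0",
    tmpdir="/home/new-tmp-dir",
)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.check_call as-is, which avoids shell quoting problems with paths.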
5 Installing cuDNN
dpkg -i libcudnn7_7.6.4.38-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-dev_7.6.4.38-1+cuda10.0_amd64.deb
Instead of the dpkg packages above, cuDNN can also be installed from the tgz archive:
tar zxf cudnn-10.0-linux-x64-v7.6.5.32.tgz
cp -R cuda/lib64/* /usr/local/cuda/lib64
cp cuda/include/cudnn.h /usr/local/cuda/include/
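Refresh the dynamic linker cache (the entry in ld.so.conf is the library directory itself, written with plain ASCII quotes):
echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf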
ldconfig
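To confirm which cuDNN version actually ended up in the include directory, the version macros in cudnn.h can be read directly. The sketch below assumes the cuDNN 7 header layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn.h, and assumes the tgz install path used above; cudnn_version is a hypothetical helper name.

```python
import os
import re


def cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + key + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Illustrative header fragment matching the 7.6.5 archive used above.
sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 5\n"
print(cudnn_version(sample))

# Assumed header location for the tgz install; adjust if cuDNN lives elsewhere.
header_path = "/usr/local/cuda/include/cudnn.h"
if os.path.exists(header_path):
    with open(header_path) as handle:
        print(cudnn_version(handle.read()))
```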
6 Installing NCCL
dpkg -i nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
The output includes a prompt:
# dpkg -i nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
Selecting previously unselected package nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0.
(Reading database ... 75848 files and directories currently installed.)
Preparing to unpack nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb ...
Unpacking nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0 (1-1) ...
Setting up nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0 (1-1) ...
The public CUDA GPG key does not appear to be installed.
To install the key, run this command:
sudo apt-key add /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
So install the key first, then proceed as follows:
sudo apt-key add /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
dpkg -i nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
dpkg -L nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0
/usr/share/doc/nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0/changelog.Debian.gz
/var
/var/nccl-repo-2.4.7-ga-cuda10.0
/var/nccl-repo-2.4.7-ga-cuda10.0/Release.gpg
/var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
/var/nccl-repo-2.4.7-ga-cuda10.0/libnccl2_2.4.7-1+cuda10.0_amd64.deb
/var/nccl-repo-2.4.7-ga-cuda10.0/Release
/var/nccl-repo-2.4.7-ga-cuda10.0/libnccl-dev_2.4.7-1+cuda10.0_amd64.deb
/var/nccl-repo-2.4.7-ga-cuda10.0/Packages.gz
/etc
/etc/apt
/etc/apt/sources.list.d
/etc/apt/sources.list.d/nccl-2.4.7-ga-cuda10.0.list
Continue the installation:
dpkg -i /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl2_2.4.7-1+cuda10.0_amd64.deb
dpkg -i /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl-dev_2.4.7-1+cuda10.0_amd64.deb
Check the NCCL installation with dpkg -l | grep nccl (the -l is a lowercase L).
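The same check can be automated by parsing the dpkg -l listing and keeping only packages in the installed state (ii). This is a sketch only: parse_dpkg_list is a hypothetical helper, and the sample rows are illustrative, written in the usual dpkg -l column layout rather than copied from the author's machine.

```python
def parse_dpkg_list(text):
    """Map package name -> version for rows whose status is 'ii' (installed)."""
    packages = {}
    for line in text.splitlines():
        fields = line.split(None, 4)  # status, name, version, arch, description
        if len(fields) >= 3 and fields[0] == "ii":
            name = fields[1].split(":")[0]  # drop an architecture suffix like :amd64
            packages[name] = fields[2]
    return packages


# Illustrative rows; the last one is in state 'rc' (removed, config remains).
sample = (
    "ii  libnccl-dev  2.4.7-1+cuda10.0  amd64  NCCL development files\n"
    "ii  libnccl2     2.4.7-1+cuda10.0  amd64  NCCL runtime\n"
    "rc  oldpkg       1.0               amd64  removed package\n"
)
installed = parse_dpkg_list(sample)
for name in sorted(installed):
    print(name, installed[name])
```

Both libnccl2 and libnccl-dev should appear with the same version string; a missing entry means one of the two dpkg -i steps did not complete.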
7 Setting environment variables
touch /etc/profile.d/tfenv.sh
Then put the following content into that file:
#!/bin/bash
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_ROOT}/extras/CUPTI/lib64:${CUDA_ROOT}/lib64
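After adding the file, run source /etc/profile.d/tfenv.sh to apply it to the current shell. Running it with bash would only set the variables in a child process, while new login shells load the file automatically. CUDA is then available from any directory. If /usr/local/cuda does not point at your install (for example, when the toolkit was installed to /home/cuda/cuda-10.0), set CUDA_HOME and CUDA_ROOT to the actual path instead. For example, switch to a regular user and run nvcc to check the CUDA version: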
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
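For automated verification, the release number can be pulled out of the nvcc --version text. The sketch below parses the output shown above; cuda_release is a hypothetical helper name, and the regular expression assumes the "release X.Y" wording seen in that output.

```python
import re


def cuda_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The nvcc output quoted in this section.
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Copyright (c) 2005-2018 NVIDIA Corporation\n"
    "Built on Sat_Aug_25_21:08:01_CDT_2018\n"
    "Cuda compilation tools, release 10.0, V10.0.130\n"
)
print(cuda_release(sample))
```

A provisioning script could compare the result against the expected (10, 0) and fail early when the PATH points at a different toolkit.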
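8 Installing TensorFlow
- Upgrade pip to the latest version
easy_install -U pip or pip install --upgrade pip
- Install tensorflow-gpu
pip install --upgrade tensorflow-gpu==1.14
After installation, confirm with pip list that tensorflow-gpu is installed.
TensorFlow and CUDA are not compatible across all versions, so do not combine arbitrary versions; the author has tested that CUDA 10.0 and tensorflow-gpu 1.14 work together.
- Test that TensorFlow works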
Test with the script below. If every GPU is detected and no errors are reported, the driver, CUDA and TensorFlow can all be considered working.
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
:~$ python test.py
WARNING:tensorflow:From test.py:3: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2020-11-17 20:22:19.056539: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-11-17 20:22:25.667005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1a:00.0
......
2020-11-17 20:22:25.674539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 7 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:b2:00.0
2020-11-17 20:22:25.676754: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-11-17 20:22:25.745095: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-11-17 20:22:25.772363: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-11-17 20:22:25.779587: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-11-17 20:22:25.855708: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-11-17 20:22:25.897265: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-11-17 20:22:26.025719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-17 20:22:26.040652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-11-17 20:22:26.041398: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-17 20:22:27.132712: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2a371c0 executing computations on platform CUDA. Devices:
2020-11-17 20:22:27.132771: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
......
2020-11-17 20:22:27.132869: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (7): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-11-17 20:22:27.141872: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3099990000 Hz
2020-11-17 20:22:27.149054: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x504d030 executing computations on platform Host. Devices:
2020-11-17 20:22:27.149109: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-11-17 20:22:27.153946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
......
2020-11-17 20:22:27.164498: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-17 20:22:27.183440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2020-11-17 20:22:27.184158: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-11-17 20:22:27.194829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-17 20:22:27.194849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3 4 5 6 7
2020-11-17 20:22:27.194855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N N N N N N N N
2020-11-17 20:22:27.194859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: N N N N N N N N
2020-11-17 20:22:27.194863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: N N N N N N N N
2020-11-17 20:22:27.194867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: N N N N N N N N
2020-11-17 20:22:27.194871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4: N N N N N N N N
2020-11-17 20:22:27.194876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5: N N N N N N N N
2020-11-17 20:22:27.194880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 6: N N N N N N N N
2020-11-17 20:22:27.194884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 7: N N N N N N N N
......
2020-11-17 20:22:27.221914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10310 MB memory) -> physical GPU (device: 7, name: GeForce RTX 2080 Ti, pci bus id: 0000:b2:00.0, compute capability: 7.5)
Hello, TensorFlow!
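Reading a log of this size by eye is error-prone on an 8-card machine, so the check can be reduced to collecting the device ids from the "Adding visible gpu devices" lines. This is a rough sketch that assumes the log wording of this TensorFlow version; visible_gpus is a hypothetical helper, and the sample contains only the two relevant lines from the log above.

```python
MARKER = "Adding visible gpu devices:"


def visible_gpus(log_text):
    """Collect the GPU ids listed after MARKER on any line of the log."""
    ids = set()
    for line in log_text.splitlines():
        if MARKER in line:
            tail = line.split(MARKER, 1)[1]
            # The tail looks like " 0, 1, 2, ..."; split on commas and spaces.
            ids.update(int(token) for token in tail.replace(",", " ").split())
    return sorted(ids)


# The two matching lines from the log in this section.
sample = (
    "2020-11-17 20:22:26.040652: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7\n"
    "2020-11-17 20:22:27.183440: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7\n"
)
found = visible_gpus(sample)
print(found)
```

Comparing len(found) with the number of cards physically installed (8 on the author's machine) gives a quick pass or fail signal.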
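9 Notes
- System and hardware environment
The author's system is Ubuntu 16.04 Server; the GPUs are GeForce RTX 2080 Ti, with 8 cards in the machine.
- Related download links:
NVIDIA official driver
CUDA official download site, CUDA historical versions
cuDNN official download site
NCCL official download site, NCCL historical versions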