Ubuntu Tips 21: Configuring a GPU Machine Environment

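1 Introduction

For work I frequently have to set up GPU environments of various kinds, so this post records the usual configuration procedure.
In general there are three stages. First, initialize the machine: basic parameters, disk RAID, and other base environment settings. Second, install the graphics driver, CUDA, cuDNN, NCCL, TensorFlow, and the other base packages and software. Third, verify that the GPUs work and hand the machine over to its users. This post focuses on the second stage.

The example here uses a 2080-series GPU with CUDA 10.0. The same steps also apply to machines with other common cards such as the 1080, 2080, 3090, P40, and V100, provided the driver and CUDA versions are chosen to support the card.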

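2 Initializing the environment

An individual user only needs to run the two commands below. A larger team can maintain a custom initialization script that also sets up ulimit, the pip mirror, a base set of pip packages, and so on.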

apt-get install -y linux-headers-$(uname -r)
apt-get install -y freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
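The driver installer in the next section builds a kernel module, which is why the headers for the running kernel are installed first. Below is a minimal Python sketch of a pre-check. It assumes the Ubuntu convention that the linux-headers-<release> package installs into /usr/src/linux-headers-<release>; the function name headers_present is made up for illustration.

```python
import os
import platform


def headers_present(release=None, src_root="/usr/src"):
    """Return True if the kernel headers directory for `release` exists.

    `release` defaults to the running kernel, i.e. what `uname -r` prints.
    `src_root` is an assumption based on where Ubuntu installs headers.
    """
    if release is None:
        release = platform.release()
    return os.path.isdir(os.path.join(src_root, "linux-headers-" + release))
```

Calling headers_present() with no arguments checks the running kernel; if it returns False, the first apt-get command above has not taken effect yet.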

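3 Installing the NVIDIA driver

Run ./NVIDIA-Linux-x86_64-440.82.run to install the driver. The installer asks several questions, and the default answer is usually fine.
For example, answer yes when asked about dynamic kernel module registration (DKMS). The nvidia-xconfig prompt, which offers to back up and update the X configuration, can be declined. After installation, run nvidia-smi to view the GPU information:
[Figure: nvidia-smi output listing the installed GPUs]
To uninstall the driver, run nvidia-uninstall.
When reinstalling the driver, first uninstall it cleanly with nvidia-uninstall, then reboot, and only then install the new version. Rebooting after the uninstall avoids problems caused by kernel modules that were not unloaded properly.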
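On a multi-GPU machine it can be handy to check the nvidia-smi result from a script rather than by eye. Here is a minimal Python sketch that relies on nvidia-smi's --query-gpu and --format=csv,noheader options. The helper names and the sample string are illustrative assumptions, not output captured from the author's machine.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, driver = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "driver": driver})
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (requires an installed driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_gpu_csv(out)


# Illustrative sample only; real output depends on the machine.
sample = "0, GeForce RTX 2080 Ti, 440.82\n1, GeForce RTX 2080 Ti, 440.82\n"
print(parse_gpu_csv(sample))
```

On a real machine, len(query_gpus()) can be compared with the number of cards physically installed to confirm that the driver sees all of them.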

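4 Installing CUDA

Run ./cuda_10.0.130_410.48_linux.run and set the install path as needed, for example /home/cuda/cuda-10.0. It is best to deselect the bundled graphics driver: the bundled version (410.48) is older than the driver installed above, so installing it is generally not recommended. The later sections use /usr/local/cuda; if you install elsewhere, substitute your own path or point a /usr/local/cuda symlink at it.
A silent installation is also possible:
./cuda_10.0.130_410.48_linux.run --silent --toolkit --toolkitpath=/home/cuda/cuda-10.0
In the interactive installer, just answer No when asked about the NVIDIA Accelerated Graphics Driver and nvidia-xconfig.
If the installer runs out of space in the tmp directory, add the option --tmpdir=/home/new-tmp-dir.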
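When the same install is repeated on many machines, the command line can be assembled in a script. The following is a rough Python sketch, assuming the runfile options --silent, --toolkit, --toolkitpath and --tmpdir. The function names are hypothetical, and required_bytes is a placeholder: the space the installer actually needs is not stated here.

```python
import shutil


def cuda_install_command(runfile, toolkit_path, tmpdir=None):
    """Build the argument list for a silent, toolkit-only runfile install."""
    cmd = [runfile, "--silent", "--toolkit", "--toolkitpath=" + toolkit_path]
    if tmpdir is not None:
        cmd.append("--tmpdir=" + tmpdir)
    return cmd


def pick_tmpdir(required_bytes, default="/tmp", fallback="/home/new-tmp-dir"):
    """Return None if `default` has enough free space, otherwise the fallback path."""
    if shutil.disk_usage(default).free >= required_bytes:
        return None
    return fallback


cmd = cuda_install_command(
    "./cuda_10.0.130_410.48_linux.run",
    "/home/cuda/cuda-10.0",
    tmpdir="/home/new-tmp-dir",
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.check_call as root; the sketch only builds the command and does not run the installer.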

5 Installing cuDNN

dpkg -i libcudnn7_7.6.4.38-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-dev_7.6.4.38-1+cuda10.0_amd64.deb
Instead of the dpkg packages above, cuDNN can also be installed from the tgz archive:
tar zxf cudnn-10.0-linux-x64-v7.6.5.32.tgz
cp -R cuda/lib64/* /usr/local/cuda/lib64
cp cuda/include/cudnn.h /usr/local/cuda/include/
Then refresh the dynamic linker cache:
echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf
ldconfig
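One way to confirm which cuDNN version ended up installed is to read the version macros from cudnn.h. Here is a minimal Python sketch, assuming the cuDNN 7 header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, and that the header sits at the tgz install path used above. The function names are illustrative.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the contents of cudnn.h."""
    version = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + macro + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(macro + " not found in header")
        version.append(int(match.group(1)))
    return tuple(version)


def installed_cudnn_version(header_path="/usr/local/cuda/include/cudnn.h"):
    """Read the header from disk; the default path assumes the tgz install above."""
    with open(header_path) as handle:
        return parse_cudnn_version(handle.read())


# Header excerpt matching the 7.6.5 tgz archive used above.
sample = "#define CUDNN_MAJOR 7\n#define CUDNN_MINOR 6\n#define CUDNN_PATCHLEVEL 5\n"
print(parse_cudnn_version(sample))  # (7, 6, 5)
```

With the dpkg install the header may live elsewhere, so header_path would need adjusting in that case.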

6 Installing NCCL

dpkg -i nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
The command prints:

     # dpkg -i  nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
     Selecting previously unselected package nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0.
     (Reading database ... 75848 files and directories currently installed.)
     Preparing to unpack nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb ...
     Unpacking nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0 (1-1) ...
     Setting up nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0 (1-1) ...
  
     The public CUDA GPG key does not appear to be installed.
     To install the key, run this command:
     sudo apt-key add /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
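So add the key and install the repo package again; dpkg -L then lists the files that the repo package provides: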
     sudo apt-key add /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
     dpkg -i  nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
     dpkg -L nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0
     /.
     /usr
     /usr/share
     /usr/share/doc
     /usr/share/doc/nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0
     /usr/share/doc/nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0/changelog.Debian.gz
     /var
     /var/nccl-repo-2.4.7-ga-cuda10.0
     /var/nccl-repo-2.4.7-ga-cuda10.0/Release.gpg
     /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub
     /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl2_2.4.7-1+cuda10.0_amd64.deb
     /var/nccl-repo-2.4.7-ga-cuda10.0/Release
     /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl-dev_2.4.7-1+cuda10.0_amd64.deb
     /var/nccl-repo-2.4.7-ga-cuda10.0/Packages.gz
     /etc
     /etc/apt
     /etc/apt/sources.list.d
     /etc/apt/sources.list.d/nccl-2.4.7-ga-cuda10.0.list

Then install the NCCL packages themselves:
dpkg -i /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl2_2.4.7-1+cuda10.0_amd64.deb
dpkg -i /var/nccl-repo-2.4.7-ga-cuda10.0/libnccl-dev_2.4.7-1+cuda10.0_amd64.deb
Check the NCCL installation with dpkg -l | grep nccl (the -l is a lowercase L):
[Figure: dpkg -l output listing the installed NCCL packages]
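The same check can be scripted by filtering the dpkg -l listing. Below is a small Python sketch. It assumes the usual dpkg -l column order (status, name, version, architecture, description), and the sample lines are illustrative rather than copied from a real machine.

```python
def nccl_packages(dpkg_output):
    """Map package name -> version for installed packages whose name contains 'nccl'."""
    packages = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Fully installed packages carry the status "ii" in the first column.
        if len(fields) >= 3 and fields[0] == "ii" and "nccl" in fields[1]:
            packages[fields[1]] = fields[2]
    return packages


# Illustrative sample; descriptions are placeholders.
sample = (
    "ii  libnccl-dev  2.4.7-1+cuda10.0  amd64  NCCL development files\n"
    "ii  libnccl2     2.4.7-1+cuda10.0  amd64  NCCL runtime\n"
    "ii  bash         4.3-14ubuntu1     amd64  GNU Bourne Again SHell\n"
)
print(nccl_packages(sample))
```

Both libnccl2 and libnccl-dev should appear with the cuda10.0 version suffix; a missing entry means one of the two dpkg -i commands above did not succeed.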

7 Setting environment variables

touch /etc/profile.d/tfenv.sh
Then write the following into the file:

#!/bin/bash
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_ROOT}/extras/CUPTI/lib64:${CUDA_ROOT}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

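Files in /etc/profile.d are read by login shells, so the settings take effect at the next login. To apply them to the current shell, run source /etc/profile.d/tfenv.sh; running the file with bash only sets the variables in a child shell. After that, the CUDA tools are available from any directory. For example, switch to a regular user and run nvcc to check the CUDA version: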

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
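If the version check needs to be automated, the release number can be pulled out of the nvcc output shown above. Here is a minimal Python sketch; the function name is illustrative.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA release as (major, minor) from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


# The output shown above.
output = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130"""
print(parse_nvcc_release(output))  # (10, 0)
```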

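8 Installing TensorFlow

  1. Upgrade pip to the latest version
    easy_install -U pip, or pip install --upgrade pip
  2. Install tensorflow-gpu
    pip install --upgrade tensorflow-gpu==1.14
    After installation, run pip list to confirm that TensorFlow was installed.
    Some TensorFlow and CUDA versions are incompatible, so do not combine arbitrary versions. I have tested that CUDA 10.0 and tensorflow-gpu 1.14 work together.
  3. Test TensorFlow
    Run the following script (saved as test.py). If it detects every GPU and reports no errors, the driver, CUDA, and TensorFlow are all working.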
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    sess = tf.Session()
    print(sess.run(hello))
    
    :~$  python test.py 
    WARNING:tensorflow:From test.py:3: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
    
    2020-11-17 20:22:19.056539: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
    2020-11-17 20:22:25.667005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:1a:00.0
    ......
    2020-11-17 20:22:25.674539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 7 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:b2:00.0
    2020-11-17 20:22:25.676754: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
    2020-11-17 20:22:25.745095: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
    2020-11-17 20:22:25.772363: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
    2020-11-17 20:22:25.779587: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
    2020-11-17 20:22:25.855708: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
    2020-11-17 20:22:25.897265: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
    2020-11-17 20:22:26.025719: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2020-11-17 20:22:26.040652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
    2020-11-17 20:22:26.041398: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2020-11-17 20:22:27.132712: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2a371c0 executing computations on platform CUDA. Devices:
    2020-11-17 20:22:27.132771: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
    ......
    2020-11-17 20:22:27.132869: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (7): GeForce RTX 2080 Ti, Compute Capability 7.5
    2020-11-17 20:22:27.141872: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3099990000 Hz
    2020-11-17 20:22:27.149054: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x504d030 executing computations on platform Host. Devices:
    2020-11-17 20:22:27.149109: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
    2020-11-17 20:22:27.153946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
    ......
    2020-11-17 20:22:27.164498: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2020-11-17 20:22:27.183440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
    2020-11-17 20:22:27.184158: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
    2020-11-17 20:22:27.194829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2020-11-17 20:22:27.194849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 4 5 6 7 
    2020-11-17 20:22:27.194855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N N N N N N N 
    2020-11-17 20:22:27.194859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N N N N N N N 
    2020-11-17 20:22:27.194863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N N N N N N 
    2020-11-17 20:22:27.194867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N N N N N N N 
    2020-11-17 20:22:27.194871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4:   N N N N N N N N 
    2020-11-17 20:22:27.194876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5:   N N N N N N N N 
    2020-11-17 20:22:27.194880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 6:   N N N N N N N N 
    2020-11-17 20:22:27.194884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 7:   N N N N N N N N 
    ......
    2020-11-17 20:22:27.221914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:7 with 10310 MB memory) -> physical GPU (device: 7, name: GeForce RTX 2080 Ti, pci bus id: 0000:b2:00.0, compute capability: 7.5)
    Hello, TensorFlow!
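On an 8-GPU machine the log is long, so a script can do the "did it detect every card" check. Here is a rough Python sketch that scans the log for the "Adding visible gpu devices" lines shown above. It assumes the log was saved to a file first (TensorFlow typically writes it to stderr, for example python test.py 2> tf.log); the function name is illustrative.

```python
import re


def visible_gpus(log_text):
    """Return the sorted GPU indices reported in 'Adding visible gpu devices' lines."""
    found = set()
    for match in re.finditer(r"Adding visible gpu devices: ([\d, ]+)", log_text):
        found.update(
            int(part) for part in match.group(1).split(",") if part.strip()
        )
    return sorted(found)


# One line taken from the log above.
line = (
    "2020-11-17 20:22:26.040652: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7"
)
print(visible_gpus(line))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Comparing len(visible_gpus(log)) with the expected card count (8 on the machine described here) turns the manual inspection into a pass/fail check.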
    

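9 Notes

  1. System and hardware
    The system is Ubuntu 16.04 Server with eight GeForce RTX 2080 Ti GPUs.

  2. Download links:
    NVIDIA official driver downloads
    CUDA official download page, CUDA archive of previous versions
    cuDNN official download page
    NCCL official download page, NCCL archive of previous versions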
