Notes on Setting Up an NVIDIA + PyTorch Machine Learning Environment

北极象

Last modified: 2024-05-05 10:48:45

First published: 2023-11-02 17:07:59

Article link: https://blog.csdn.net/jgku/article/details/134186978

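Introduction to the CUDA and cuDNN Programming Environment

CUDA™ is a general-purpose parallel computing architecture introduced by NVIDIA that enables GPUs to solve complex computational problems.
cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization, and activation layers. Deep learning researchers and framework developers worldwide rely on cuDNN for high-performance GPU acceleration.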

Environment Preparation

Installing Dependencies

Prerequisites: install common libraries such as zlib, openssl, and glibc first, and choose suitable versions.

On CentOS 7, kernel-devel and gcc must be installed:
$ sudo yum group install "Development Tools"
$ sudo yum install kernel-devel

$ sudo yum -y install epel-release
$ sudo yum -y install dkms

Upgrading gcc

yum install centos-release-scl
yum install devtoolset-8 ## devtoolset-8 corresponds to gcc 8.x.x

## Activate gcc in the current shell:
scl enable devtoolset-8 bash
or
source /opt/rh/devtoolset-8/enable

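The Three Installation Steps

First check the graphics card model:

lspci | grep -iE --color 'vga|3d|2d'
or: sudo lshw -class display
For an NVIDIA card, run lspci | grep -i nvidia, then lspci -v -s <device address>; the output for the server used here is reproduced in the CentOS 7 section below.
That output lists a 32G prefetchable memory region. This is the size of a PCI memory window (BAR) rather than a direct reading of video memory; nvidia-smi reports the actual memory once the driver is installed.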
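The lookup above can also be scripted. The following is a minimal sketch, not part of the original procedure: find_nvidia_devices is a hypothetical helper that picks NVIDIA lines out of lspci output and returns the device address to pass to lspci -v -s. It assumes the usual one-line-per-device lspci format.

```python
import re
import shutil
import subprocess

# Matches lines such as
# "04:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)"
_DEVICE_LINE = re.compile(
    r"^(?P<slot>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<name>.*NVIDIA.*)$",
    re.IGNORECASE,
)


def find_nvidia_devices(lspci_text):
    """Return (slot, device class, description) for each NVIDIA line."""
    devices = []
    for line in lspci_text.splitlines():
        match = _DEVICE_LINE.match(line.strip())
        if match:
            devices.append((match["slot"], match["cls"], match["name"]))
    return devices


if __name__ == "__main__":
    if shutil.which("lspci"):
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        for slot, cls, name in find_nvidia_devices(out):
            print(f"{slot}  {cls}: {name}  ->  lspci -v -s {slot}")
    else:
        print("lspci not found on this machine")
```

Fed the Tesla P40 line shown later in this post, the helper returns the address 04:00.0 together with the class "3D controller", which is why the grep pattern includes 3d as well as vga.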

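1 – Driver installation. Run nvidia-smi; if the command is not found, download the NVIDIA driver.
2 – CUDA Toolkit. I downloaded cuda-repo-rhel7-12-3-local-12.3.1_545.23.08-1.x86_64.rpm. After installing, run deviceQuery and bandwidthTest to check that the installation succeeded. Change to the CUDA Samples directory (deviceQuery is under /home/xxx/NVIDIA_CUDA-xxx/ by default) and run make to build them.
3 – cuDNN installation. I downloaded cudnn-local-repo-rhel7-8.9.6.50-1.0-1.x86_64.rpm into /usr/local. A tarball download is extracted with tar -zxvf cudnn-8.0-linux-x64-v6.0.tgz, after which you only need to configure the environment variables.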

Add a few environment variables:

export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
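After opening a new shell, it can be worth confirming that the variables actually took effect. Here is a small sketch under the assumption that CUDA lives in /usr/local/cuda and that CUDA_HOME should hold that single directory; missing_cuda_entries is a hypothetical helper, not an NVIDIA tool.

```python
import os
import shutil


def missing_cuda_entries(env, cuda_home="/usr/local/cuda"):
    """Name each expected setting that is absent from env (a dict of variables)."""
    problems = []
    # CUDA_HOME is expected to be exactly one directory, not a colon-separated list.
    if env.get("CUDA_HOME") != cuda_home:
        problems.append("CUDA_HOME")
    if f"{cuda_home}/bin" not in env.get("PATH", "").split(":"):
        problems.append("PATH")
    if f"{cuda_home}/lib64" not in env.get("LD_LIBRARY_PATH", "").split(":"):
        problems.append("LD_LIBRARY_PATH")
    return problems


if __name__ == "__main__":
    print("missing or different:", missing_cuda_entries(dict(os.environ)) or "nothing")
    # nvcc should resolve once /usr/local/cuda/bin is on PATH.
    print("nvcc found at:", shutil.which("nvcc"))
```

An empty list means all three settings are in place; a None for nvcc usually means the toolkit is not installed or PATH was not reloaded.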

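NVIDIA Driver Installation

Download the driver. It comes in two forms, a run package and an rpm package:
One is 1.3G (the rpm package) and the other is 143M.

Method 1: Installing from the run Package

Go to the driver page on the official site and select the driver for your card:
The procedure to install the proprietary NVIDIA GPU driver on CentOS 7 Linux is as follows: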

Update your system with the yum or dnf command
Blacklist the nouveau driver
Download the NVIDIA driver for CentOS 7
Install the software required to install the proprietary NVIDIA driver on CentOS
Disable the Nouveau driver in CentOS 7
Switch CentOS 7 to text mode
Run the NVIDIA driver installer
Reboot CentOS 7 to use the NVIDIA driver

1 – Download the driver from NVIDIA. The driver's filename looks like NVIDIA-Linux-x86_64-290.10.run.

2 – To install the driver, the X server must be stopped. All operations must be executed from the command line (a virtual console). Fortunately, only a few operations are required. To open the first virtual console, press [Ctrl]+[Alt]+[F1]. There are six virtual consoles, and any of F1 to F6 will do. Once the console is open, log in. Now you can stop the X server with:

    sudo /etc/init.d/gdm stop

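On Upstart-based systems this command is better (CentOS 7 uses systemd, where the equivalent is sudo systemctl isolate multi-user.target):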

    sudo stop gdm

The following command can also help:

    sudo killall Xorg

3 – Now you can install the driver:

    sudo sh NVIDIA-Linux-x86_64-290.10.run

On CentOS, you also need to:
Disable the Nouveau driver. Nouveau is an open-source 3D driver for NVIDIA cards developed by a third party.

$ cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf
blacklist nouveau
options nouveau modeset=0

Apply the change by regenerating the initramfs with dracut --force, then reboot.
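After the reboot, two things are worth checking: that the blacklist file really contains both directives, and that the nouveau module is no longer loaded. The sketch below illustrates this with two hypothetical helpers; it assumes the file path shown above and the standard /proc/modules layout, where the module name is the first field on each line.

```python
from pathlib import Path


def blacklist_is_complete(conf_text):
    """True if both directives from the article are present (comments ignored)."""
    directives = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            directives.add(" ".join(line.split()))
    return {"blacklist nouveau", "options nouveau modeset=0"} <= directives


def nouveau_is_loaded(proc_modules_text):
    """True if a module named nouveau appears in /proc/modules-style text."""
    return any(
        line.split()[:1] == ["nouveau"] for line in proc_modules_text.splitlines()
    )


if __name__ == "__main__":
    conf = Path("/etc/modprobe.d/blacklist-nvidia-nouveau.conf")
    modules = Path("/proc/modules")
    try:
        if conf.exists():
            print("blacklist complete:", blacklist_is_complete(conf.read_text()))
        if modules.exists():
            print("nouveau still loaded:", nouveau_is_loaded(modules.read_text()))
    except OSError as err:
        print("could not read system files:", err)
```

If nouveau still shows up as loaded after the reboot, the initramfs was probably not regenerated before rebooting.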
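After the GPU driver is installed, Persistence-M is off by default. The GPU driver performs more stably with Persistence-M enabled, so for more stable workloads it is recommended to enable it through the NVIDIA Persistence Daemon.
Run the following command to start the NVIDIA Persistence Daemon: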

nvidia-persistenced --user username
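To confirm that the daemon had the intended effect, the persistence state can be read back from nvidia-smi. This is a rough sketch: it relies on the index and persistence_mode fields of nvidia-smi --query-gpu, and parse_persistence and query_persistence are hypothetical helper names.

```python
import shutil
import subprocess


def parse_persistence(csv_text):
    """Map GPU index -> True/False from 'index, persistence_mode' CSV rows."""
    result = {}
    for row in csv_text.strip().splitlines():
        if "," not in row:
            continue  # skip anything that is not an 'index, mode' row
        index, mode = [field.strip() for field in row.split(",", 1)]
        result[int(index)] = (mode == "Enabled")
    return result


def query_persistence():
    """Run nvidia-smi; returns {} when the tool is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return {}
    cmd = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_persistence(proc.stdout) if proc.returncode == 0 else {}


if __name__ == "__main__":
    for gpu, enabled in query_persistence().items():
        print(f"GPU {gpu}: Persistence-M {'on' if enabled else 'off'}")
```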

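Method 2: Installing from the Driver rpm Package

First install the three main dependencies: gcc, kernel-devel, and dkms. Note that the kernel-devel version must match the running kernel version; otherwise later steps fail with missing-file errors.

Install the 1.3G rpm package first, then refresh the yum repositories, then install the driver:

rpm -i nvidia-driver-local-repo-rhel7-535.129.03-1.0-1.x86_64.rpm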

yum clean all
yum makecache
yum -y update

yum install nvidia-driver

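After rebooting the machine, confirm that the driver works, for example with nvidia-smi.

Sometimes the kernel needs to be upgraded:

## Check the kernel version:
uname -r
## List the installable versions
yum list | grep kernel-devel
## Install the kernel development package
yum install kernel-devel.x86_64
## Install dependencies
yum -y install gcc dkms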
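The requirement that kernel-devel match the running kernel is easy to check mechanically. Here is a minimal sketch, assuming the usual CentOS naming where rpm -q kernel-devel prints kernel-devel-<release> and <release> is the same string uname -r prints; the helper names are hypothetical.

```python
import platform
import shutil
import subprocess


def kernel_devel_matches(running_kernel, installed_packages):
    """True if some kernel-devel package carries exactly the running release."""
    wanted = f"kernel-devel-{running_kernel}"
    return wanted in [pkg.strip() for pkg in installed_packages]


def installed_kernel_devel():
    """Ask rpm for installed kernel-devel packages; [] if rpm is unavailable."""
    if shutil.which("rpm") is None:
        return []
    proc = subprocess.run(["rpm", "-q", "kernel-devel"], capture_output=True, text=True)
    return proc.stdout.splitlines()


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    matched = kernel_devel_matches(release, installed_kernel_devel())
    print(f"running kernel {release}: matching kernel-devel installed = {matched}")
```

A False result after yum install kernel-devel typically means yum pulled headers for a newer kernel than the one currently booted, in which case updating the kernel and rebooting brings the two back in line.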

Uninstalling the NVIDIA Driver

./NVIDIA-Linux-x86_64-440.33.01.run --uninstall
or: sudo /usr/bin/nvidia-uninstall

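Brief Notes on Installing in Various Environments

Mac Pro

My Mac Pro is a mid-2015 model with an integrated Intel Iris Pro Graphics GPU, so it cannot use CUDA.

Unfortunately, the PyTorch team does not publish a CUDA build for macOS. macOS 10.14 (Mojave) and later do not currently support CUDA, so if you need CUDA support, do not upgrade beyond macOS 10.13.6. Premiere Pro dropped CUDA support starting with version 14.0. Some helpful people have, however, compiled a pytorch-osx-build version for us.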

ThinkPad W530

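My ThinkPad W530 has an NVIDIA Quadro K1000M, a GPU built on TSMC's 28nm process with the NVIDIA Kepler architecture, launched on June 1, 2012. It has 1.27 billion transistors, 192 CUDA cores, 2GB of DDR3 video memory, and 256KB of L2 cache, with a theoretical 326.4 GFLOPS and a total power draw of 45W. As a Kepler GPU, the K1000M supports CUDA.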
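The theoretical GFLOPS figures quoted in this post follow from a simple rule of thumb: each CUDA core can retire one fused multiply-add (counted as 2 floating-point operations) per clock cycle. The sketch below works this through for the K1000M; the clock value is not stated in the post and is derived here by inverting the formula, so treat it as an inference.

```python
def peak_fp32_gflops(cuda_cores, clock_mhz):
    """Theoretical peak: 2 FLOPs (one fused multiply-add) per core per cycle."""
    return 2 * cuda_cores * clock_mhz / 1000.0


def implied_clock_mhz(cuda_cores, gflops):
    """Invert the formula to recover the clock a quoted figure assumes."""
    return gflops * 1000.0 / (2 * cuda_cores)


if __name__ == "__main__":
    # K1000M: 192 cores and 326.4 GFLOPS are the figures quoted in the post.
    clock = implied_clock_mhz(192, 326.4)
    print(f"implied K1000M clock: {clock:.0f} MHz")
    print(f"peak at that clock:   {peak_fp32_gflops(192, clock):.1f} GFLOPS")
```

The quoted 326.4 GFLOPS corresponds to a clock of about 850 MHz. The same arithmetic applied to a card with 3,072 cores and 11.2 TFLOPS FP32 implies roughly 1.82 GHz.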

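Download and install CUDA (note that CUDA 12.x has dropped Kepler support, so the K1000M itself needs an older toolkit release):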

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-ubuntu2004-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

ThinkPad P15

NVIDIA Quadro RTX 5000 graphics card:

Turing GPU
3,072 NVIDIA® CUDA® cores
384 NVIDIA® Tensor cores
48 NVIDIA® RT cores
16GB GDDR6 memory
Up to 448GB/s memory bandwidth
62T RTX-OPS
8 Giga Rays/s ray casting
11.2 TFLOPS FP32 performance
22.3 TFLOPS FP16 performance
178.4 TOPS INT8 performance
89.2 TFLOPS Tensor performance
Maximum power consumption: 265W
4x DisplayPort 1.4
1x VirtualLink

CentOS 7

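So I borrowed a friend's Linux server fitted with an NVIDIA Tesla P40, which probably costs a little over 30,000 yuan. The P40 has 24GB of video memory (the 32G in the lspci output below is the size of a PCI memory window, not the video memory). Theoretical throughput is 11.76 TFLOPS (FP32) and 367.4 GFLOPS (FP64).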

[root@iotdb-1 ~]# lspci | grep NVIDIA
04:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
[root@iotdb-1 ~]# lspci -v -s 04:00.0
04:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
        Subsystem: NVIDIA Corporation Device 11d9
        Flags: bus master, fast devsel, latency 0, IRQ 96, NUMA node 0
        Memory at 91000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 3b000000000 (64-bit, prefetchable) [size=32G]
        Memory at 3b800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Kernel driver in use: nouveau
        Kernel modules: nouveau
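The last two lines of this output matter for the driver installation: the card is still bound to nouveau, and once the proprietary driver is installed the same field should read nvidia. A small sketch for pulling that field out of lspci -v output follows (kernel_driver_in_use is a hypothetical helper).

```python
import re


def kernel_driver_in_use(lspci_verbose_text):
    """Return the driver named on the 'Kernel driver in use:' line, or None."""
    match = re.search(
        r"^\s*Kernel driver in use:\s*(\S+)", lspci_verbose_text, re.MULTILINE
    )
    return match.group(1) if match else None


if __name__ == "__main__":
    # Tail of the output shown above.
    tail = (
        "        Capabilities: [900] #19\n"
        "        Kernel driver in use: nouveau\n"
        "        Kernel modules: nouveau\n"
    )
    driver = kernel_driver_in_use(tail)
    print("driver in use:", driver)
    if driver == "nouveau":
        print("nouveau is still bound; blacklist it before installing the NVIDIA driver")
```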

Download and install CUDA:

wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-rhel7-12-3-local-12.3.1_545.23.08-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-12-3-local-12.3.1_545.23.08-1.x86_64.rpm
sudo yum clean all
sudo yum -y install cuda-toolkit-12-3

Using the Aliyun mirror:
First download the cuda-rhel7.repo file, then modify it:

sed -e 's,developer.download.nvidia.cn/compute/cuda/repos/,mirrors.aliyun.com/nvidia-cuda,g' \
-e 's,developer.download.nvidia.com/compute/cuda/repos,mirrors.aliyun.com/nvidia-cuda,g' \
-i /etc/yum.repos.d/cuda-rhel7.repo

Then install:

yum makecache
yum install cuda-12-3
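Before editing the repo file in place, the effect of the mirror rewrite can be previewed on a copy of its text. The sketch below is a Python rendering of the same idea rather than an exact port of the sed command: it escapes the dots so they match literally, handles both download hosts with one pattern, and, unlike the first sed expression, leaves the slash after repos in place so the path separator survives. The baseurl line is a hypothetical example in the usual .repo layout.

```python
import re

# One pattern for both download hosts; dots are escaped to match literally.
_NVIDIA_REPO = re.compile(r"developer\.download\.nvidia\.(?:cn|com)/compute/cuda/repos")
_ALIYUN_REPO = "mirrors.aliyun.com/nvidia-cuda"


def to_aliyun(repo_text):
    """Return repo_text with NVIDIA repo hosts swapped for the Aliyun mirror."""
    return _NVIDIA_REPO.sub(_ALIYUN_REPO, repo_text)


if __name__ == "__main__":
    # Hypothetical baseurl line, for illustration only.
    before = "baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64"
    print(before)
    print(to_aliyun(before))
```

Comparing the printed lines makes it easy to spot a mangled URL before yum makecache fails on it.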

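PyTorch Installation

Installation

Installing directly on CentOS 7 failed with a "not found" error.
The failure turned out to be because I was using Python 3.12, so I promptly downgraded to 3.10.

pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu102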
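The Python 3.12 failure happens because pip only finds wheels built for specific interpreter versions. A quick pre-flight check along these lines can save a failed install. This is a sketch: the default range of 3.7 to 3.10 is my assumption for the torch 1.12.1 wheels and should be checked against the previous-versions page for whichever release you install.

```python
import sys


def python_supported(version=None, oldest=(3, 7), newest=(3, 10)):
    """True if the (major, minor) version lies within [oldest, newest].

    The default bounds are an assumption for torch 1.12.1; pass your own
    bounds for other releases.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return oldest <= major_minor <= newest


if __name__ == "__main__":
    running = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "within" if python_supported() else "outside"
    print(f"Python {running} is {verdict} the assumed supported range")
```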

On a MacBook Pro:

pip3 install torch torchvision torchaudio

On an Apple M1 chip:

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Verification

>>> import torch
>>> print(torch.__version__)
2.1.1+cu121
>>> print(torch.cuda.is_available())
True
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.5475, 0.8505, 0.5119],
        [0.7170, 0.0864, 0.8615],
        [0.2313, 0.8355, 0.9407],
        [0.8058, 0.2958, 0.4819],
        [0.6380, 0.3769, 0.6650]])
>>>
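The same checks can be wrapped in a small script that also reports the CUDA and cuDNN versions PyTorch sees and runs one operation on the GPU. This is a minimal sketch: torch_report is a hypothetical helper, and it degrades gracefully when torch is not installed or no GPU is visible.

```python
import importlib.util


def torch_report():
    """Collect the facts checked in the interactive session into a dict."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    report = {
        "installed": True,
        "version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["cudnn_version"] = torch.backends.cudnn.version()
        # One small matrix product on the GPU proves kernels actually launch.
        x = torch.rand(5, 3, device="cuda")
        report["matmul_shape"] = tuple((x @ x.T).shape)
    return report


if __name__ == "__main__":
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False while built_for_cuda shows a version, the usual suspects are the driver (check nvidia-smi) or a driver too old for that CUDA build.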

Appendix

CUDA and cuDNN versions corresponding to each PyTorch GPU release
For matching versions of torch, torchvision, CUDA, and Python, see the official page https://pytorch.org/get-started/previous-versions/
CUDA and graphics drivers: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html