Linux GPU drivers: Azure N-series GPU driver installation for Linux - Azure Virtual Machines | Microsoft Docs

Install NVIDIA GPU drivers on N-series VMs running Linux

11/11/2019

To take advantage of the GPU capabilities of Azure N-series VMs backed by NVIDIA GPUs, you must install NVIDIA GPU drivers. The NVIDIA GPU Driver Extension installs appropriate NVIDIA CUDA or GRID drivers on an N-series VM. Install or manage the extension using the Azure portal or tools such as the Azure CLI or Azure Resource Manager templates. See the NVIDIA GPU Driver Extension documentation for supported distributions and deployment steps.

If you choose to install NVIDIA GPU drivers manually, this article provides supported distributions, drivers, and installation and verification steps. Manual driver setup information is also available for Windows VMs.

For N-series VM specs, storage capacities, and disk details, see GPU Linux VM sizes.

Supported distributions and drivers

NVIDIA CUDA drivers

NVIDIA CUDA drivers for NC, NCv2, NCv3, ND, and NDv2-series VMs (optional for NV-series) are supported only on the Linux distributions listed in the following table. CUDA driver information is current at time of publication. For the latest CUDA drivers and supported operating systems, visit the NVIDIA website. Ensure that you install or upgrade to the latest CUDA drivers for your distribution.

Tip

As an alternative to manual CUDA driver installation on a Linux VM, you can deploy an Azure Data Science Virtual Machine (DSVM) image. The DSVM editions for Ubuntu 16.04 LTS or CentOS 7.4 pre-install NVIDIA CUDA drivers, the CUDA Deep Neural Network Library, and other tools.

NVIDIA GRID drivers

Microsoft redistributes NVIDIA GRID driver installers for NV and NVv3-series VMs used as virtual workstations or for virtual applications. Install only these GRID drivers on Azure NV VMs, and only on the operating systems listed in the following table. These drivers include licensing for GRID Virtual GPU Software in Azure. You do not need to set up an NVIDIA vGPU software license server.

The GRID drivers redistributed by Azure do not work on non-NV series VMs such as NC, NCv2, NCv3, ND, and NDv2-series VMs.

Distribution:
Ubuntu 18.04 LTS
Ubuntu 16.04 LTS
Red Hat Enterprise Linux 7.7 to 7.9, 8.0, 8.1
SUSE Linux Enterprise Server 12 SP2
SUSE Linux Enterprise Server 15 SP2

Driver (for all distributions listed above):
NVIDIA GRID 12.0, driver branch R460 (.exe)

Visit GitHub for the complete list of all previous NVIDIA GRID driver links.

Warning

Installation of third-party software on Red Hat products can affect the Red Hat support terms.

Install CUDA drivers on N-series VMs

Here are steps to install CUDA drivers from the NVIDIA CUDA Toolkit on N-series VMs.

C and C++ developers can optionally install the full Toolkit to build GPU-accelerated applications. For more information, see the CUDA Installation Guide.

To install CUDA drivers, make an SSH connection to each VM. To verify that the system has a CUDA-capable GPU, run the following command:

lspci | grep -i NVIDIA

You will see output similar to the following example (showing an NVIDIA Tesla K80 card):

[Image: lspci command output]

lspci lists the PCIe devices on the VM, including the InfiniBand NIC and GPUs, if any. If lspci doesn't return successfully, you may need to install LIS on CentOS/RHEL (instructions below).
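As a minimal sketch of the check above, the following helper counts the PCI devices whose description mentions NVIDIA. The function name count_nvidia_devices is hypothetical and not part of any Azure or NVIDIA tooling. It reads lspci-style text from standard input, so it can be tried without a GPU.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Azure or NVIDIA tool):
# count the lines of lspci-style input that mention NVIDIA.
count_nvidia_devices() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, so "|| true" keeps the helper from failing.
    grep -ci 'nvidia' || true
}
```

On a VM you might run: lspci | count_nvidia_devices. A result of 0 would suggest that no NVIDIA GPU is visible to the guest.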

Then run installation commands specific for your distribution.

Ubuntu

Download and install the CUDA drivers from the NVIDIA website.

Note

The example below shows the CUDA package path for Ubuntu 16.04. Replace the path with the one specific to the version you plan to use.

Visit the NVIDIA Download Center (https://developer.download.nvidia.com/compute/cuda/repos/) for the full path specific to each version.

CUDA_REPO_PKG=cuda-repo-ubuntu1604_10.0.130-1_amd64.deb

wget -O /tmp/${CUDA_REPO_PKG} https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG}

sudo dpkg -i /tmp/${CUDA_REPO_PKG}

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

rm -f /tmp/${CUDA_REPO_PKG}

sudo apt-get update

sudo apt-get install cuda-drivers

The installation can take several minutes.

To optionally install the complete CUDA toolkit, type:

sudo apt-get install cuda

Reboot the VM and proceed to verify the installation.
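The repository path in the Ubuntu example embeds the release number (ubuntu1604 for Ubuntu 16.04). The sketch below derives the repository base URL from a version string. It assumes that other releases follow the same directory naming pattern, which you should confirm on the NVIDIA download site. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: build the CUDA repository base URL for an
# Ubuntu release. Assumes the directory naming seen for 16.04
# (ubuntu1604) also applies to the release you pass in.
ubuntu_repo_dir() {
    # "16.04" -> "ubuntu1604"
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

cuda_repo_base_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64\n' \
        "$(ubuntu_repo_dir "$1")"
}
```

For example, cuda_repo_base_url 16.04 prints the base URL used in the wget command above. The package file name still has to be looked up for each release.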

CUDA driver updates

We recommend that you periodically update CUDA drivers after deployment.

sudo apt-get update

sudo apt-get upgrade -y

sudo apt-get dist-upgrade -y

sudo apt-get install cuda-drivers

sudo reboot

CentOS or Red Hat Enterprise Linux

Update the kernel (recommended). If you choose not to update the kernel, ensure that the versions of kernel-devel and dkms are appropriate for your kernel.

sudo yum install kernel kernel-tools kernel-headers kernel-devel

sudo reboot
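One way to check that kernel-devel matches the running kernel is to compare the output of uname -r with the installed kernel-devel package names. The helper below is a sketch under that assumption. The name kernel_devel_matches is hypothetical, and it expects package names in the usual kernel-devel-<version>-<release>.<arch> form on standard input.

```shell
#!/bin/sh
# Hypothetical helper: succeed if one of the package names on stdin
# is exactly "kernel-devel-<running kernel release>".
kernel_devel_matches() {
    # $1: running kernel release, for example the output of: uname -r
    # -F treats the pattern as a fixed string, -x requires a whole-line match.
    grep -qxF "kernel-devel-$1"
}
```

On the VM this could be used as: rpm -q kernel-devel | kernel_devel_matches "$(uname -r)". If the check fails, install the kernel-devel package that matches the running kernel or update the kernel as described above.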

Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.

Note that LIS is applicable to Red Hat Enterprise Linux, CentOS, and the Oracle Linux Red Hat Compatible Kernel 5.2-5.11, 6.0-6.10, and 7.0-7.7. Refer to the Linux Integration Services documentation (https://www.microsoft.com/en-us/download/details.aspx?id=55106) for more details.

Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions), as LIS is no longer required for these versions.

wget https://aka.ms/lis

tar xvzf lis

cd LISISO

sudo ./install.sh

sudo reboot
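The rule for skipping the LIS step (CentOS/RHEL 7.8 or higher) can be expressed as a small version comparison. The sketch below encodes only that rule. The function name lis_step_skippable is hypothetical, and the input is assumed to be a version string of the form major.minor, such as 7.7 or 8.1.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the CentOS/RHEL version is 7.8 or
# higher, which is when the article says the LIS step can be skipped.
# Assumes a "major.minor" version string such as 7.7 or 8.1.
lis_step_skippable() {
    major="${1%%.*}"        # text before the first dot
    rest="${1#*.}"          # text after the first dot
    minor="${rest%%.*}"     # drop any further components (for example 7.9.2009)
    [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 8 ]; }
}
```

For example, lis_step_skippable 7.7 fails (the lspci check still decides whether LIS is needed), while lis_step_skippable 8.1 succeeds.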

Reconnect to the VM and continue installation with the following commands:

sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

sudo yum install dkms

CUDA_REPO_PKG=cuda-repo-rhel7-10.0.130-1.x86_64.rpm

wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/${CUDA_REPO_PKG} -O /tmp/${CUDA_REPO_PKG}

sudo rpm -ivh /tmp/${CUDA_REPO_PKG}

rm -f /tmp/${CUDA_REPO_PKG}

sudo yum install cuda-drivers

The installation can take several minutes.

Note

Visit the Fedora and NVIDIA CUDA repositories to pick the correct package for the CentOS or RHEL version you want to use.

For example, CentOS 8 and RHEL 8 need the following steps.

sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

sudo yum install dkms

CUDA_REPO_PKG=cuda-repo-rhel8-10.2.89-1.x86_64.rpm

wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/${CUDA_REPO_PKG} -O /tmp/${CUDA_REPO_PKG}

sudo rpm -ivh /tmp/${CUDA_REPO_PKG}

rm -f /tmp/${CUDA_REPO_PKG}

sudo yum install cuda-drivers

To optionally install the complete CUDA toolkit, type:

sudo yum install cuda

Note

If you see an error message related to missing packages like vulkan-filesystem, you may need to edit /etc/yum.repos.d/rh-cloud, look for optional-rpms, and set enabled to 1.
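The edit described in this note can be sketched with awk. The helper below prints a copy of a yum repository file in which every section whose name contains optional-rpms has enabled set to 1. The function name enable_optional_rpms is hypothetical, and the sketch assumes the usual layout of [section] headers followed by key=value lines. It writes to standard output instead of changing the file, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the repo file given as $1 with enabled=1
# in every section whose header mentions optional-rpms.
enable_optional_rpms() {
    awk '
        # A new [section] header: remember whether it is an optional-rpms one.
        /^\[/ { in_optional = ($0 ~ /optional-rpms/) }
        # Inside such a section, rewrite the enabled setting.
        in_optional && /^enabled[ \t]*=/ { print "enabled=1"; next }
        # Everything else is passed through unchanged.
        { print }
    ' "$1"
}
```

You could redirect the output to a temporary file, compare it with the original, and then replace the original as root.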

Reboot the VM and proceed to verify the installation.

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.

If the driver is installed, you will see output similar to the following. Note that GPU-Util shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

[Image: nvidia-smi output showing GPU device status]
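For scripted verification, nvidia-smi can also print selected fields in CSV form. The sketch below wraps that query in a hypothetical function, report_gpu_state, which first checks that nvidia-smi is on the PATH. The choice of fields (name, driver version, and GPU utilization) is an assumption about what is useful to check here.

```shell
#!/bin/sh
# Hypothetical helper: print one CSV line per GPU, or a hint if the
# driver utility is missing.
report_gpu_state() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver may not be installed"
        return 1
    fi
    # Fields: GPU name, driver version, current GPU utilization.
    nvidia-smi --query-gpu=name,driver_version,utilization.gpu --format=csv,noheader
}
```

A non-zero exit status from report_gpu_state can be used in a provisioning script to stop before GPU workloads are scheduled.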

RDMA network connectivity

RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or in a single placement group in a virtual machine (VM) scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:

Distributions

Deploy RDMA-capable N-series VMs from one of the images in the Azure Marketplace that supports RDMA connectivity on N-series VMs:

Ubuntu 16.04 LTS - Configure RDMA drivers on the VM and register with Intel to download Intel MPI:

Install dapl, rdmacm, ibverbs, and mlx4

sudo apt-get update

sudo apt-get install libdapl2 libmlx4-1

In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines. You need root access to edit this file.

OS.EnableRDMA=y

OS.UpdateRdmaDriver=y
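Uncommenting the two settings can be sketched with sed. The helper below prints a copy of the configuration file with the leading # removed from those two lines only. The function name enable_rdma_settings is hypothetical, and the sketch assumes the lines appear in the file as comments of the form "# OS.EnableRDMA=y".

```shell
#!/bin/sh
# Hypothetical helper: print the file given as $1 with the two RDMA
# settings uncommented. All other lines are left as they are.
enable_rdma_settings() {
    sed -E 's/^[[:space:]]*#[[:space:]]*(OS\.(EnableRDMA|UpdateRdmaDriver)=y)/\1/' "$1"
}
```

Because the helper writes to standard output, the original /etc/waagent.conf stays untouched until you review the result and copy it into place as root.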

Add or change the following memory settings in KB in the /etc/security/limits.conf file. You need root access to edit this file. In limits.conf, each entry begins with the user or group name that the limit applies to and ends with the value. For testing purposes you can set memlock to unlimited. For example: hard memlock unlimited.

hard memlock

soft memlock

Install Intel MPI Library. Either purchase and download the library from Intel or download the free evaluation version.

wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/9278/l_mpi_p_5.1.3.223.tgz

Only Intel MPI 5.x runtimes are supported.

Enable ptrace for non-root non-debugger processes (needed for the most recent versions of Intel MPI).

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

CentOS-based 7.4 HPC - RDMA drivers and Intel MPI 5.1 are installed on the VM.

CentOS-based HPC - CentOS-HPC 7.6 and later (for SKUs where InfiniBand is supported over SR-IOV). These images have Mellanox OFED and MPI libraries pre-installed.

Note

CX3-Pro cards are supported only through LTS versions of Mellanox OFED. Use LTS Mellanox OFED version (4.9-0.1.7.0) on the N-series VMs with ConnectX3-Pro cards. For more information, see Linux Drivers.

Also, some of the latest Azure Marketplace HPC images have Mellanox OFED 5.1 and later, which don't support ConnectX3-Pro cards. Check the Mellanox OFED version in the HPC image before using it on VMs with ConnectX3-Pro cards.

The following images are the latest CentOS-HPC images that support ConnectX3-Pro cards:

OpenLogic:CentOS-HPC:7.6:7.6.2020062900

OpenLogic:CentOS-HPC:7_6gen2:7.6.2020062901

OpenLogic:CentOS-HPC:7.7:7.7.2020062600

OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020062601

OpenLogic:CentOS-HPC:8_1:8.1.2020062400

OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020062401
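The version check recommended in the note can be sketched as a comparison against 5.1, the first Mellanox OFED version the note names as not supporting ConnectX3-Pro cards. The function name ofed_too_new_for_cx3pro is hypothetical. The input is assumed to be a version string such as 4.9-0.1.7.0, optionally with the MLNX_OFED_LINUX- prefix that the ofed_info -s command typically prints.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the given Mellanox OFED version is
# 5.1 or later, which the note above lists as not supporting
# ConnectX3-Pro cards. Assumes a "major.minor-..." version string.
ofed_too_new_for_cx3pro() {
    ver="${1#MLNX_OFED_LINUX-}"   # drop the prefix if present
    major="${ver%%.*}"            # text before the first dot
    rest="${ver#*.}"              # text after the first dot
    minor="${rest%%[!0-9]*}"      # leading digits of the remainder
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 1 ]; }
}
```

A failing result only means that the version is older than 5.1. It does not by itself confirm ConnectX3-Pro support, for which the note recommends the LTS version.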

Install GRID drivers on NV or NVv3-series VMs

To install NVIDIA GRID drivers on NV or NVv3-series VMs, make an SSH connection to each VM and follow the steps for your Linux distribution.

Ubuntu

Run the lspci command. Verify that the NVIDIA M60 card or cards are visible as PCI devices.

Install updates.

sudo apt-get update

sudo apt-get upgrade -y

sudo apt-get dist-upgrade -y

sudo apt-get install build-essential ubuntu-desktop -y

sudo apt-get install linux-azure -y

Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv3 VMs.) To do this, create a file in /etc/modprobe.d named nouveau.conf with the following contents:

blacklist nouveau

blacklist lbm-nouveau
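Creating the file can also be scripted. The sketch below writes the two blacklist lines to nouveau.conf in a target directory. The function name write_nouveau_blacklist and its optional directory argument are hypothetical. The argument exists so that the helper can be tried against a scratch directory before it is run as root against /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper: write nouveau.conf with the two blacklist lines.
# $1: target directory (defaults to /etc/modprobe.d, which needs root).
write_nouveau_blacklist() {
    target_dir="${1:-/etc/modprobe.d}"
    printf '%s\n' 'blacklist nouveau' 'blacklist lbm-nouveau' \
        > "${target_dir}/nouveau.conf"
}
```

The change takes effect after the reboot described in the next step.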

Reboot the VM and reconnect. Exit X server:

sudo systemctl stop lightdm.service

Download and install the GRID driver:

wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272

chmod +x NVIDIA-Linux-x86_64-grid.run

sudo ./NVIDIA-Linux-x86_64-grid.run

When you're asked whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.

After installation completes, copy /etc/nvidia/gridd.conf.template to a new file named gridd.conf in /etc/nvidia/:

sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

Add the following to /etc/nvidia/gridd.conf:

IgnoreSP=FALSE

EnableUI=FALSE

Remove the following from /etc/nvidia/gridd.conf if it is present:

FeatureType=0
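The two gridd.conf edits (add IgnoreSP and EnableUI, remove FeatureType=0) can be combined in one pass. The sketch below prints the edited configuration to standard output. The function name update_gridd_conf is hypothetical, and dropping any existing IgnoreSP or EnableUI lines before appending the new ones is an added assumption to avoid duplicate entries.

```shell
#!/bin/sh
# Hypothetical helper: print the gridd.conf given as $1 without
# FeatureType=0 and with the two settings from this article appended.
update_gridd_conf() {
    # Drop FeatureType=0 and any earlier IgnoreSP/EnableUI entries.
    # grep exits non-zero when it prints nothing, hence "|| true".
    grep -v -E '^(FeatureType=0|IgnoreSP=.*|EnableUI=.*)$' "$1" || true
    printf '%s\n' 'IgnoreSP=FALSE' 'EnableUI=FALSE'
}
```

You could write the output to a temporary file and then move it to /etc/nvidia/gridd.conf as root.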

Reboot the VM and proceed to verify the installation.

CentOS or Red Hat Enterprise Linux

Update the kernel and DKMS (recommended). If you choose not to update the kernel, ensure that the versions of kernel-devel and dkms are appropriate for your kernel.

sudo yum update

sudo yum install kernel-devel

sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

sudo yum install dkms

sudo yum install hyperv-daemons

Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV or NVv3 VMs.) To do this, create a file in /etc/modprobe.d named nouveau.conf with the following contents:

blacklist nouveau

blacklist lbm-nouveau

Reboot the VM, reconnect, and install the latest Linux Integration Services for Hyper-V and Azure. Check if LIS is required by verifying the results of lspci. If all GPU devices are listed as expected (and documented above), installing LIS is not required.

Skip this step if you plan to use CentOS/RHEL 7.8 (or higher versions), as LIS is no longer required for these versions.

wget https://aka.ms/lis

tar xvzf lis

cd LISISO

sudo ./install.sh

sudo reboot

Reconnect to the VM and run the lspci command. Verify that the NVIDIA M60 card or cards are visible as PCI devices.

Download and install the GRID driver:

wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272

chmod +x NVIDIA-Linux-x86_64-grid.run

sudo ./NVIDIA-Linux-x86_64-grid.run

When you're asked whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.

After installation completes, copy /etc/nvidia/gridd.conf.template to a new file named gridd.conf in /etc/nvidia/:

sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

Add the following to /etc/nvidia/gridd.conf:

IgnoreSP=FALSE

EnableUI=FALSE

Remove the following from /etc/nvidia/gridd.conf if it is present:

FeatureType=0

Reboot the VM and proceed to verify the installation.

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.

If the driver is installed, you will see output similar to the following. Note that GPU-Util shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

[Image: nvidia-smi output showing GPU device status]

X11 server

If you need an X11 server for remote connections to an NV or NVv3 VM, x11vnc is recommended because it allows hardware acceleration of graphics. The BusID of the M60 device must be manually added to the X11 configuration file (usually /etc/X11/xorg.conf). Add a "Device" section similar to the following:

Section "Device"

Identifier "Device0"

Driver "nvidia"

VendorName "NVIDIA Corporation"

BoardName "Tesla M60"

BusID "PCI:0@your-BusID:0:0"

EndSection

Additionally, update your "Screen" section to use this device.

The decimal BusID can be found by running:

nvidia-xconfig --query-gpu-info | awk '/PCI BusID/{print $4}'

The BusID can change when a VM gets reallocated or rebooted. Therefore, you may want to create a script to update the BusID in the X11 configuration when a VM is rebooted. For example, create a script named busidupdate.sh (or another name you choose) with contents similar to the following:

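#!/bin/bash
# Update the NVIDIA BusID in the X11 configuration if it has changed.
XCONFIG="/etc/X11/xorg.conf"

# BusID currently recorded in the X11 configuration (quotes stripped).
OLDBUSID=$(awk '/BusID/{gsub(/"/, "", $2); print $2}' "${XCONFIG}")
# BusID currently reported by the NVIDIA driver.
NEWBUSID=$(nvidia-xconfig --query-gpu-info | awk '/PCI BusID/{print $4}')

if [[ "${OLDBUSID}" == "${NEWBUSID}" ]]; then
    echo "NVIDIA BUSID not changed - nothing to do"
else
    echo "NVIDIA BUSID changed from \"${OLDBUSID}\" to \"${NEWBUSID}\": Updating ${XCONFIG}"
    # Rewrite the BusID line in place with the new value in double quotes.
    sed -e 's|BusID.*|BusID "'"${NEWBUSID}"'"|' -i "${XCONFIG}"
fi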

Then, create an entry for your update script in /etc/rc.d/rc3.d so the script is invoked as root on boot.

Troubleshooting

You can set persistence mode using nvidia-smi so the output of the command is faster when you need to query cards. To set persistence mode, execute nvidia-smi -pm 1. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.

If you updated the NVIDIA CUDA drivers to the latest version and find RDMA connectivity is no longer working, reinstall the RDMA drivers to reestablish that connectivity.

During installation of LIS, if a certain CentOS/RHEL OS version (or kernel) is not supported for LIS, an "Unsupported kernel version" error is thrown. Please report this error along with the OS and kernel versions.

Next steps

To capture a Linux VM image with your installed NVIDIA drivers, see How to generalize and capture a Linux virtual machine.
