一、vGPU Product Types
1、Overview of NVIDIA vGPU
Using slice-based (mediated) virtualization, NVIDIA vGPU repartitions a physical GPU card so that a single card, once virtualized, can be allocated to different cloud hosts, allowing multiple virtual machines to share the compute capability of one GPU.
NVIDIA offers four vGPU product types for different scenarios. Each vGPU type requires the corresponding software license at runtime, and the operating-system requirements also differ.
(1) NVIDIA GRID Virtual Applications (GRID vApps)
For virtual applications / Windows Remote Desktop Session Host (RDSH) / shared desktops
(2) NVIDIA GRID Virtual PC (GRID vPC)
For virtual desktop deployments; runs all standard PC applications
(3) NVIDIA Quadro Virtual Data Center Workstation (Quadro vDWS)
For professional graphics and visualization, as well as compute scenarios such as AI/DL/HPC
(4) NVIDIA Virtual Compute Server (vCS)
For compute-intensive scenarios such as artificial intelligence (AI), deep learning (DL), and high-performance computing (HPC); Linux guests only
Series   | Optimal Workload
Q-series | Virtual workstations for creative and technical professionals who require the performance and features of Quadro technology
C-series | Compute-intensive server workloads, such as artificial intelligence (AI), deep learning, or high-performance computing (HPC)
B-series | Virtual desktops for business professionals and knowledge workers
A-series | App streaming or session-based solutions for virtual applications users
2、How to Choose a Suitable vGPU
二、Creating a vGPU Device on KVM
The following walkthrough creates a vGPU device of type T4-8Q as an example.
1、Install the vGPU driver on the physical host
The Linux driver on the GPU host builds a kernel module during installation, so the compute node must be rebooted after installing the vGPU-KVM driver.
Before the reboot there is no mdev device information (no mdev_supported_types directory under the corresponding PCI device directory);
after the reboot the mdev device information appears (the mdev_supported_types directory is visible under the PCI device directory), as the directory listings and the check sketch below show.
[root@localhost ~]# chmod +x NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run
[root@localhost ~]# ./NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run -s
# Configure the driver: disable ECC and enable persistence mode
[root@localhost ~]# nvidia-smi -e 0
[root@localhost ~]# nvidia-smi -pm 1
# Reboot the node so that the driver takes effect
[root@localhost ~]# reboot
[root@localhost vgpu]# ./NVIDIA-Linux-x86_64-430.67-vgpu-kvm.run -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 430.67.....................................................................................................................................................
......................................................
[root@localhost vgpu]#
[root@localhost vgpu]# nvidia-smi -e 0
ECC support is already Disabled for GPU 00000000:3D:00.0.
ECC support is already Disabled for GPU 00000000:3E:00.0.
ECC support is already Disabled for GPU 00000000:40:00.0.
ECC support is already Disabled for GPU 00000000:41:00.0.
ECC support is already Disabled for GPU 00000000:B1:00.0.
ECC support is already Disabled for GPU 00000000:B2:00.0.
ECC support is already Disabled for GPU 00000000:B4:00.0.
ECC support is already Disabled for GPU 00000000:B5:00.0.
All done.
[root@localhost vgpu]# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:3D:00.0.
Enabled persistence mode for GPU 00000000:3E:00.0.
Enabled persistence mode for GPU 00000000:40:00.0.
Enabled persistence mode for GPU 00000000:41:00.0.
Enabled persistence mode for GPU 00000000:B1:00.0.
Enabled persistence mode for GPU 00000000:B2:00.0.
Enabled persistence mode for GPU 00000000:B4:00.0.
Enabled persistence mode for GPU 00000000:B5:00.0.
All done.
[root@localhost vgpu]#
[root@localhost ]# lspci -nn| grep 3D
3d:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
3e:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
40:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
41:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
b1:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
b2:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
b4:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
b5:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1eb8] (rev a1)
[root@localhost 0000:3d:00.0]#
[root@localhost 0000:3d:00.0]# cd /sys/bus/pci/devices/0000\:3d\:00.0/
[root@localhost 0000:3d:00.0]# ls
broken_parity_status consistent_dma_mask_bits dma_mask_bits enable iommu local_cpulist msi_bus power reset resource1 resource3_wc subsystem uevent
class d3cold_allowed driver i2c-4 iommu_group local_cpus msi_irqs remove resource resource1_wc sriov_numvfs subsystem_device vendor
config device driver_override i2c-5 irq modalias numa_node rescan resource0 resource3 sriov_totalvfs subsystem_vendor
[root@localhost 0000:3d:00.0]#
[root@localhost 0000:3d:00.0]# reboot
[root@localhost ~]# cd /sys/bus/pci/devices/0000\:3d\:00.0/
[root@localhost 0000:3d:00.0]#
[root@localhost 0000:3d:00.0]# ls
broken_parity_status consistent_dma_mask_bits dma_mask_bits enable iommu local_cpulist modalias numa_node rescan resource0 resource3 sriov_totalvfs subsystem_vendor
class d3cold_allowed driver i2c-4 iommu_group local_cpus msi_bus power reset resource1 resource3_wc subsystem uevent
config device driver_override i2c-5 irq mdev_supported_types msi_irqs remove resource resource1_wc sriov_numvfs subsystem_device vendor
[root@localhost 0000:3d:00.0]#
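As a quick sanity check after the reboot, a loop like the following can confirm that every GPU now exposes mdev_supported_types (a minimal sketch; it assumes the vGPU host driver binds the GPUs under the "nvidia" PCI driver, as in the listings above):

# Minimal sketch: check that each GPU bound to the nvidia driver exposes mdev_supported_types
for dev in /sys/bus/pci/drivers/nvidia/0000:*; do
    if [ -d "$dev/mdev_supported_types" ]; then
        echo "$dev: mdev support present"
    else
        echo "$dev: mdev support missing (driver not active or reboot still pending)"
    fi
done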
2、Change to the mdev_supported_types directory of the physical GPU
The device directory name (0000:3d:00.0 in this example) follows the domain:bus:slot.function format.
[root@localhost ~]# cd /sys/bus/pci/devices/0000\:3d\:00.0/mdev_supported_types/
3、List the vGPU types that can be created from the mdev_supported_types subdirectories
Check the supported vGPU types. To find the subdirectory that corresponds to a specific type (here T4-8Q):
[root@localhost mdev_supported_types]# grep -l "T4-8Q" nvidia-*/name
nvidia-233/name
# Alternatively, loop over the subdirectories with a small shell script to list all supported vGPU types
[root@localhost mdev_supported_types]# for i in `ls`;do cat $i/name |awk '{print $2}'; done;
4、Check how many instances of the vGPU type can be created on the physical GPU
Note: if available_instances returns 0, then either an instance of a different vGPU type already exists on this physical GPU, or the maximum allowed number of instances of this type has already been created.
[root@localhost mdev_supported_types]# cat nvidia-233/available_instances
2
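To see the name and remaining capacity of every supported type at a glance, a short loop can be run from the mdev_supported_types directory (a minimal sketch based on the directory layout shown above):

# Minimal sketch: print each vGPU type's name and how many instances can still be created
for t in nvidia-*; do
    echo "$t: $(cat $t/name), available_instances=$(cat $t/available_instances)"
done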
5、Generate a random UUID and write it to the create file
Use uuidgen to generate a random UUID and write it into the create file under the directory of the vGPU type to be created.
[root@localhost mdev_supported_types]# echo "aa618089-8b16-4d01-a136-25a0f3c73123" > nvidia-233/create
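The command above uses a fixed UUID for illustration; in practice the UUID is usually generated on the fly, for example (a minimal sketch, reusing the nvidia-233 type from this walkthrough):

# Minimal sketch: generate a fresh UUID and use it to create the mdev device
UUID=$(uuidgen)
echo "$UUID" > nvidia-233/create
echo "created mdev device $UUID"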
6、Confirm that the vGPU device has been created
[root@localhost ~]# ls -l /sys/bus/mdev/devices/
lrwxrwxrwx. 1 root root 0 Nov 24 13:33 aa618089-8b16-4d01-a136-25a0f3c73123 -> ../../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:09.0/0000:06:00.0/aa618089-8b16-4d01-a136-25a0f3c73123
At this point the vGPU device has been created.
mdev devices can also be created in a loop with a shell script (see the sketch below), or through other components (for example, the Nova component in OpenStack Queens creates mdev devices automatically during initialization).
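A batch-creation script along those lines might look like the following (a minimal sketch; the PCI address 0000:3d:00.0 and type nvidia-233 come from this walkthrough, and COUNT is a hypothetical parameter):

#!/bin/bash
# Minimal sketch: create COUNT mdev devices of one vGPU type on one physical GPU
TYPE_DIR=/sys/bus/pci/devices/0000:3d:00.0/mdev_supported_types/nvidia-233
COUNT=2

for i in $(seq 1 "$COUNT"); do
    # Stop once the physical GPU cannot host another instance of this type
    if [ "$(cat $TYPE_DIR/available_instances)" -eq 0 ]; then
        echo "no more instances available for $(cat $TYPE_DIR/name)"
        break
    fi
    UUID=$(uuidgen)
    echo "$UUID" > "$TYPE_DIR/create"
    echo "created mdev device $UUID"
done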
三、Viewing the mdev Device List Generated on the Node
1、Before a UUID is written to the create file
Note: before a UUID is written to the create file under a given vGPU type directory, the available_instances count under that vGPU type's directory