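Background
BAR1 (Base Address Register 1) is a PCI register used for memory mapping; it defines one of the device's memory-mapped regions. On a GPU, BAR1 is assigned to a window of GPU memory and lets the CPU access data in video memory (VRAM) directly over the PCIe bus. The size of BAR1 is limited, however: on a stock RTX 4090 it is only 256M. This write-up builds on NVIDIA's open-source open-gpu-kernel-modules to enlarge the BAR1 address range to 32G, with the goal of improving compute efficiency.
System version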
root@exai-165:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@exai-165:~# uname -a
Linux exai-165 6.5.0-44-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Jun 18 14:36:16 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Implementation
- Build the open-source NVIDIA driver modules
- Build the P2P modules
BAR1 size before patching
root@exai-165:/opt# lspci -s 0000:81:00.0 -v
81:00.0 VGA compatible controller: NVIDIA Corporation Device 2684 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 167c
Flags: bus master, fast devsel, latency 0, IRQ 164, IOMMU group 27
Memory at b8000000 (32-bit, non-prefetchable) [size=16M]
Memory at 20030000000 (64-bit, prefetchable) [size=256M] # this is BAR1
Memory at 20040000000 (64-bit, prefetchable) [size=32M]
I/O ports at 6000 [size=128]
Expansion ROM at b9000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
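The BAR sizes that lspci reports can also be read from sysfs, which is handy for comparing before and after in a script. The following is a minimal sketch, assuming the standard Linux sysfs layout in which each line of the device's resource file holds "start end flags" in hex; the helper name bar_size_bytes is made up for this example, and the PCI address defaults to the one from the lspci output above.

```shell
#!/bin/sh
# Sketch: print the size of each PCI resource (BAR) of one device from sysfs.
# Assumption: every line of .../resource is "start end flags" in hex.

# bar_size_bytes is a hypothetical helper: size = end - start + 1,
# and an all-zero line (unused resource) is reported as 0.
bar_size_bytes() {
    # $1 = start address, $2 = end address (hex with 0x prefix)
    if [ "$(( $2 ))" -eq 0 ]; then
        echo 0
    else
        echo "$(( $2 - $1 + 1 ))"
    fi
}

DEV="${1:-0000:81:00.0}"                 # PCI address of the GPU
RES="/sys/bus/pci/devices/${DEV}/resource"

if [ -r "$RES" ]; then
    idx=0
    while read -r start end _flags; do
        size=$(bar_size_bytes "$start" "$end")
        if [ "$size" -gt 0 ]; then
            echo "resource ${idx}: ${size} bytes"
        fi
        idx=$(( idx + 1 ))
    done < "$RES"
fi
```

For the 256M region shown above, the range 0x20030000000 to 0x2003fffffff works out to 268435456 bytes. Note that the resource index in sysfs follows the BAR register number, so a 64-bit BAR leaves the following index empty.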
NVIDIA driver modules
Uninstall the driver already installed on the machine
./NVIDIA-Linux-x86_64-535.183.01.run --uninstall
Clone the open-source driver
Configure git to use a proxy yourself if you need one
git clone --branch 550.54.15 --single-branch https://github.com/NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
git branch
git checkout -b 550.54.15
The machine's default CC and the gcc used to build the kernel are not the same version, so tell make explicitly which gcc to use
make CC=x86_64-linux-gnu-gcc-12 modules -j$(nproc)
make modules_install CC=x86_64-linux-gnu-gcc-12 -j$(nproc)
Note: switching the cc version through the machine's multi-version management tool did not take effect
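Rather than hard-coding the compiler name, it can usually be read from /proc/version, which records the compiler the running kernel was built with. The sketch below assumes the string has the layout "Linux version <release> (<builder>) (<compiler> ...) ..."; other distributions may format it differently, and kernel_cc is a helper name invented for this example. It only prints a suggested make command and does not build anything.

```shell
#!/bin/sh
# Sketch: find the compiler that built the running kernel and suggest a CC=.
# Assumption: /proc/version looks like
#   Linux version <release> (<builder>) (<compiler> ...) ...

# kernel_cc is a hypothetical helper: prints the compiler name, or nothing
# if the string does not match the assumed layout.
kernel_cc() {
    # $1 = contents of /proc/version
    printf '%s\n' "$1" |
        sed -nE 's/^Linux version [^ ]+ \([^)]*\) \(([^ ]+) .*/\1/p'
}

if [ -r /proc/version ]; then
    cc_name=$(kernel_cc "$(cat /proc/version)")
    if [ -n "$cc_name" ] && command -v "$cc_name" >/dev/null 2>&1; then
        # Print the command instead of running it, so it can be reviewed first.
        echo "make CC=${cc_name} modules -j\$(nproc)"
    else
        echo "could not determine the kernel compiler; set CC manually" >&2
    fi
fi
```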
Verification
root@exai-165:~# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 550.54.15 Release Build (root@exai-165) 2024年 09月 06日 星期五 10:49:38 CST
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
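The same check can be scripted so that a stale proprietary module is caught after a reboot. This is a minimal sketch that assumes the NVRM line keeps the shape shown above; nvrm_version is a hypothetical helper, and EXPECTED is simply the version built in this write-up.

```shell
#!/bin/sh
# Sketch: confirm the loaded NVIDIA kernel module is the expected open build.

# nvrm_version is a hypothetical helper: prints the first dotted version
# number found on the "NVRM version:" line of the given text.
nvrm_version() {
    # $1 = contents of /proc/driver/nvidia/version
    printf '%s\n' "$1" |
        grep '^NVRM version:' |
        grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' |
        head -n 1
}

EXPECTED="550.54.15"                      # version built in this write-up
VFILE=/proc/driver/nvidia/version

if [ -r "$VFILE" ]; then
    text=$(cat "$VFILE")
    loaded=$(nvrm_version "$text")
    if [ "$loaded" != "$EXPECTED" ]; then
        echo "version mismatch: loaded ${loaded}, expected ${EXPECTED}" >&2
    elif ! printf '%s\n' "$text" | grep -q 'Open Kernel Module'; then
        echo "version matches, but this is not the open kernel module" >&2
    else
        echo "open kernel module ${loaded} is loaded"
    fi
fi
```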
P2P
https://github.com/tinygrad/open-gpu-kernel-modules
Clone and build it; following the README works without problems
root@exai-165:/opt/nvidia-p2p/open-gpu-kernel-modules# ./install.sh
make -C src/nvidia
make -C src/nvidia-modeset
make[1]: Entering directory '/opt/nvidia-p2p/open-gpu-kernel-modules/src/nvidia'
make[1]: Entering directory '/opt/nvidia-p2p/open-gpu-kernel-modules/src/nvidia-modeset'
make[1]: Nothing to be done for 'default'.
make[1]: Leaving directory '/opt/nvidia-p2p/open-gpu-kernel-modules/src/nvidia-modeset'
cd kernel-open/nvidia-modeset/ && ln -sf ../../src/nvidia-modeset/_out/Linux_x86_64/nv-modeset-kernel.o nv-modeset-kernel.o_binary
make[1]: Nothing to be done for 'default'.
make[1]: Leaving directory '/opt/nvidia-p2p/open-gpu-kernel-modules/src/nvidia'
cd kernel-open/nvidia/ && ln -sf ../../src/nvidia/_out/Linux_x86_64/nv-kernel.o nv-kernel.o_binary
make -C kernel-open modules
make[1]: Entering directory '/opt/nvidia-p2p/open-gpu-kernel-modules/kernel-open'
make[2]: Entering directory '/usr/src/linux-headers-6.5.0-44-generic'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
You are using: cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
make[2]: Leaving directory '/usr/src/linux-headers-6.5.0-44-generic'
make[1]: Leaving directory '/opt/nvidia-p2p/open-gpu-kernel-modules/kernel-open'
make -C kernel-open modules_install
make[1]: Entering directory '/opt/nvidia-p2p/open-gpu-kernel-modules/kernel-open'
make[2]: Entering directory '/usr/src/linux-headers-6.5.0-44-generic'
INSTALL /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia.ko
INSTALL /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-uvm.ko
INSTALL /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-modeset.ko
INSTALL /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-drm.ko
INSTALL /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-peermem.ko
SIGN /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-peermem.ko
SIGN /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-modeset.ko
SIGN /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-drm.ko
SIGN /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia.ko
SIGN /lib/modules/6.5.0-44-generic/kernel/drivers/video/nvidia-uvm.ko
DEPMOD /lib/modules/6.5.0-44-generic
Warning: modules_install: missing 'System.map' file. Skipping depmod.
make[2]: Leaving directory '/usr/src/linux-headers-6.5.0-44-generic'
make[1]: Leaving directory '/opt/nvidia-p2p/open-gpu-kernel-modules/kernel-open'
Fri Sep 6 15:24:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 30%   36C    P0             53W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:81:00.0 Off |                  Off |
| 31% 44C P0 69W / 450W |<