Problem: running "/opt/app/singularity/bin/singularity exec --nv xxx.sif nvidia-smi" prints:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
Solution:
1. Search for the file with "sudo find / -name libnvidia-ml.so". The system returns:
/opt/app/cuda/12.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.7/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/nvidia/525.60.13/lib/libnvidia-ml.so
/opt/app/nvidia/525.60.13/lib32/libnvidia-ml.so
/opt/app/nvidia/535.154.05/lib/libnvidia-ml.so
/opt/app/nvidia/535.154.05/lib32/libnvidia-ml.so
/opt/app/nvidia/460.91.03/lib/libnvidia-ml.so
/opt/app/nvidia/460.91.03/lib32/libnvidia-ml.so
/opt/app/nvidia/450.216.04/lib/libnvidia-ml.so
2. Pick one of the following, matching the driver version that nvidia-smi reports on the host:
/opt/app/nvidia/525.60.13/lib/libnvidia-ml.so
/opt/app/nvidia/525.60.13/lib32/libnvidia-ml.so
/opt/app/nvidia/535.154.05/lib/libnvidia-ml.so
/opt/app/nvidia/535.154.05/lib32/libnvidia-ml.so
/opt/app/nvidia/460.91.03/lib/libnvidia-ml.so
/opt/app/nvidia/460.91.03/lib32/libnvidia-ml.so
/opt/app/nvidia/450.216.04/lib/libnvidia-ml.so
None of the following can be used:
/opt/app/cuda/12.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.7/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/opt/app/cuda/11.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
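Reason: the copies under .../lib/stubs/ are stub libraries meant only for building. Using one of them produces: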
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
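The selection in steps 1 and 2 can be sketched as a small shell filter. This is only an illustration: the function name pick_driver_lib is hypothetical, and it skips the lib32 copies on the assumption that nvidia-smi inside the container is a 64-bit binary.

```shell
# pick_driver_lib DRIVER_VERSION
# Reads candidate libnvidia-ml.so paths on stdin (one per line) and prints
# the first path that is not a stub, is not under lib32, and sits in a
# directory named after DRIVER_VERSION.
# Assumption: the driver libraries live under .../<version>/lib/ as in the
# find output above.
pick_driver_lib() {
    version="$1"
    grep -v '/stubs/' | grep -v '/lib32/' | grep -F "/${version}/" | head -n 1
}

# Possible usage on the host (not run here):
#   host_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   sudo find / -name libnvidia-ml.so 2>/dev/null | pick_driver_lib "$host_version"
```

If nothing matches the host driver version, the function prints nothing, which is a hint that the matching driver directory is not installed under the searched paths.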
3. Then, on the host, run: /opt/app/singularity/bin/singularity shell --nv -B /opt/app/nvidia/535.154.05/lib/ xxx.sif
4. Inside the container, run: export LD_LIBRARY_PATH=/opt/app/nvidia/535.154.05/lib/:$LD_LIBRARY_PATH
5. Now run nvidia-smi; it should show its normal output.
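Steps 3 to 5 can probably be collapsed into a single non-interactive call by using "singularity exec" with a small "sh -c" wrapper instead of an interactive shell. The following is a minimal sketch, not a tested recipe: the function name nvsmi_in_container and the SINGULARITY override variable are hypothetical, and it assumes the image provides sh and that the library directory contains no spaces.

```shell
# Path to the singularity binary used in the steps above; override the
# SINGULARITY variable if yours is installed elsewhere.
SINGULARITY="${SINGULARITY:-/opt/app/singularity/bin/singularity}"

# nvsmi_in_container LIB_DIR IMAGE
# Binds LIB_DIR (the host directory holding the real libnvidia-ml.so) into
# the container at the same path, prepends it to LD_LIBRARY_PATH inside the
# container, and then runs nvidia-smi there.
nvsmi_in_container() {
    lib_dir="$1"
    image="$2"
    # \$LD_LIBRARY_PATH is escaped so it expands inside the container,
    # not on the host.
    "$SINGULARITY" exec --nv -B "$lib_dir" "$image" \
        sh -c "export LD_LIBRARY_PATH=$lib_dir:\$LD_LIBRARY_PATH; nvidia-smi"
}

# Possible usage (not run here):
#   nvsmi_in_container /opt/app/nvidia/535.154.05/lib xxx.sif
```

The export only lasts for that one call, so a later "singularity shell" session still needs step 4.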