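Background:
- When I migrated an environment from AutoDL to my school's Slurm cluster, I ran into CUDA errors.
- AutoDL had spoiled me: its images come with CUDA preinstalled, whereas the school's servers ship without the CUDA toolkit.
Problem 1 this post solves:
- Installing mmdet3d needs a GPU plus the full CUDA toolkit, but the environment only has conda's cudatoolkit package and no CUDA install, so the installation fails: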
(sparseocc) schen744@gpu3-11:~/code/sparseocc/mmdetection3d$ pip install -v -e .
Using pip 22.3.1 from /hpc2hdd/home/schen744/.conda/envs/sparseocc/lib/python3.7/site-packages/pip (python 3.7)
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///hpc2hdd/home/schen744/code/sparseocc/mmdetection3d
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "/hpc2hdd/home/schen744/.conda/envs/sparseocc/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 2035, in _join_cuda_home
      raise EnvironmentError('CUDA_HOME environment variable is not set. '
  OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
  error: subprocess-exited-with-error
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /hpc2hdd/home/schen744/.conda/envs/sparseocc/bin/python -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  __file__ = %r
  sys.argv[0] = __file__
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/hpc2hdd/home/schen744/code/sparseocc/mmdetection3d/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' egg_info --egg-base /tmp/pip-pip-egg-info-xinq3w4l
  cwd: /hpc2hdd/home/schen744/code/sparseocc/mmdetection3d/
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(sparseocc) schen744@gpu3-11:~/code/sparseocc/mmdetection3d$ nvcc -V
Command 'nvcc' not found, but can be installed with:
apt install nvidia-cuda-toolkit
Please ask your administrator.
(sparseocc) schen744@gpu3-11:~/code/sparseocc/mmdetection3d$
After this I rebuilt the environment from scratch, but the problem persisted.
Problem 2 this post solves:
(sparseocc) schen744@gpu3-9:~/code/sparseocc$ nvidia-smi
Sun Jun  1 17:11:56 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:35:00.0 Off |                    0 |
|  0%   29C    P8              33W / 300W |     11MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
(sparseocc) schen744@gpu3-9:~/code/sparseocc$ conda list cudatoolkit
# packages in environment at /hpc2hdd/home/schen744/.conda/envs/sparseocc:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.3.1              hb98b00a_13    conda-forge
(sparseocc) schen744@gpu3-9:~/code/sparseocc$ nvcc --version
Command 'nvcc' not found, but can be installed with:
apt install nvidia-cuda-toolkit
Please ask your administrator.
(sparseocc) schen744@gpu3-9:~/code/sparseocc$
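Root cause analysis:
- nvcc (the CUDA compiler) is a core component of the CUDA Toolkit, and our current environment does not have the full toolkit installed. conda list does show cudatoolkit=11.3.1, but conda's cudatoolkit package usually contains only the runtime libraries (such as libcudart.so), not the nvcc compiler or the development tools.
- nvidia-smi shows that our GPU driver supports up to CUDA 12.2 (CUDA Version: 12.2). The conda-installed cudatoolkit=11.3.1 is compatible with it, because NVIDIA drivers are backward compatible with older CUDA toolkits, so a version conflict is not the main cause here.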
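The diagnosis above can be turned into a quick check to run before attempting a build. This is a minimal sketch, assuming a POSIX-style shell; the function name find_nvcc is made up for illustration. It looks in $CUDA_HOME/bin first and then on PATH, and fails if no compiler is reachable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CUDA or conda): print the path of a
# reachable nvcc, or fail with a message on stderr.
find_nvcc() {
  if [ -n "${CUDA_HOME:-}" ] && [ -x "$CUDA_HOME/bin/nvcc" ]; then
    # An explicitly configured toolkit wins.
    echo "$CUDA_HOME/bin/nvcc"
  elif command -v nvcc >/dev/null 2>&1; then
    # Otherwise fall back to whatever is on PATH.
    command -v nvcc
  else
    echo "nvcc not found" >&2
    return 1
  fi
}

find_nvcc || echo "No CUDA compiler available: building CUDA extensions will fail."
```

On the cluster described here, this would be expected to report that nvcc is missing until the toolkit is installed and the environment variables are set.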
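Step 1: Manually download the official NVIDIA CUDA Toolkit
If you do not have administrator privileges, download a CUDA version compatible with your driver (for example 11.3 or 12.2) from the NVIDIA CUDA Toolkit Archive. Taking CUDA 11.3 as the example:
- Open the CUDA Toolkit 11.3 download page, or the latest release page (CUDA Toolkit 12.9 Downloads | NVIDIA Developer), and select your system (for example Linux → x86_64 → Ubuntu → 20.04 → runfile).
- Download the installer as the page instructs: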
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
Verify the checksum:
md5sum cuda_11.3.0_465.19.01_linux.run
Output:
(sparseocc) schen744@gpu3-9:~/code/test$ md5sum cuda_11.3.0_465.19.01_linux.run
406cecd830bb369fa4d3bd6f50a39a7a cuda_11.3.0_465.19.01_linux.run
(sparseocc) schen744@gpu3-9:~/code/test$
Compare it against the official checksum list: (developer.download.nvidia.cn/compute/cuda/11.3.0/docs/sidebar/md5sum.txt)
The two values match, so the download is intact.
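Comparing 32 hex characters by eye is error-prone, so the comparison can be left to md5sum itself. A minimal sketch, assuming GNU coreutils md5sum; the helper name verify_md5 is hypothetical, and the digest in the commented usage is the one printed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if FILE has the expected MD5 digest.
# md5sum -c expects "DIGEST  FILENAME" lines (two spaces); --status makes it
# silent and reports the result through the exit code only.
verify_md5() {
  expected=$1
  file=$2
  printf '%s  %s\n' "$expected" "$file" | md5sum -c --status -
}

# Usage with the installer from this post:
# verify_md5 406cecd830bb369fa4d3bd6f50a39a7a cuda_11.3.0_465.19.01_linux.run \
#   && echo "checksum OK" || echo "checksum MISMATCH"
```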
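Installing CUDA without sudo (administrator privileges)
On a shared cluster you will most likely not have sudo (administrator) rights, so the approach below avoids them.
The official install command "sudo sh cuda_11.3.0_465.19.01_linux.run" will most likely fail for lack of administrator privileges.
Without administrator privileges, the missing nvcc can be fixed by installing the CUDA Toolkit to a custom path. The steps are as follows: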
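Core idea
The CUDA runfile installer supports installing to a non-system path (a user-writable directory such as ~/cuda), which needs no sudo. During installation, select only the Toolkit and skip the Driver and the system-level components (such as the cuda symlink under /usr/local).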
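Step 2: Install the CUDA Toolkit without administrator privileges
We have already downloaded the cuda_11.3.0_465.19.01_linux.run installer, so we can run it directly and point it at a custom path.
2.1 Run the installer and skip the driver
Start the installation with the following command (no sudo needed):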
sh cuda_11.3.0_465.19.01_linux.run \
--toolkit \
--toolkitpath=/hpc2hdd/home/schen744/code/test/cuda-11.3 \
--silent
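Parameter notes:
These flags come from the NVIDIA installer itself. The full list can be printed with:
sh cuda_11.3.0_465.19.01_linux.run --help
- --toolkit: installs only the CUDA Toolkit (which includes nvcc).
- --toolkitpath=/hpc2hdd/home/schen744/code/test/cuda-11.3: installs the toolkit into the cuda-11.3 directory under /hpc2hdd/home/schen744/code/test/. The path is up to you, but you need write permission there. Paths are the easiest thing to get wrong, so an absolute path is strongly recommended (pwd prints the absolute path of the current directory).
- --silent: runs a silent install with no interactive prompts.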
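Since the path is the most common source of mistakes, it can help to validate it before launching the installer. The following is a minimal sketch, not part of the NVIDIA installer; check_toolkit_path is a hypothetical helper that only checks that the target is an absolute path and that its parent directory exists and is writable.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the value passed to --toolkitpath.
check_toolkit_path() {
  target=$1
  case $target in
    /*) ;;  # absolute path: continue
    *)
      echo "not an absolute path: $target" >&2
      return 1
      ;;
  esac
  parent=$(dirname "$target")
  if [ ! -d "$parent" ] || [ ! -w "$parent" ]; then
    echo "parent directory missing or not writable: $parent" >&2
    return 1
  fi
}

# Usage: run the installer only if the check passes.
# target=$(pwd)/cuda-11.3
# check_toolkit_path "$target" && \
#   sh cuda_11.3.0_465.19.01_linux.run --toolkit --toolkitpath="$target" --silent
```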
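2.2 Verify the installation
Once the installer finishes, check that bin/nvcc exists under the install directory (/hpc2hdd/home/schen744/code/test/cuda-11.3 in this example):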
ls /hpc2hdd/home/schen744/code/test/cuda-11.3/bin/nvcc
If the command prints the path, the toolkit was installed successfully.
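Step 3: Configure environment variables (critical)
The toolkit's bin directory (which holds nvcc) and lib64 directory (which holds the runtime libraries) must be added to the environment variables; otherwise the shell cannot find nvcc.
3.1 Temporary configuration (current terminal only)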
export PATH=/hpc2hdd/home/schen744/code/test/cuda-11.3/bin:$PATH
export LD_LIBRARY_PATH=/hpc2hdd/home/schen744/code/test/cuda-11.3/lib64:$LD_LIBRARY_PATH
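The pip traceback at the top of this post complained specifically that CUDA_HOME was not set. As far as I know, PyTorch's extension builder can also infer the toolkit location from an nvcc found on PATH, but exporting CUDA_HOME explicitly removes the guesswork. A sketch under that assumption, using the install path from this post; the CUDA_PREFIX variable is only a convenience introduced here.

```shell
#!/usr/bin/env bash
# Assumed install prefix: the --toolkitpath value used earlier in this post.
CUDA_PREFIX=/hpc2hdd/home/schen744/code/test/cuda-11.3

# Tell build tools where the toolkit lives.
export CUDA_HOME=$CUDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
# ${VAR:+...} expands only when VAR is non-empty, which avoids a trailing ':'
# (an empty search-path entry) if LD_LIBRARY_PATH was previously unset.
export LD_LIBRARY_PATH=$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```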
3.2 Permanent configuration (all new terminals)
Append the same environment variables to ~/.bashrc (or ~/.zshrc, depending on which shell you use):
echo 'export PATH=/hpc2hdd/home/schen744/code/test/cuda-11.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/hpc2hdd/home/schen744/code/test/cuda-11.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
Then reload the configuration so it takes effect:
source ~/.bashrc
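One caveat with echo ... >> ~/.bashrc is that every rerun appends another copy of the same line. A minimal sketch of an idempotent alternative follows; append_once is a hypothetical helper, and it relies only on the standard grep options -q, -x and -F.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE only if an identical line
# is not already present.
append_once() {
  line=$1
  file=$2
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
  # If the file does not exist yet, grep fails and the line is appended.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $PATH unexpanded until the shell starts):
# append_once 'export PATH=/hpc2hdd/home/schen744/code/test/cuda-11.3/bin:$PATH' ~/.bashrc
```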
Step 4: Verify that nvcc is available
Run the following command to check the nvcc version:
nvcc --version
If the output looks like the following, the installation succeeded:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Problem solved!
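As a last sanity check before rebuilding mmdet3d, the toolkit release can be compared with the CUDA version PyTorch was built against, since a mismatch between the two is a common cause of extension build failures. A minimal sketch; nvcc_release is a hypothetical helper that extracts the release number from nvcc --version output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvcc --version` output on stdin and print the
# number after "release" (for example 11.3).
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage on a machine where nvcc and PyTorch are both installed:
# nvcc --version | nvcc_release
# python -c "import torch; print(torch.version.cuda)"
```

If the two printed versions differ, it is probably worth installing the toolkit release that matches the PyTorch build.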