Installing CUDA 10.0 Without Root + Downgrading gcc + Pitfalls Building DCNv2 for FairMOT


First, the environment.
The competition required CUDA 10.0, so this post installs CUDA 10.0 on Ubuntu 20.04 to reproduce FairMOT. However, Ubuntu 20.04 ships with gcc 9.3 (and the gcc seen inside a freshly created virtual environment varies with the Python version), while compiling DCNv2 against CUDA 10.0 requires gcc 7 or earlier, so a gcc downgrade is needed as well.

  • NVIDIA-SMI 460.80, Driver Version: 460.80, CUDA Version: 11.2
  • Ubuntu 20.04
  • GPU: 2080 Ti
  • CUDA 10.0

Since this is a shared server and I log in as an ordinary user, there is no root access.


Installing CUDA without root

Reference: installing CUDA and cuDNN as a non-root user
Reference: installing CUDA 10.0 and cuDNN as a non-root user
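The gist of those references: the CUDA toolkit can be unpacked into your home directory with the runfile installer, no root required. A minimal sketch (the runfile name and the install path under `$HOME` are assumptions; adjust them to your actual download):

```shell
# Download the CUDA 10.0 runfile from NVIDIA's archive, then install
# the toolkit only (no driver) into a user-writable directory.
sh cuda_10.0.130_410.48_linux.run --silent --toolkit \
    --toolkitpath=$HOME/cuda-10.0
```

Afterwards, point CUDA_HOME, PATH, and LD_LIBRARY_PATH at that directory in `~/.bashrc`.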
Once installed, `nvcc --version` should report:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
```


Downgrading gcc

Reference: changing the Linux gcc version without root (for the current user only)
Then it hit me: why try so hard to change the host's gcc at all, when everything runs inside a virtual environment? It turns out that the gcc inside an anaconda virtual environment does not have to match the host's, so we can simply change anaconda's gcc version instead. We are using a virtual environment, after all.
How do you change anaconda's gcc version?
In the end, all that effort boiled down to a single command:

```shell
conda install https://anaconda.org/brown-data-science/gcc/5.4.0/download/linux-64/gcc-5.4.0-0.tar.bz2
```

Installing gcc 5.4.0 through conda configures the dependencies and environment automatically.
Tears of relief.
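To confirm the downgrade actually took effect inside the environment, a quick sanity check (the env name `fairmot` and the default anaconda3 install path are assumptions):

```shell
conda activate fairmot
which gcc        # should resolve to ~/anaconda3/envs/fairmot/bin/gcc
gcc --version    # should now report 5.4.0
```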


Because of these version issues I stepped into just about every DCNv2 pitfall, big and small; here is the record.

1. Running FairMOT's demo.py fails with ModuleNotFoundError: No module named '_ext'

```
(fairmot) lyp@ubuntu-server:~/FairMOT-master/src$ python demo.py mot --load_model ../models/fairmot_dla34.pth --conf_thres 0.4
Traceback (most recent call last):
  File "demo.py", line 14, in <module>
    from track import eval_seq
  File "/home/lyp/FairMOT-master/src/track.py", line 15, in <module>
    from tracker.multitracker import JDETracker
  File "/home/lyp/FairMOT-master/src/lib/tracker/multitracker.py", line 13, in <module>
    from models.model import create_model, load_model
  File "/home/lyp/FairMOT-master/src/lib/models/model.py", line 11, in <module>
    from .networks.pose_dla_dcn import get_pose_net as get_dla_dcn
  File "/home/lyp/FairMOT-master/src/lib/models/networks/pose_dla_dcn.py", line 16, in <module>
    from dcn_v2 import DCN
  File "/home/lyp/FairMOT-master/DCNv2/dcn_v2.py", line 13, in <module>
    import _ext as _backend
ModuleNotFoundError: No module named '_ext'
```

Fix:
The extension under /DCNv2/ must be recompiled for the current environment: delete that directory's build folder (if it exists), then run python setup.py build develop to regenerate a build that matches your own setup.
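Concretely, the rebuild looks like this (paths as used elsewhere in this post):

```shell
cd ~/FairMOT-master/DCNv2
rm -rf build *.so              # drop artifacts compiled in a different environment
python setup.py build develop  # recompile against the current CUDA and gcc
```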

2. nvcc is right there in /cuda/bin/, yet reported as not found

```
unable to execute ':/home/lyp/cuda-10.0/bin/nvcc': No such file or directory
error: command ':/home/lyp/cuda-10.0/bin/nvcc' failed with exit status 1
```

Fix:
Note the stray leading colon in ':/home/lyp/cuda-10.0/bin/nvcc': it suggests CUDA_HOME had been appended colon-style as if it were PATH, when it must hold a single directory. Edit ~/.bashrc:

```
(fairmot) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ vim ~/.bashrc
# change the CUDA_HOME line to:
export CUDA_HOME=/home/lyp/cuda-10.0
(fairmot) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ source ~/.bashrc
```
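For reference, a complete set of CUDA-related lines for ~/.bashrc might look like this (a sketch; the key point is that CUDA_HOME holds a single directory, while PATH and LD_LIBRARY_PATH are colon-appended):

```shell
export CUDA_HOME=/home/lyp/cuda-10.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```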

3. Command not found because /bin:/usr/bin is no longer in PATH

```
(base) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ source ~/.bashrc
Command 'dirname' is available in the following places
 * /bin/dirname
 * /usr/bin/dirname
The command could not be located because '/bin:/usr/bin' is not included in the PATH environment variable.
dirname: command not found
Command 'dirname' is available in the following places
 * /bin/dirname
 * /usr/bin/dirname
The command could not be located because '/bin:/usr/bin' is not included in the PATH environment variable.
dirname: command not found
```

Fix:
The mistake here was overwriting PATH while editing ~/.bashrc; PATH should only ever be appended to, and the variable that actually needed changing was CUDA_HOME.
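The difference between breaking and keeping PATH is easy to demonstrate (the CUDA path is just the one from this post; the point is appending versus overwriting):

```shell
# WRONG: overwriting PATH loses /bin and /usr/bin, so even dirname vanishes:
#   export PATH=/home/lyp/cuda-10.0/bin
# RIGHT: append, keeping everything that was already there:
export PATH="$PATH:/home/lyp/cuda-10.0/bin"
command -v dirname   # still found, e.g. /usr/bin/dirname
```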

4. error: 'THFloatBlas_gemv' was not declared in this scope

```
/home/lyp/FairMOT-master/DCNv2/src/cpu/dcn_v2_cpu.cpp:224:9: error: 'THFloatBlas_gemv' was not declared in this scope; did you mean 'THFloatBlas_axpy'?
  224 |         THFloatBlas_gemv('t', k_, m_, 1.0f,
      |         ^~~~~~~~~~~~~~~~
      |         THFloatBlas_axpy
error: command 'g++' failed with exit status 1
```

Fix:
(The original post's fix was only a screenshot, which is lost.) The error means DCNv2's CPU source is being built against a PyTorch version that no longer exposes the TH BLAS functions; commonly reported fixes are to build with an older PyTorch that still provides THFloatBlas_gemv, or to switch to a DCNv2 fork updated for newer PyTorch versions.

5. "Permission denied" when running make.sh (no root needed, just execute permission)

```
(fairmot) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ ./make.sh
-bash: ./make.sh: Permission denied
(fairmot) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ chmod +x ./make.sh
(fairmot) lyp@ubuntu-server:~/FairMOT-master/DCNv2$ ./make.sh
running build
running build_ext
building '_ext' extension
```

6. #error -- unsupported GNU version! gcc versions later than 7 are not supported!

```
/home/lyp/cuda-10.0/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
  129 | #error -- unsupported GNU version! gcc versions later than 7 are not supported!
      |  ^~~~~
error: command '/home/lyp/cuda-10.0/bin/nvcc' failed with exit status 1
```

At bottom the cause is the same: the gcc version is too new. Ubuntu 20.04 ships gcc 9.3, and CUDA 10.0 refuses anything newer than gcc 7, so gcc must be downgraded.
!!!!!! Do NOT downgrade Ubuntu's system gcc directly; changing the gcc version inside the anaconda virtual environment is all that is needed. A lesson paid for in blood and tears.
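Before rebuilding, it is worth checking which gcc the activated environment actually resolves (a diagnostic sketch; the `-ccbin` line shows nvcc's own flag for naming the host compiler explicitly, with an assumed env path):

```shell
which -a gcc      # the conda env's bin directory should be listed first
gcc -dumpversion  # must print 7 or lower for CUDA 10.0
# Alternative: tell nvcc explicitly which host compiler to use:
# nvcc -ccbin "$HOME/anaconda3/envs/fairmot/bin/gcc" ...
```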

7. gcc: error trying to exec 'cc1plus': execvp: No such file or directory

Cause:
The gcc and g++ versions in the environment do not match, so gcc cannot find its cc1plus backend.
Fix:

```shell
conda install https://anaconda.org/brown-data-science/gcc/5.4.0/download/linux-64/gcc-5.4.0-0.tar.bz2
```

If you still run into problems, the blog post "FairMOT构建DCNv2踩坑记录" may help.

Recently I ran into one more problem:

```
(redet) lyp@ubuntu-server:~$ bash compile.sh
/home/lyp/anaconda3/envs/redet/compiler_compat/ld: cannot find -lpthread
/home/lyp/anaconda3/envs/redet/compiler_compat/ld: cannot find -lc
collect2: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1
```
Fix: see https://www.cnblogs.com/zhangly2020/p/14213866.html
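The linked post's workaround, in short: conda ships its own compiler_compat/ld, which shadows the system linker but cannot see the system libraries. Temporarily moving it aside makes the build fall back to /usr/bin/ld (a sketch of the commonly reported fix; the env path matches this post):

```shell
cd ~/anaconda3/envs/redet/compiler_compat
mv ld ld.bak        # let the build fall back to the system /usr/bin/ld
cd ~ && bash compile.sh
# Optionally restore it afterwards:
# mv ~/anaconda3/envs/redet/compiler_compat/ld.bak \
#    ~/anaconda3/envs/redet/compiler_compat/ld
```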
