Error message:
import torch_sparse
returns
OSError: libcusparse.so.10: cannot open shared object file: No such file or directory
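Background
I installed PyTorch Geometric following the official documentation, but the import torch_sparse step failed.
I searched all over the web and still could not fix it, so here is a record of how I analyzed the problem.
Reference blog posts and issue threads:
水流兄: No module named torch_sparse, and installing pytorch-geometric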
Install error · Issue #43 · rusty1s/pytorch_geometric
Please help me with OSError: libcusparse.so.10: cannot open shared object file: No such file or directory · Issue #1125 · rusty1s/pytorch_geometric
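Most of these turn out to be version problems, and following the tutorial and the author's replies usually gets everything installed. I started installing the packages the same way: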
pip install torch-scatter
pip install torch-sparse # this is the package that fails for me; the other three dependencies import fine
pip install torch-cluster
pip install torch-spline-conv
pip install torch-geometric
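Then I started troubleshooting.
First I tried different versions (torch 1.4, 1.5, 1.7) with the matching cu101 builds, and even switched to another server. None of it helped. (I ruled out a version problem, because the other three packages, installed the same way, all import fine.)
Then I tried both the Anaconda base environment and a freshly created environment; both fail. Here is the error output: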
Python 3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 23:10:56)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch_sparse
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/__init__.py", line 12, in <module>
library, [osp.dirname(__file__)]).origin)
File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch/_ops.py", line 105, in load_library
ctypes.CDLL(path)
File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/ctypes/__init__.py", line 350, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by /home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/_spspmm.so)
So libcusparse.so.10 is where things go wrong.
Following the path in the message, I found that the file really does exist.
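Since the file exists, the exact wording of the error matters. "cannot open shared object file" means the loader found no file by that name, while "version `libcusparse.so.10' not found" is what glibc prints when it did find the file but the file does not define the symbol version the caller asks for. A minimal sketch for telling the two apart from the message text (classify_loader_error is a hypothetical helper, not part of torch or torch_sparse):

```python
import re


def classify_loader_error(message: str) -> str:
    """Roughly classify a dynamic-loader error message (hypothetical helper)."""
    if "cannot open shared object file" in message:
        # No file with that name was found on any search path.
        return "file-not-found"
    if re.search(r"version `[^']+' not found", message):
        # The file was found, but lacks the version definition that is required.
        return "version-mismatch"
    return "other"


# The two messages quoted in this post.
first = "libcusparse.so.10: cannot open shared object file: No such file or directory"
second = (
    "/usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found "
    "(required by /home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/"
    "torch_sparse/_spspmm.so)"
)

print(classify_loader_error(first))
print(classify_loader_error(second))
```

The second kind usually points at a library from a different build than the one the wheel was compiled against rather than at a missing path. That is an inference from the message format, not something verified on this machine.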
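I added every path I could, all following the author's instructions (DYLD_LIBRARY_PATH is only read by the macOS loader, so on this Linux machine only LD_LIBRARY_PATH matters):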
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
$ echo $LD_LIBRARY_PATH
>>> /usr/local/cuda/lib64:...
$ export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
$ echo $DYLD_LIBRARY_PATH
>>> /usr/local/cuda/lib:...
Could it be a symlink problem? No: every symlink that could be created was already in place (.so, .so.10, .so.10.0 and so on).
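To double-check both the search path and the symlinks in one go, a small script can list every directory on LD_LIBRARY_PATH that contains the library and show which real file each entry resolves to. This is only a sketch; find_library is a hypothetical helper, and it does not cover the other places the loader looks (rpath, /etc/ld.so.conf, default system directories).

```python
import os


def find_library(name: str, search_path: str) -> list:
    """Return (candidate, resolved_target) pairs for `name` in a colon-separated path."""
    hits = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):  # False for dangling symlinks
            hits.append((candidate, os.path.realpath(candidate)))
    return hits


# Print each match and the file it finally points to after following symlinks.
for candidate, target in find_library(
    "libcusparse.so.10", os.environ.get("LD_LIBRARY_PATH", "")
):
    print(candidate, "->", target)
```

If more than one directory shows up, the first one in the list is the one the loader would pick from LD_LIBRARY_PATH, so it is worth checking that it belongs to the CUDA version you expect.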
Then I went looking for answers inside my Python package environment, starting from:
/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/__init__.py
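Reading the source, this turns out to be the call torch uses to load a library (load_library in torch/_ops.py, which ends in ctypes.CDLL(path) according to the traceback above).
Since the error says libcusparse.so.10 cannot be found, I first tried loading that .so file directly (by copying the loading code out of torch_sparse/__init__.py and running it by hand).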
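A minimal sketch of that kind of direct test, using ctypes.CDLL (the same call that appears at the bottom of the traceback). try_load is a hypothetical helper; the path is the one from the error message and will differ on other machines.

```python
import ctypes


def try_load(path: str):
    """Try to dlopen `path`; return None on success, otherwise the loader's error text."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        return str(exc)
    return None


# Path taken from the error message above; adjust for your own CUDA install.
for lib in ("/usr/local/cuda/lib64/libcusparse.so.10",):
    print(lib, "->", try_load(lib) or "loaded OK")
```

Loading _spspmm.so itself this way only makes sense after import torch, because it also depends on libtorch.so and friends.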
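At this point the problem is pinned down: _spspmm.so fails when it indirectly loads libcusparse.so.10, and configuring $LD_LIBRARY_PATH does not fix it. Next I tried following this blog post:
关于程序运行时加载动态库失败的解决方法 (fixing dynamic library load failures at run time), sylalak123's blog on CSDN, blog.csdn.net
I added the directory to /etc/ld.so.conf, but the library still could not be found. So I prepared two fallback plans: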
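- Replace PyTorch Geometric with dgl (another graph neural network framework, developed by Amazon; it has problems of its own, for example it errors out in dp mode, see my previous bug diary)
- Post on WeChat Moments asking for help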
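With a pointer from 新权, I learned about patchelf, a tool that can force an rpath onto a .so file. So I downloaded it (GitHub link):
(A small aside: it may complain that some dependencies are missing, autoconf in my case. Honestly, I only got this far because I have sudo; without it I would have been finished long ago.)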
sudo apt-get install autoconf automake libtool
Then upload the downloaded zip to the server, unzip it, and enter the folder:
./bootstrap.sh
./configure
make
sudo make install
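(There is plenty of material on how Linux links dynamic libraries; here I mainly record the steps I took.) First, look at the load path of the root of all evil, _spspmm.so: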
readelf -d .../lib/python3.6/site-packages/torch_sparse/_spspmm.so | grep PATH
which gives:
0x000000000000000f (RPATH) Library rpath: [/home/travis/miniconda3/envs/test/lib]
Or run readelf -d _spspmm.so directly for more detail:
Dynamic section at offset 0x53b08 contains 38 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc10.so]
0x0000000000000001 (NEEDED) Shared library: [libtorch.so]
0x0000000000000001 (NEEDED) Shared library: [libtorch_cpu.so]
0x0000000000000001 (NEEDED) Shared library: [libtorch_python.so]
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.10.1]
0x0000000000000001 (NEEDED) Shared library: [libc10_cuda.so]
0x0000000000000001 (NEEDED) Shared library: [libtorch_cuda.so]
0x0000000000000001 (NEEDED) Shared library: [libcusparse.so.10]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
0x000000000000000f (RPATH) Library rpath: [/home/travis/miniconda3/envs/test/lib]
0x000000000000000c (INIT) 0x1e000
0x000000000000000d (FINI) 0x422dc
0x0000000000000019 (INIT_ARRAY) 0x54618
0x000000000000001b (INIT_ARRAYSZ) 40 (bytes)
0x000000000000001a (FINI_ARRAY) 0x54640
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x298
0x0000000000000005 (STRTAB) 0x8400
0x0000000000000006 (SYMTAB) 0x1cf8
0x000000000000000a (STRSZ) 60275 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x55000
0x0000000000000002 (PLTRELSZ) 17424 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0x18e60
0x0000000000000007 (RELA) 0x17990
0x0000000000000008 (RELASZ) 5328 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000006ffffffe (VERNEED) 0x17810
0x000000006fffffff (VERNEEDNUM) 7
0x000000006ffffff0 (VERSYM) 0x16f74
0x000000006ffffff9 (RELACOUNT) 26
0x0000000000000000 (NULL) 0x0
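Only a few of these entries matter for this problem: the NEEDED lines (which libraries the loader must find) and RPATH/RUNPATH (extra directories baked into the file). A minimal sketch that pulls just those out of readelf -d output; parse_dynamic_section is a hypothetical helper and the sample text is copied from the listing above.

```python
import re

# Matches e.g. "(NEEDED) Shared library: [libcusparse.so.10]"
ENTRY = re.compile(r"\((NEEDED|RPATH|RUNPATH)\)\s+.*?\[(.*)\]")


def parse_dynamic_section(text: str) -> dict:
    """Collect NEEDED / RPATH / RUNPATH values from `readelf -d` output."""
    result = {"NEEDED": [], "RPATH": [], "RUNPATH": []}
    for line in text.splitlines():
        match = ENTRY.search(line)
        if match:
            tag, value = match.groups()
            result[tag].append(value)
    return result


sample = """\
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.10.1]
0x0000000000000001 (NEEDED) Shared library: [libcusparse.so.10]
0x000000000000000f (RPATH) Library rpath: [/home/travis/miniconda3/envs/test/lib]
0x000000000000000c (INIT) 0x1e000
"""

info = parse_dynamic_section(sample)
print(info["NEEDED"])
print(info["RPATH"])
```

The RPATH here is a directory from the machine that built the wheel (/home/travis/...), which does not exist on my server, so it contributes nothing to the search.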
Then ldd _spspmm.so:
./_spspmm.so: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by ./_spspmm.so)
linux-vdso.so.1 => (0x00007ffdb0732000)
libc10.so => not found
libtorch.so => not found
libtorch_cpu.so => not found
libtorch_python.so => not found
libcudart.so.10.1 => not found
libc10_cuda.so => not found
libtorch_cuda.so => not found
libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007fcba70c1000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fcba6d3f000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcba6a36000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fcba6820000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcba6603000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcba6239000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcbaab29000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcba6031000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcba5e2d000)
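To read this output at a glance, it helps to split it into libraries that resolved and libraries that did not. A minimal sketch (parse_ldd is a hypothetical helper; the sample lines are copied from the output above):

```python
def parse_ldd(text: str):
    """Split `ldd` output into ({name: resolved_path}, [names that were not found])."""
    resolved, missing = {}, []
    for line in text.splitlines():
        if "=>" not in line:
            continue  # e.g. the loader itself, or the version error line
        name, _, target = line.partition("=>")
        name, target = name.strip(), target.strip()
        if target == "not found":
            missing.append(name)
        elif target.startswith("/"):
            # Drop the trailing load address, e.g. " (0x00007fcba70c1000)".
            resolved[name] = target.split(" (")[0]
    return resolved, missing


sample_ldd = """\
linux-vdso.so.1 => (0x00007ffdb0732000)
libc10.so => not found
libcudart.so.10.1 => not found
libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007fcba70c1000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcbaab29000)
"""

resolved, missing = parse_ldd(sample_ldd)
print("resolved:", resolved)
print("missing:", missing)
```

Note that libcusparse.so.10 is in the resolved group: ldd finds the file, and the complaint on the first line is about its version, not its location. The "not found" entries for the torch libraries are probably just a side effect of running ldd outside a Python process that has already imported torch, though I did not verify that.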
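Then use patchelf to change the path, following this post:
patchelf 的功能以及使用 patchelf 修改 rpath 以解决动态库问题 (what patchelf does, and using it to modify rpath to fix dynamic library problems), blog.csdn.net
patchelf --set-rpath /usr/local/cuda/lib64/ _spspmm.so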
Then check whether the change took effect:
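One way to do this check from a script is patchelf --print-rpath, wrapped here so that a missing tool or an unreadable file simply yields None. This is a sketch; current_rpath is a hypothetical helper, and the file name assumes you are inside the torch_sparse package directory.

```python
import shutil
import subprocess


def current_rpath(path: str):
    """Return the rpath patchelf reports for `path`, or None if it cannot be read."""
    if shutil.which("patchelf") is None:
        return None  # patchelf is not installed or not on PATH
    proc = subprocess.run(
        ["patchelf", "--print-rpath", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode != 0:
        return None  # e.g. the file does not exist or is not an ELF file
    return proc.stdout.strip()


print(current_rpath("_spspmm.so"))
```

Keep in mind that --set-rpath replaces the existing rpath rather than appending to it, so the value printed is the full list of directories now baked into the file.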
Running ldd again shows it still does not work:
(kg2text) lzk@admin:~/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse$ ldd _spspmm.so
./_spspmm.so: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by ./_spspmm.so)
linux-vdso.so.1 => (0x00007ffc105f7000)
libc10.so => not found
libtorch.so => not found
libtorch_cpu.so => not found
libtorch_python.so => not found
libcudart.so.10.1 => not found
libc10_cuda.so => not found
libtorch_cuda.so => not found
libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007f9169e8f000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9169b0d000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9169804000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f91695ee000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f91693d1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9169007000)
/lib64/ld-linux-x86-64.so.2 (0x00007f916d8f7000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f9168dff000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9168bfb000)
Then an even worse problem appeared: after the change, even torch could no longer be imported.
I had no choice but to uninstall torch_sparse and cut my losses.