Bug diary 2: PyTorch Geometric installation failure (unresolved)

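While trying to install PyTorch Geometric, the author hit an error on import torch_sparse. After checking versions, environments, and symlinks, the author found that _spspmm.so fails when it indirectly loads libcusparse.so.10. Using patchelf to change the library search path did not help and instead left torch itself unimportable. A fix is still being sought.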

Error message:

import torch_sparse

returns

OSError: libcusparse.so.10: cannot open shared object file: No such file or directory

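Background

I installed PyTorch Geometric following the official documentation, but the import torch_sparse step failed.

I searched everywhere online and still could not fix it, so I am recording my analysis below.

Reference blog posts and debugging links:

水流兄: No module named torch_sparse, and installing pytorch-geometric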

Install error · Issue #43 · rusty1s/pytorch_geometric

Please help me with OSError: libcusparse.so.10: cannot open shared object file: No such file or directory · Issue #1125 · rusty1s/pytorch_geometric

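Most of these reports turn out to be version problems, and following the tutorial and the maintainer's replies usually gets everything installed. I followed the tutorial as well: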

pip install torch-scatter
pip install torch-sparse  # this is the package that fails for me; the other three dependencies import fine
pip install torch-cluster
pip install torch-spline-conv
pip install torch-geometric

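Then I started troubleshooting.

First I tried several torch versions (1.4, 1.5, 1.7) with the matching cu101 builds, and also switched to another server. None of it helped. (I ruled out a version problem because the other three packages, installed the same way, all import fine.)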
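When ruling out a version mismatch, it can help to print what the installed torch build itself reports, since the extension wheels have to match both the torch version and the CUDA version torch was compiled against. A minimal sketch, assuming torch may or may not be importable in the current environment (the helper name describe_torch_build is made up for illustration):

```python
import importlib.util


def describe_torch_build():
    """Return the torch version and the CUDA version torch was built with.

    Returns None when torch is not installed in this environment.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the helper also works without torch

    return {
        "torch": torch.__version__,
        # CUDA toolkit version the torch binary was compiled against
        # (None for CPU-only builds); this is not the system CUDA version.
        "built_with_cuda": torch.version.cuda,
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

The value under built_with_cuda describes the torch binary, not the toolkit under /usr/local/cuda, so the two can differ.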

Next I tried both the Anaconda base environment and a freshly created environment, and both fail the same way. Here is the full error:

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch_sparse
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/__init__.py", line 12, in <module>
    library, [osp.dirname(__file__)]).origin)
  File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/home/lzk/anaconda3/envs/kg2text/lib/python3.6/ctypes/__init__.py", line 350, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by /home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/_spspmm.so)

This shows that libcusparse.so.10 is the problem:

a8ce3214050abd64dfa8a16cf335b167.png

Following the path, the file does exist:

661c9a01331f62c1610067e58e1258d1.png
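The two messages in this post are worded differently, and the difference may matter. "cannot open shared object file" normally means the loader found no file with that name on its search path, whereas "version `...' not found (required by ...)" normally means a file was found but does not define the version node the caller was linked against. That would be consistent with the file existing on disk. A small sketch that tells the two apart from the message text (the function name and category labels are made up for illustration):

```python
import re

# Matches the loader's "version `X' not found (required by Y)" wording.
_VERSION_RE = re.compile(r"version `([^']+)' not found \(required by ([^)]+)\)")


def classify_dlopen_error(message):
    """Classify a dynamic loader error message into a rough category.

    Returns (category, detail), where detail is the missing version
    name for "missing_version" and None otherwise.
    """
    if "cannot open shared object file" in message:
        return ("missing_file", None)
    match = _VERSION_RE.search(message)
    if match:
        return ("missing_version", match.group(1))
    return ("other", None)
```

If the category is missing_version, adding more search paths is unlikely to help, because the lookup already succeeded; one possible cause is that the file found comes from a different CUDA release than the one the wheel was built against.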

I added every path I could, all following the maintainer's instructions:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
$ echo $LD_LIBRARY_PATH
>>> /usr/local/cuda/lib64:...

$ export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
$ echo $DYLD_LIBRARY_PATH
>>> /usr/local/cuda/lib:...
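To double-check that an exported path really contains the library, the search path can be walked from Python. A minimal sketch (dirs_containing is a hypothetical helper, not part of any library). Note that DYLD_LIBRARY_PATH is read by the macOS loader, so on Linux only LD_LIBRARY_PATH is relevant, and it has to be set before the Python process starts:

```python
import os


def dirs_containing(libname, search_path):
    """Return the directories of a colon-separated search path that contain libname."""
    hits = []
    for directory in search_path.split(":"):
        # Skip empty entries produced by leading, trailing, or doubled colons.
        if directory and os.path.exists(os.path.join(directory, libname)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(dirs_containing("libcusparse.so.10", path))
```

This only confirms that a file with that name is present; it says nothing about whether the file is the build the extension expects.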

Could it be a symlink problem? No, because I created every symlink I could: .so, .so.10, .so.10.0, and so on.
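With several hand-made symlinks in play, it is worth seeing which real file each name ends up at, because a symlink only changes the name under which a file is found, not its contents. A sketch that follows a link chain step by step (symlink_chain is a hypothetical helper):

```python
import os


def symlink_chain(path):
    """Follow a symlink step by step and return every path visited, in order."""
    chain = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        if not os.path.isabs(target):
            # Relative targets are resolved against the link's own directory.
            target = os.path.join(os.path.dirname(path), target)
        path = os.path.normpath(target)
        if path in seen:  # guard against symlink loops
            break
        seen.add(path)
        chain.append(path)
    return chain


if __name__ == "__main__":
    for step in symlink_chain("/usr/local/cuda/lib64/libcusparse.so.10"):
        print(step)
```

If the last entry turns out to be, say, a .so.10.0 file, that may be a hint that the library belongs to a different CUDA 10.x release than the cu101 wheels expect; this is an assumption to verify, not something the output proves.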

Then I went looking for answers in my Python package environment, starting from:

/home/lzk/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse/__init__.py 

Reading the source, this turns out to be a torch call that loads a library:

65f2e63537fe363d82eeb2ebc70535c5.png

Since the error says libcusparse.so.10 cannot be found, I first tried loading that .so file directly (by copying the loading code from torch_sparse/__init__.py):

6611bfdc12e1fa9928e63294c1ea472d.png
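The traceback shows that torch's load_library ends in ctypes.CDLL(path), so a direct load can be reproduced without torch_sparse at all. A minimal sketch (try_load is a hypothetical helper; the path in the usage line is the one from the error message):

```python
import ctypes


def try_load(path):
    """Try to dlopen a shared library; return (ok, error_message)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # The loader's own message is preserved so it can be compared
        # with the one raised through torch.
        return (False, str(exc))
    return (True, "")


if __name__ == "__main__":
    print(try_load("/usr/local/cuda/lib64/libcusparse.so.10"))
```

Loading the CUDA library on its own and then loading _spspmm.so separates "the library cannot be loaded at all" from "the library loads, but not in the form _spspmm.so requires".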

This pins the problem down: _spspmm.so fails when it indirectly loads libcusparse.so.10, and setting $LD_LIBRARY_PATH does not fix it. I then tried following this blog post:

关于程序运行时加载动态库失败的解决方法 (Fixing dynamic library load failures at program runtime), sylalak123's blog on CSDN, blog.csdn.net

I added the directory to /etc/ld.so.conf, but the library still could not be found. So I prepared two fallbacks:

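  • Replace PyTorch Geometric with dgl (another graph neural network framework, developed by Amazon; it has its own problems, such as errors in dp mode, see my previous bug diary)
  • Post on WeChat Moments asking for help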

On a tip from 新权, I learned about patchelf, a tool that can force an rpath onto a .so file. So I downloaded it (from GitHub):

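(A small detour: it may complain about missing dependencies, autoconf in my case. Honestly, I only survive this kind of thing because I have sudo; otherwise I would have been stuck long ago.)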

sudo apt-get install autoconf automake libtool

Then copy the downloaded zip to the server, unzip it, and enter the directory:

./bootstrap.sh
./configure
make
sudo make install

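(There is plenty of material on how Linux links dynamic libraries, so this only records the steps I took.) First, look at the load paths of _spspmm.so, the root of all this trouble: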

readelf -d .../lib/python3.6/site-packages/torch_sparse/_spspmm.so | grep PATH

which gives:

0x000000000000000f (RPATH)   Library rpath: [/home/travis/miniconda3/envs/test/lib]

Or run readelf -d _spspmm.so directly for more detail:

Dynamic section at offset 0x53b08 contains 38 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libc10.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cpu.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_python.so]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.10.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc10_cuda.so]
 0x0000000000000001 (NEEDED)             Shared library: [libtorch_cuda.so]
 0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.10]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000f (RPATH)              Library rpath: [/home/travis/miniconda3/envs/test/lib]
 0x000000000000000c (INIT)               0x1e000
 0x000000000000000d (FINI)               0x422dc
 0x0000000000000019 (INIT_ARRAY)         0x54618
 0x000000000000001b (INIT_ARRAYSZ)       40 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x54640
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x8400
 0x0000000000000006 (SYMTAB)             0x1cf8
 0x000000000000000a (STRSZ)              60275 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x55000
 0x0000000000000002 (PLTRELSZ)           17424 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x18e60
 0x0000000000000007 (RELA)               0x17990
 0x0000000000000008 (RELASZ)             5328 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x17810
 0x000000006fffffff (VERNEEDNUM)         7
 0x000000006ffffff0 (VERSYM)             0x16f74
 0x000000006ffffff9 (RELACOUNT)          26
 0x0000000000000000 (NULL)               0x0
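Only two kinds of entries in this dump matter for the loading problem: the NEEDED libraries and the RPATH. A sketch that pulls them out of readelf -d text (parse_dynamic_section is a hypothetical helper; it works on output that has already been captured, so it does not call readelf itself):

```python
import re

# Matches lines such as:
#   0x...01 (NEEDED)   Shared library: [libcusparse.so.10]
#   0x...0f (RPATH)    Library rpath: [/some/dir:/other/dir]
_ENTRY_RE = re.compile(r"\((NEEDED|RPATH|RUNPATH)\)\s+.*?\[(.*?)\]")


def parse_dynamic_section(readelf_output):
    """Extract NEEDED libraries and RPATH/RUNPATH directories from `readelf -d` text."""
    result = {"NEEDED": [], "RPATH": [], "RUNPATH": []}
    for line in readelf_output.splitlines():
        match = _ENTRY_RE.search(line)
        if not match:
            continue
        tag, value = match.groups()
        if tag == "NEEDED":
            result["NEEDED"].append(value)
        else:
            # rpath-style entries are colon-separated directory lists
            result[tag].extend(value.split(":"))
    return result
```

In the dump above, the RPATH is /home/travis/miniconda3/envs/test/lib, which looks like a directory from the machine that built the wheel and is unlikely to exist locally, so the libraries are presumably being found through other means.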

Then run ldd _spspmm.so:

./_spspmm.so: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by ./_spspmm.so)
linux-vdso.so.1 =>  (0x00007ffdb0732000)
libc10.so => not found
libtorch.so => not found
libtorch_cpu.so => not found
libtorch_python.so => not found
libcudart.so.10.1 => not found
libc10_cuda.so => not found
libtorch_cuda.so => not found
libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007fcba70c1000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fcba6d3f000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcba6a36000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fcba6820000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcba6603000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcba6239000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcbaab29000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcba6031000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcba5e2d000)
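The "not found" lines are easy to lose in this output. A sketch that lists them from captured ldd text (unresolved_libraries is a hypothetical helper):

```python
def unresolved_libraries(ldd_output):
    """Return the library names that ldd reports as "not found"."""
    names = []
    for line in ldd_output.splitlines():
        left, arrow, right = line.partition("=>")
        if arrow and right.strip() == "not found":
            names.append(left.strip())
    return names
```

The libtorch and libc10 entries are probably expected when ldd is run on the file in isolation, assuming those libraries ship inside the torch package directory and are already loaded by the time torch loads the extension. The line to focus on is the first one, where libcusparse.so.10 is resolved to a file but the required version is reported missing.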

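Then use patchelf to change the path:

patchelf 的功能以及使用 patchelf 修改 rpath 以解决动态库问题 (What patchelf does and how to use it to modify rpath to fix dynamic library problems), blog.csdn.net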
patchelf --set-rpath /usr/local/cuda/lib64/  _spspmm.so
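patchelf --set-rpath replaces the existing rpath rather than appending to it, and it edits the file in place. A more cautious variant, sketched here under the assumption that one wants to keep the old entries and a copy of the original file (all helper names are hypothetical; the only patchelf flag used is --set-rpath):

```python
import shutil


def merged_rpath(existing, extra):
    """Append the directories in extra to existing, skipping duplicates."""
    entries = [e for e in existing.split(":") if e]
    for directory in extra.split(":"):
        if directory and directory not in entries:
            entries.append(directory)
    return ":".join(entries)


def backup(so_path):
    """Copy the library next to itself with a .bak suffix before patching."""
    return shutil.copy2(so_path, so_path + ".bak")


def patch_command(so_path, existing, extra):
    """Build (but do not run) the patchelf command line as an argument list."""
    return ["patchelf", "--set-rpath", merged_rpath(existing, extra), so_path]


if __name__ == "__main__":
    old = "/home/travis/miniconda3/envs/test/lib"
    print(patch_command("_spspmm.so", old, "/usr/local/cuda/lib64/"))
```

The existing value can be read with patchelf --print-rpath _spspmm.so. Keeping the .bak copy makes it possible to restore the original file if the patched one behaves worse.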

Then check whether the change took effect:

4a4fc7858c62629c061ae0ebbc994299.png

Checking again, it still does not work:

(kg2text) lzk@admin:~/anaconda3/envs/kg2text/lib/python3.6/site-packages/torch_sparse$ ldd _spspmm.so
./_spspmm.so: /usr/local/cuda/lib64/libcusparse.so.10: version `libcusparse.so.10' not found (required by ./_spspmm.so)
	linux-vdso.so.1 =>  (0x00007ffc105f7000)
	libc10.so => not found
	libtorch.so => not found
	libtorch_cpu.so => not found
	libtorch_python.so => not found
	libcudart.so.10.1 => not found
	libc10_cuda.so => not found
	libtorch_cuda.so => not found
	libcusparse.so.10 => /usr/local/cuda/lib64/libcusparse.so.10 (0x00007f9169e8f000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9169b0d000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9169804000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f91695ee000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f91693d1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9169007000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f916d8f7000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f9168dff000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9168bfb000)

Then something worse happened: after the change, even torch could no longer be imported:

94c105195995783aa336b8ac571245e9.png

I had no choice but to uninstall torch_sparse to protect the rest of the environment.

86ab091cc1de0942156f0a935c9c5b56.png