fused_adam.so: cannot open shared object file: No such file or directory

I recently started using the distributed training framework DeepSpeed. After installing it, training failed with the following error:

  File "**/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "**/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "**/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "**/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "**/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
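
The ImportError points at a file inside the torch JIT extension cache, so a quick first check is whether the compiled library is really absent from its build directory. The following is only a minimal sketch: find_missing_libraries is a hypothetical helper name (not part of DeepSpeed or torch), and the <op>/<op>.so layout is an assumption inferred from the path in the error message.

```python
from pathlib import Path


def find_missing_libraries(build_root):
    """Return names of op build directories that lack `<op>/<op>.so`.

    `build_root` is a directory such as
    /root/.cache/torch_extensions/py38_cu117 (taken from the error message).
    The `<op>/<op>.so` layout is an assumption based on that path.
    """
    root = Path(build_root)
    if not root.is_dir():
        # Nothing has been built (or the path is wrong): nothing to report.
        return []
    missing = []
    for op_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (op_dir / f"{op_dir.name}.so").is_file():
            missing.append(op_dir.name)
    return missing


# Example call, using the path from the traceback above:
# find_missing_libraries("/root/.cache/torch_extensions/py38_cu117")
```

If fused_adam shows up in the returned list, the build directory exists but the JIT compilation never produced the library, which matches the error above.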

The environment was: CentOS 7, kernel 3.10.0-1160.92.1.el7.x86_64, Python 3.8, CUDA version supported by the GPU driver 11.2, torch 2.0.1+cu117, nvcc 11.2, DeepSpeed 0.13.4.
The output of ds_report was:

[2024-03-06 17:33:52,658] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.4, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
shared memory (/dev/shm) size .... 125.87 GB
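
The op table in a report like the one above can also be read programmatically, for example to list the ops that are compatible but not pre-installed and therefore depend on JIT compilation at runtime. This is a minimal sketch that assumes the line format shown in this post (other DeepSpeed versions may print it differently); parse_op_report is a hypothetical helper, not a DeepSpeed API.

```python
import re

# Matches lines such as "fused_adam ............. [NO] ....... [OKAY]".
OP_LINE = re.compile(r"^(\w+) \.+ \[(YES|NO)\] \.+ \[(OKAY|NO)\]$")


def parse_op_report(report_text):
    """Map op name -> (installed, compatible) from the ds_report op table."""
    ops = {}
    for line in report_text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


# A few lines copied from the report above.
sample = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
"""

ops = parse_op_report(sample)
# Ops that are compatible but not installed rely on JIT compilation.
jit_only = [name for name, (installed, ok) in ops.items() if ok and not installed]
print(jit_only)  # ['fused_adam', 'stochastic_transformer']
```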
 

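The line fused_adam ............. [NO] ....... [OKAY] shows that fused_adam is not installed. Note also that torch was built against CUDA 11.7 while nvcc is 11.2.

I then switched to another machine: Ubuntu 20.04, Python 3.10, CUDA version supported by the GPU driver 12.2, nvcc 11.8, torch 2.1.2+cu121, DeepSpeed 0.13.5.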

Running DeepSpeed on this machine produced the same error.

The output of ds_report on this machine was:

[2024-03-06 17:47:43,578] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['**/site-packages/torch']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['**/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
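shared memory (/dev/shm) size .... 125.76 GB

Here the torch CUDA version is 12.1 but nvcc is 11.8. Following reference 2, I cloned the DeepSpeed repository from GitHub, changed into the DeepSpeed directory, and ran DS_BUILD_FUSED_ADAM=1 pip3 install . which failed with the following error: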

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "**/DeepSpeed/setup.py", line 196, in <module>
          ext_modules.append(builder.builder())
        File "**/DeepSpeed/op_builder/builder.py", line 633, in builder
          assert_no_cuda_mismatch(self.name)
        File "**/DeepSpeed/op_builder/builder.py", line 101, in assert_no_cuda_mismatch
          raise CUDAMismatchException(
      op_builder.builder.CUDAMismatchException: >- DeepSpeed Op Builder: Installed CUDA version 11.8 does not match the version torch was compiled with 12.1, unable to compile cuda/cpp extensions without a matching cuda version.
      DS_BUILD_OPS=0
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  async_io: please install the libaio-dev package with apt
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
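
The error says that the installed CUDA 11.8 does not match the CUDA 12.1 that torch was compiled with. I therefore upgraded CUDA to 12.2, went back into the DeepSpeed directory, and ran DS_BUILD_FUSED_ADAM=1 pip3 install . again. After this install, ds_report showed fused_adam as installed: fused_adam ............. [YES] ...... [OKAY], and training no longer raised the error above.

Note that DS_BUILD_FUSED_ADAM=1 pip3 install deepspeed did not work for me: fused_adam was not installed that way.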

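Conclusion: the CUDA version torch was compiled with should match the nvcc version. In the tests above, nvcc 11.8 was rejected against a torch built for CUDA 12.1, while nvcc 12.2 worked with the same torch, so at minimum the major versions need to agree.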
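
The comparison behind this conclusion can be sketched in a few lines. The helpers below are hypothetical (they are not DeepSpeed's own check), the sample nvcc line is only illustrative, and the labels are purely descriptive: which mismatches a given DeepSpeed version tolerates is not established here.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_cuda_version(version):
    """Turn a string such as '12.1' (e.g. torch.version.cuda) into (12, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def compare_cuda(torch_cuda, nvcc_cuda):
    """Classify the relation between two (major, minor) CUDA versions."""
    if torch_cuda == nvcc_cuda:
        return "exact match"
    if torch_cuda[0] == nvcc_cuda[0]:
        return "minor mismatch"
    return "major mismatch"


# Illustrative line in the format printed by `nvcc --version`.
nvcc_line = "Cuda compilation tools, release 11.8, V11.8.89"

# The second machine before the upgrade: torch built for 12.1, nvcc 11.8.
print(compare_cuda(parse_cuda_version("12.1"), parse_nvcc_release(nvcc_line)))
# The same machine after upgrading CUDA to 12.2.
print(compare_cuda(parse_cuda_version("12.1"), parse_cuda_version("12.2")))
```

With the values from this post, the first call prints "major mismatch" (the combination that raised CUDAMismatchException) and the second prints "minor mismatch" (the combination that built successfully). In a real environment the first argument would come from torch.version.cuda and the second from the output of nvcc --version.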

References:

1  fused_adam.so: cannot open shared object file: No such file or directory, troubleshooting and solution (CSDN blog)

2  fused_adam.so: cannot open shared object file: No such file or directory · Issue #119 · databrickslabs/dolly · GitHub

3  [Engineering practice] Fixing "nvcc: command not found" (CSDN blog)

4  https://github.com/stanford-crfm/mistral/issues/196
