paddle-gpu Installation Pitfalls Guide

Problem 1

File "/usr/local/lib/python3.6/dist-packages/cv2/__init__.py", line 8, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Solution

# Dependency for opencv-python (cv2). Without it, `import cv2` raises ImportError: libGL.so.1: cannot open shared object file: No such file or directory
# Solution from https://askubuntu.com/a/1015744
# Update and install in the same layer so the package index is never stale.
RUN apt-get update && apt-get install -y libgl1-mesa-glx
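
Several errors in this guide (Problems 1, 5 and 8) share the same "cannot open shared object file" pattern. As a minimal sketch, the missing library name can be pulled out of such a message before deciding which package or path to fix. The function name missing_shared_library is hypothetical and chosen for illustration only.

```python
import re

# Matches the loader's standard complaint and captures the library file name,
# e.g. "libGL.so.1" or "libcudart.so.10.2".
_MISSING_SO = re.compile(
    r"([\w.+-]+\.so(?:\.\d+)*): cannot open shared object file"
)


def missing_shared_library(error_message):
    """Return the shared library named in a loader error, or None."""
    match = _MISSING_SO.search(error_message)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        import cv2  # noqa: F401
    except ImportError as exc:
        print("missing library:", missing_shared_library(str(exc)))
```

For the traceback above, the extracted name is libGL.so.1, which is the file that libgl1-mesa-glx provides.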

Problem 2

Traceback (most recent call last):
  File "/home/jovyan/.conda/envs/paddle/bin/paddleocr", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddleocr/paddleocr.py", line 673, in main
    result = engine.ocr(img_path,
  File "/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddleocr/paddleocr.py", line 555, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls)
  File "/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddleocr/tools/infer/predict_system.py", line 71, in __call__
    dt_boxes, elapse = self.text_detector(img)
  File "/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddleocr/tools/infer/predict_det.py", line 244, in __call__
    self.input_tensor.copy_from_cpu(img)
  File "/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddle/fluid/inference/wrapper.py", line 38, in tensor_copy_from_cpu
    self.copy_from_cpu_bind(data)
RuntimeError: (PreconditionNotMet) Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion.
  [Hint: cudnn_dso_handle should not be null.] (at /paddle/paddle/phi/backends/dynload/cudnn.cc:60)

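Solution
Step 1: Run ls /usr/lib | grep lib in a terminal. On the broken system, libcudnn.so and libcublas.so are missing from this shared library directory. (The listing below shows both files, so it appears to have been captured after the links in step 3 were created.)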

(base) jovyan@wangzy-p2-0:/usr/lib$ ls /usr/lib |grep lib
libcublas.so
libcudnn.so

Step 2: Find where libcudnn.so and libcublas.so actually live. The locate command comes from the mlocate package (apt-get install mlocate).

(base) jovyan@wangzy-p2-0:/usr/lib$ locate libcublas
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublas.so
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublas.so.10
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublas.so.10.2.2.89
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublasLt.so
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublasLt.so.10
/opt/conda/pkgs/cudatoolkit-10.2.89-h713d32c_10/lib/libcublasLt.so.10.2.2.89
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublas.so
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublas.so.11
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublas.so.11.10.1.25
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublasLt.so
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublasLt.so.11
/opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublasLt.so.11.10.1.25

(base) jovyan@wangzy-p2-0:/usr/lib$ locate libcudnn.so
/opt/conda/pkgs/cudnn-7.6.5-cuda10.2_0/lib/libcudnn.so
/opt/conda/pkgs/cudnn-7.6.5-cuda10.2_0/lib/libcudnn.so.7
/opt/conda/pkgs/cudnn-7.6.5-cuda10.2_0/lib/libcudnn.so.7.6.5
/opt/conda/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so
/opt/conda/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8
/opt/conda/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8.4.1

Step 3: Create libcudnn.so and libcublas.so as symbolic links in the shared library directory

cd /usr/lib
sudo ln -s /opt/conda/pkgs/cudnn-8.4.1.50-hed8a83a_0/lib/libcudnn.so.8.4.1 libcudnn.so
sudo ln -s /opt/conda/pkgs/cudatoolkit-11.7.0-hd8887f6_11/lib/libcublas.so.11.10.1.25 libcublas.so
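
When locate returns several versions, the link target has to be picked by hand. The sketch below shows one way to choose the highest-versioned file for a given library name from a list of paths. The helper names version_key and newest_library are hypothetical. Picking the newest file is also an assumption: the right choice is whichever version matches the CUDA build of the installed paddle.

```python
import re


def version_key(path):
    """Turn a trailing '.so.11.10.1.25' into (11, 10, 1, 25) for comparison."""
    match = re.search(r"\.so\.(\d+(?:\.\d+)*)$", path)
    return tuple(int(part) for part in match.group(1).split(".")) if match else ()


def newest_library(paths, soname):
    """Return the highest-versioned path whose file name starts with soname + '.'.

    The prefix test is deliberately strict: for soname 'libcublas.so' it
    skips 'libcublasLt.so.*', which is a different library.
    """
    candidates = [
        p for p in paths if p.rsplit("/", 1)[-1].startswith(soname + ".")
    ]
    return max(candidates, key=version_key) if candidates else None
```

Fed the locate output from step 2, this selects libcublas.so.11.10.1.25 and libcudnn.so.8.4.1.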

Problem 3

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1676881286 (unix time) try "date -d @1676881286" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 19715 (TID 0x7f0da97c4740) from PID 0 ***]

Solution

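  • Terminal: add the conda virtual environment's lib directory to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/jovyan/.conda/envs/paddle/lib:$LD_LIBRARY_PATH

Here paddle is the name of the virtual environment; replace it with the name of your own environment.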

  • Jupyter: set the environment variable
%env LD_LIBRARY_PATH=/home/jovyan/.conda/envs/paddle/lib
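
The dynamic loader normally reads LD_LIBRARY_PATH when a process starts, so one option is to build the environment first and then launch the program with it. The following is a minimal sketch under that assumption. The helper name env_with_lib_dir is hypothetical, and the script name in the usage comment is only a placeholder.

```python
import os


def env_with_lib_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH.

    Existing entries are kept after lib_dir; no trailing ':' is added when
    the variable was previously unset or empty.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{current}" if current else lib_dir
    return env


# Usage sketch (placeholder script name):
#   import subprocess
#   subprocess.run(
#       ["python", "your_script.py"],
#       env=env_with_lib_dir("/home/jovyan/.conda/envs/paddle/lib"),
#   )
```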

Problem 4

  File "/home/jovyan/vol-1/github/PaddleOCR/test.py", line 8, in <module>
    result = ocr.ocr(img_path, cls=True)
  File "/home/jovyan/vol-1/github/PaddleOCR/paddleocr.py", line 523, in ocr
    img = check_img(img)
  File "/home/jovyan/vol-1/github/PaddleOCR/paddleocr.py", line 431, in check_img
    img, flag_gif, flag_pdf = check_and_read(image_file)
  File "/home/jovyan/vol-1/github/PaddleOCR/ppocr/utils/utility.py", line 93, in check_and_read
    for pg in range(0, pdf.pageCount):
AttributeError: 'Document' object has no attribute 'pageCount'

Solution

pip install pymupdf==1.18.14 -i https://pypi.tuna.tsinghua.edu.cn/simple
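
The error arises because newer PyMuPDF releases expose the page count as page_count rather than the older camelCase pageCount. If pinning the package is not an option, a small compatibility helper is an alternative. This is a sketch, not PaddleOCR's own fix, and the helper name get_page_count is hypothetical.

```python
def get_page_count(doc):
    """Return the number of pages of a PyMuPDF-style document object.

    Tries the current snake_case attribute first, then the legacy
    camelCase one, so the same code runs against old and new releases.
    """
    for name in ("page_count", "pageCount"):
        if hasattr(doc, name):
            return getattr(doc, name)
    raise AttributeError("document exposes neither page_count nor pageCount")


# Usage sketch inside the failing loop:
#   for pg in range(0, get_page_count(pdf)):
#       ...
```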

Problem 5

ImportError: libcudart.so.10.2: cannot open shared object file: No such file or directory

Solution

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/conda/envs/paddle/lib
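
Before and after changing LD_LIBRARY_PATH, it can help to confirm which listed directory, if any, actually contains the file named in the error. The sketch below does that lookup. The helper name dirs_containing is hypothetical, and it only inspects the directories passed in, not the system default search paths.

```python
import os


def dirs_containing(lib_name, ld_library_path):
    """List the directories in an LD_LIBRARY_PATH-style string that hold lib_name."""
    return [
        directory
        for directory in ld_library_path.split(":")
        if directory and os.path.isfile(os.path.join(directory, lib_name))
    ]


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    print(dirs_containing("libcudart.so.10.2", search_path))
```

An empty list means none of the configured directories provides libcudart.so.10.2, so the export above still needs to point at the right lib directory.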

Problem 6

Exception has occurred: OSError
In user code:

    File "tools/export_model.py", line 172, in <module>
      main()
    File "tools/export_model.py", line 165, in main
      sub_model_save_path, logger)
    File "tools/export_model.py", line 99, in export_single_model
      paddle.jit.save(model, save_path)
    File "<decorator-gen-101>", line 2, in save
      
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/jit.py", line 744, in save
      inner_input_spec)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 517, in concrete_program_specify_input_spec
      *desired_input_spec)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 427, in get_concrete_program
      concrete_program, partial_program_layer = self._program_cache[cache_key]
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 723, in __getitem__
      self._caches[item] = self._build_once(item)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 714, in _build_once
      **cache_key.kwargs)
    File "<decorator-gen-99>", line 2, in from_func_spec
      
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 662, in from_func_spec
      outputs = static_func(*inputs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 79, in forward
      x = self.backbone(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 146, in forward
      x = self.conv(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 179, in forward
      x = self.conv(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 677, in forward
      use_cudnn=self._use_cudnn)
    File "/usr/local/lib/python3.7/site-packages/paddle/nn/functional/conv.py", line 148, in _conv_nd
      type=op_type, inputs=inputs, outputs=outputs, attrs=attrs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3184, in append_op
      attrs=kwargs.get("attrs", None))
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2224, in __init__
      for frame in traceback.extract_stack():

    ExternalError: CUDNN error(4), CUDNN_STATUS_INTERNAL_ERROR. 
      [Hint: 'CUDNN_STATUS_INTERNAL_ERROR'.  An internal cuDNN operation failed.  ] (at ../paddle/phi/backends/gpu/gpu_resources.cc:285)
      [operator < conv2d_fusion > error]
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/tools/infer/predict_det.py", line 243, in __call__
    self.predictor.run()
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/tools/infer/predict_system.py", line 76, in __call__
    dt_boxes, elapse = self.text_detector(img)
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/paddleocr.py", line 556, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls)
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/ppocr/data/imaug/label_ops.py", line 1148, in _load_ocr_info
    ocr_result = self.ocr_engine.ocr(data['image'], cls=False)[0]
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/ppocr/data/imaug/label_ops.py", line 1016, in __call__
    ocr_info = self._load_ocr_info(data)
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/ppocr/data/imaug/__init__.py", line 56, in transform
    data = op(data)
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/tools/infer_kie_token_ser.py", line 106, in __call__
    batch = transform(data, self.ops)
  File "/home/jovyan/vol-1/github/PaddleOCR2/PaddleOCR/tools/infer_kie_token_ser.py", line 149, in <module>
    result, _ = ser_engine(data)

Solution

[root@pkm-05 ~]# nvidia-smi 
Fri Aug  4 19:12:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   53C    P0    26W /  70W |   2139MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   48C    P8    15W /  70W |     46MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   48C    P8    15W /  70W |     12MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   51C    P8    16W /  70W |      4MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

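Check the GPU memory usage and kill any leftover zombie processes that are still holding memory. For details, see https://blog.csdn.net/qq_39698985/article/details/130111562
To view the details of a user process: ps aux | grep <process name>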
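
To spot a busy GPU quickly, the memory column of the nvidia-smi table can be parsed. This is a rough sketch that assumes the "usedMiB / totalMiB" layout shown above, which may differ between driver versions. The function name gpu_memory_usage is hypothetical.

```python
import re

# Matches the "2139MiB / 15360MiB" cell of each GPU row.
_MEMORY_CELL = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def gpu_memory_usage(nvidia_smi_text):
    """Return [(used_mib, total_mib), ...] in the order GPUs appear in the table."""
    return [
        (int(used), int(total))
        for used, total in _MEMORY_CELL.findall(nvidia_smi_text)
    ]


if __name__ == "__main__":
    import subprocess

    table = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout
    for index, (used, total) in enumerate(gpu_memory_usage(table)):
        print(f"GPU {index}: {used} / {total} MiB")
```

For scripting, nvidia-smi --query-gpu=memory.used,memory.total --format=csv gives the same numbers in a form that is likely more stable than the table.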

Problem 7

AttributeError: module 'paddle' has no attribute 'utils'

Solution
Reinstall paddle

pip uninstall paddlepaddle 
python -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple

Problem 8

1. Problem description:

WARNING:root:PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests 
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
WARNING:root:
 Original Error is: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:305)

2. Check the OS version

(paddle) jovyan@wangzy-p3-0:~/vol-1/soft$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

3. Download the installer package
Download URL: https://developer.nvidia.com/nccl/nccl-legacy-downloads
4. Install the package

(paddle) jovyan@wangzy-p3-0:~/vol-1/soft$ sudo dpkg -i nccl-local-repo-ubuntu2004-2.14.3-cuda11.7_1.0-1_amd64.deb 
Selecting previously unselected package nccl-local-repo-ubuntu2004-2.14.3-cuda11.7.
(Reading database ... 25987 files and directories currently installed.)
Preparing to unpack nccl-local-repo-ubuntu2004-2.14.3-cuda11.7_1.0-1_amd64.deb ...
Unpacking nccl-local-repo-ubuntu2004-2.14.3-cuda11.7 (1.0-1) ...
Setting up nccl-local-repo-ubuntu2004-2.14.3-cuda11.7 (1.0-1) ...

The public nccl-local-repo-ubuntu2004-2.14.3-cuda11.7 GPG key does not appear to be installed.
To install the key, run this command:
sudo cp /var/nccl-local-repo-ubuntu2004-2.14.3-cuda11.7/nccl-local-44000BE4-keyring.gpg /usr/share/keyrings/
sudo cp /var/nccl-local-repo-ubuntu2004-2.14.3-cuda11.7/nccl-local-44000BE4-keyring.gpg /usr/share/keyrings/
sudo apt install libnccl2 libnccl-dev

5. Verify the installation

(paddle) jovyan@wangzy-p3-0:~/vol-1/soft$ python
Python 3.8.13 (default, Oct 21 2022, 23:50:54) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
[4pdvGPU Msg(31110:140474696726336:libvgpu.c:805)]: Initializing...
[4pdvGPU Msg(31110:140474696726336:context.c:120)]: vdevices_pci=0000:5e:00.0
[4pdvGPU Msg(31110:140474696726336:context.c:120)]: vdevices_pci=0000:5f:00.0
[4pdvGPU Msg(31110:140474696726336:context.c:120)]: vdevices_pci=0000:86:00.0
[4pdvGPU Msg(31110:140474696726336:context.c:120)]: vdevices_pci=0000:d8:00.0
[4pdvGPU Msg(31110:140474696726336:libvgpu.c:823)]: Initialized
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ... 
W0928 06:53:31.457798 31110 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.7, Runtime API Version: 11.7
W0928 06:53:31.472273 31110 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
PaddlePaddle works well on 1 GPU.
/home/jovyan/.conda/envs/paddle/lib/python3.8/site-packages/paddle/fluid/executor.py:1583: UserWarning: Standalone executor is not used for data parallel
  warnings.warn(
W0928 06:53:37.109949 31110 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 4 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
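
Independently of paddle.utils.run_check(), the loader can be asked directly whether libnccl.so is now resolvable. The snippet below is a minimal sketch using ctypes. The helper name can_load is hypothetical, and a successful load only shows that the file is found, not that its version matches the installed paddle build.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name, else False."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    for name in ("libnccl.so", "libnccl.so.2"):
        print(name, "loadable" if can_load(name) else "NOT found")
```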

