Setting Up and Deploying a Deep Learning Environment (Deep Learning / Neural Networks)

Working environment

OS: Ubuntu 16.04.5 LTS
Graphics card: NVIDIA GPU
NVIDIA driver: 410.93
CUDA: 10.0
Python: 3.x

 

For installing CUDA and the NVIDIA driver, see https://www.cnblogs.com/orzs/p/10951473.html

Software to deploy

conda environment
NCCL 2
Open MPI
Horovod

 

 

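1. Set up conda

Official download page: https://www.anaconda.com/distribution/#download-section

Download the installer that matches your system, then run it.

cd init
sudo wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
bash Anaconda3-2019.03-Linux-x86_64.sh

Follow the prompts and choose an installation directory; the default is ~/anaconda3/.

Note: initialization

1. If you decline initialization (the default answer), the conda command is not available after installation and conda has to be initialized by hand, as the installer output explains.

Note: to avoid leaking the username, it has been replaced with $USER throughout.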

installation finished.
Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
[no] >>>

You have chosen to not have conda modify your shell scripts at all.
To activate conda's base environment in your current shell session:

eval "$(/home/$USER/anaconda3/bin/conda shell.YOUR_SHELL_NAME hook)"

To install conda's shell functions for easier access, first activate, then:

conda init

If you'd prefer that conda's base environment not be activated on startup,
set the auto_activate_base parameter to false:

conda config --set auto_activate_base false

Thank you for installing Anaconda3!

===========================================================================

Anaconda and JetBrains are working together to bring you Anaconda-powered
environments tightly integrated in the PyCharm IDE.

PyCharm for Anaconda is available at:
https://www.anaconda.com/pycharm

 

2. If you choose to initialize, the installer modifies ~/.bashrc and sets up the conda command.

installation finished.
Do you wish the installer to initialize Anaconda3
by running conda init? [yes|no]
[no] >>> yes
WARNING: The conda.compat module is deprecated and will be removed in a future release.
no change     /home/$USER/anaconda3/condabin/conda
no change     /home/$USER/anaconda3/bin/conda
no change     /home/$USER/anaconda3/bin/conda-env
no change     /home/$USER/anaconda3/bin/activate
no change     /home/$USER/anaconda3/bin/deactivate
no change     /home/$USER/anaconda3/etc/profile.d/conda.sh
no change     /home/$USER/anaconda3/etc/fish/conf.d/conda.fish
no change     /home/$USER/anaconda3/shell/condabin/Conda.psm1
no change     /home/$USER/anaconda3/shell/condabin/conda-hook.ps1
no change     /home/$USER/anaconda3/lib/python3.7/site-packages/xonsh/conda.xsh
no change     /home/$USER/anaconda3/etc/profile.d/conda.csh
modified      /home/$USER/.bashrc

==> For changes to take effect, close and re-open your current shell. <==

If you'd prefer that conda's base environment not be activated on startup,
set the auto_activate_base parameter to false:

conda config --set auto_activate_base false

Thank you for installing Anaconda3!

===========================================================================

Anaconda and JetBrains are working together to bring you Anaconda-powered
environments tightly integrated in the PyCharm IDE.

PyCharm for Anaconda is available at:
https://www.anaconda.com/pycharm

Run the following command so that the conda setup takes effect in the current shell:

source ~/.bashrc

 

 

2. Create and activate a Python 3.6 conda environment

conda create -n py36 python=3.6
conda activate py36


3. Install the required packages

# Switch pip to the Tsinghua mirror

mkdir -p ~/.pip
touch ~/.pip/pip.conf

# Write the following into pip.conf

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
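The same pip.conf can also be generated from a script, which is handy when provisioning several machines. The following is a minimal sketch, assuming Python 3 is available; the helper name write_pip_conf is made up for this example, and the demo writes into a temporary directory rather than the real ~/.pip.

```python
import configparser
import tempfile
from pathlib import Path


def write_pip_conf(index_url, conf_path):
    """Write a minimal pip.conf whose [global] section sets index-url."""
    # Interpolation is disabled so that a '%' in a URL is stored verbatim.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"index-url": index_url}
    conf_path = Path(conf_path)
    # Equivalent of `mkdir -p` on the parent directory.
    conf_path.parent.mkdir(parents=True, exist_ok=True)
    with conf_path.open("w") as handle:
        config.write(handle)
    return conf_path


if __name__ == "__main__":
    # Demo only: write into a throwaway directory and show the result.
    demo_dir = Path(tempfile.mkdtemp())
    written = write_pip_conf(
        "https://pypi.tuna.tsinghua.edu.cn/simple",
        demo_dir / ".pip" / "pip.conf",
    )
    print(written.read_text())
```

To create the real file, pass Path.home() / ".pip" / "pip.conf" as the second argument.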

Install the packages

pip install numpy==1.16.2
pip install opencv-python==4.1.0.25
pip install keras==2.1.4
pip install tensorflow-gpu==1.13.1
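After installing, it can help to confirm that the active interpreter can actually locate each package before moving on. A minimal sketch, assuming it is run inside the py36 environment; the helper name missing_modules is made up for this example. It only checks that the modules can be found, it does not import them, so GPU library problems will not show up here.

```python
import importlib.util

# Import names of the packages installed above; note that the
# opencv-python package is imported as cv2.
REQUIRED_MODULES = ["numpy", "cv2", "keras", "tensorflow"]


def missing_modules(names):
    """Return the names that cannot be located by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules were found.")
```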

 


4. Install NCCL 2

Download page: https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html

Download the NCCL 2 package that matches your OS and CUDA version.

sudo dpkg -i nccl-repo-ubuntu1604-2.4.7-ga-cuda10.0_1-1_amd64.deb
sudo apt-key add /var/nccl-repo-2.4.7-ga-cuda10.0/7fa2af80.pub   # use the exact command printed by the previous step
sudo apt update
sudo apt install libnccl2=2.4.7-1+cuda10.0 libnccl-dev=2.4.7-1+cuda10.0

5. Install libcudnn (cuDNN)

Download the files that match your versions: https://developer.nvidia.com/rdp/cudnn-download

sudo dpkg -i libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.0.64-1+cuda10.0_amd64.deb

 

6. Install Open MPI

Build instructions: https://www.open-mpi.org/faq/?category=building#easy-build

sudo wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz
gunzip -c openmpi-4.0.1.tar.gz | tar xf -
cd openmpi-4.0.1/
sudo ./configure --prefix=/usr/local
sudo make all install

 

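7. Install Horovod

Documentation: https://github.com/horovod/horovod/blob/master/docs/gpus.rst

HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

Note: setting HOROVOD_WITH_TENSORFLOW=1 makes the installation fail if TensorFlow support cannot be built, instead of skipping it silently. That is useful when debugging a broken install.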
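Once Horovod is installed, training jobs are typically launched through Open MPI. The sketch below assembles such a launch command, using the mpirun flags shown in Horovod's documentation. The script name train.py, the process count, and the helper name build_mpirun_command are assumptions made for this example.

```python
import shlex


def build_mpirun_command(num_processes, script, hosts=None):
    """Build an mpirun invocation for a Horovod training script."""
    command = ["mpirun", "-np", str(num_processes)]
    if hosts:
        # Host list in Open MPI syntax, e.g. "localhost:4".
        command += ["-H", hosts]
    command += [
        "-bind-to", "none", "-map-by", "slot",
        # Forward environment variables to every worker process.
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        # Use TCP for MPI traffic and skip the openib transport.
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python", script,
    ]
    return command


if __name__ == "__main__":
    # Hypothetical single machine with 4 GPUs and a script named train.py.
    cmd = build_mpirun_command(4, "train.py", hosts="localhost:4")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with arguments such as ^openib.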

 

That completes the installation of the deep learning environment; you can now start training models.

 

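Common conda commands

Do not activate the conda base environment by default:
conda config --set auto_activate_base false
Leave the current conda environment:
conda deactivate
Enter a conda environment (base when no name is given):
conda activate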

 

 

 

Problems you may run into during installation:

1.

ImportError: libcudnn.so.7: cannot open shared object file: No such file or directory

Cause: cuDNN is not installed, or the wrong version is installed.

Fix: download the files that match your versions: https://developer.nvidia.com/rdp/cudnn-download

sudo dpkg -i libcudnn7_7.6.0.64-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.6.0.64-1+cuda10.0_amd64.deb

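2.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Cause: libcuda.so.1 is provided by the NVIDIA driver, so this usually means the driver is missing or does not match the installed CUDA version.

Fix: install matching driver and CUDA versions.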

3.

ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory

Cause: usually the CUDA libraries are not registered with the dynamic linker.

Fix: run the following command.

sudo ldconfig /usr/local/cuda/lib64
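The three ImportError cases above all come down to a shared library that the dynamic linker cannot load. A minimal sketch of a check that can be run before starting TensorFlow; the library names are taken from the error messages above, and the helper name check_shared_libraries is made up for this example.

```python
import ctypes

# Library names taken from the ImportError messages above.
CUDA_LIBRARIES = ["libcuda.so.1", "libcublas.so.10.0", "libcudnn.so.7"]


def check_shared_libraries(names):
    """Map each library name to True if the dynamic linker can load it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library is missing or cannot be loaded.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, found in check_shared_libraries(CUDA_LIBRARIES).items():
        print(name, "OK" if found else "NOT FOUND")
```

A library reported as NOT FOUND points to the corresponding fix above: the driver for libcuda, ldconfig or the CUDA toolkit for libcublas, and the cuDNN packages for libcudnn.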

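4. An odd one:

ModuleNotFoundError: No module named 'cv2'

If opencv-python is not installed, install it with the following command:

pip install opencv-python==4.1.0.25

If it is already installed and the error persists, the cause in my case was that the python command had been hijacked.

Run

which python

Output

~/anaconda3/envs/py36/bin/python

Run

~/anaconda3/envs/py36/bin/python

The version shown is 3.6.8,

but running plain python shows 3.6.6.

Cause: python was hijacked by an alias in ~/.bashrc. which does not see shell aliases, so it still reports the conda interpreter.

Fix: remove the python alias from ~/.bashrc (or comment it out, as here):

# alias python=/usr/bin/python3.6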
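Because which does not report aliases, a quick scan of ~/.bashrc can confirm whether python is being redirected. A minimal sketch, assuming the alias is written in the plain `alias python=...` form shown above; the helper name find_python_aliases is made up for this example.

```python
import re
from pathlib import Path

# Matches uncommented lines such as: alias python=/usr/bin/python3.6
ALIAS_PATTERN = re.compile(r"^\s*alias\s+python=")


def find_python_aliases(bashrc_text):
    """Return the uncommented lines that alias the python command."""
    return [
        line.strip()
        for line in bashrc_text.splitlines()
        if ALIAS_PATTERN.match(line)
    ]


if __name__ == "__main__":
    bashrc = Path.home() / ".bashrc"
    if bashrc.exists():
        for line in find_python_aliases(bashrc.read_text(errors="replace")):
            print("Found:", line)
```

Lines that are already commented out are ignored, so an empty result means no active python alias was found in that file.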

5. The following command fails:

conda create -n py36 python=3.6
WARNING: The conda.compat module is deprecated and will be removed in a future release.
Collecting package metadata: failed

UnavailableInvalidChannel: The channel is not accessible or is invalid.
  channel name: anaconda/pkgs/free
  channel url: https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  error code: 404

You will need to adjust your conda configuration to proceed.
Use `conda config --show channels` to view your configuration's current state,
and use `conda config --show-sources` to view config file locations.

Check the conda configuration (conda had been installed on this machine before):

conda config --show-sources
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
show_channel_urls: True

Cause: the Tsinghua mirror channels left over in the conda configuration are no longer reachable (the channel URL returns 404).

Fix: remove the Tsinghua channels.

conda config --remove channels 'https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/'
conda config --remove channels 'https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/'
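When several stale channels are configured, the removal commands can be derived from the `conda config --show-sources` output instead of being typed by hand. A minimal sketch, assuming the output lists channels as `- URL` entries like the ones shown above; the helper name removal_commands is made up for this example, and the parsing is deliberately simple rather than a full YAML reader.

```python
def removal_commands(show_sources_output, stale_host):
    """Build `conda config --remove channels` commands for a given host."""
    commands = []
    for line in show_sources_output.splitlines():
        entry = line.strip()
        # Channel entries appear as list items: "- <url or name>".
        if entry.startswith("- ") and stale_host in entry:
            url = entry[2:].strip()
            commands.append(["conda", "config", "--remove", "channels", url])
    return commands


if __name__ == "__main__":
    sample = (
        "channels:\n"
        "  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/\n"
        "  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/\n"
        "  - defaults\n"
        "show_channel_urls: True\n"
    )
    for command in removal_commands(sample, "mirrors.tuna.tsinghua.edu.cn"):
        print(" ".join(command))
```

Each returned list can be passed to subprocess.run once the output has been reviewed.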

6.

[$USER-nmg-22:33834] mca_base_component_repository_open: unable to open mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or directory (ignored)
[$USER-nmg-22:33733] mca_base_component_repository_open: unable to open mca_oob_ud: libibverbs.so.1: cannot open shared object file: No such file or directory (ignored)
[$USER-nmg-22:33733] mca_base_component_repository_open: unable to open mca_btl_openib: libibverbs.so.1: cannot open shared object file: No such file or directory (ignored)

Cause: libibverbs.so.1 is missing.

Fix: install libibverbs1.

apt-cache search libibverbs
sudo apt-get install libibverbs1

7.

python: symbol lookup error: /usr/local/lib/openmpi/mca_coll_cuda.so: undefined symbol: opal_cuda_check_bufs

Cause: the Open MPI installation is broken, or it conflicts with another installed version.

Fix: uninstall Open MPI and install it again.

cd /where/your/old_mpi/sources/are   # source directory of the previously installed version
sudo make uninstall
sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/libmca* /usr/local/lib/libmpi* /usr/local/lib/libompitrace* /usr/local/lib/libopen* /usr/local/lib/liboshmem* /usr/local/lib/mpi_*
cd /where/your/mpi/sources/are   # source directory of the version to install
sudo ./configure --prefix=/usr/local
sudo make all install

 

8.

tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

In my case the matching versions were already installed (CUDA 10.0, libcudnn7-dev_7.6.0.64, tensorflow-gpu 1.13.1), but a leftover empty /usr/local/cuda-9.0/ directory interfered. Deleting that directory fixed it.

sudo rm -rf /usr/local/cuda-9.0/

 

Reposted from: https://www.cnblogs.com/orzs/p/10943164.html
