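I. Virtual environment setup
1. Create a virtual environment and activate it:
conda create -n env_name python=3.7 # create the environment; replace env_name with your own name
source activate env_name # activate the environment (conda 4.4 and later also accept: conda activate env_name)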
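Before installing anything, it can help to confirm that the intended environment is really active. A minimal sketch, assuming conda exports the CONDA_DEFAULT_ENV variable on activation; the helper name in_env is hypothetical and env_name is the placeholder used in these notes:

```shell
# Hypothetical helper: succeed only when the named conda environment is active.
# Relies on CONDA_DEFAULT_ENV, which conda sets when an environment is activated.
in_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if in_env env_name; then
    echo "env_name is active"
else
    echo "env_name is not active; current environment: ${CONDA_DEFAULT_ENV:-none}"
fi
```

Running this check first avoids installing packages into the base environment by accident.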
2. Install PyTorch in the virtual environment:
(1) Check the CUDA version with:
nvcc --version
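Only the release number matters when choosing a PyTorch build. A small sketch that pulls it out of the nvcc output, assuming the usual line of the form "Cuda compilation tools, release 10.1, V10.1.243"; the helper name cuda_release is hypothetical:

```shell
# Hypothetical helper: read `nvcc --version` output on stdin and print only
# the major.minor release number (for example 10.1).
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | cuda_release
else
    echo "nvcc not found; is the CUDA toolkit on PATH?" >&2
fi
```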
(2) On the PyTorch website, pick a PyTorch build that matches your CUDA version:
PyTorch website: https://pytorch.org/get-started/previous-versions/
Reference: https://blog.csdn.net/xuzhichao123456/article/details/109218835
I chose PyTorch 1.2.0 (for CUDA 10.0). If conda is slow, use pip instead.
# For CUDA 10.0:
pip install torch==1.2.0 torchvision==0.4.0
# For CUDA 10.1, either of:
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch
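The version choices above can be kept as a small lookup so the same command is reused on every machine. This is only a sketch: the function name pytorch_install_cmd is hypothetical, and it covers just the two CUDA releases recorded in these notes.

```shell
# Hypothetical lookup: print the install command recorded in these notes
# for a given CUDA release; fail for releases the notes do not cover.
pytorch_install_cmd() {
    case "$1" in
        10.0) echo "pip install torch==1.2.0 torchvision==0.4.0" ;;
        10.1) echo "conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch" ;;
        *)    echo "no command recorded for CUDA $1; see https://pytorch.org/get-started/previous-versions/" >&2
              return 1 ;;
    esac
}

# Print (not run) the command for CUDA 10.0.
pytorch_install_cmd 10.0
```

The function only prints the command, so it can be reviewed before being pasted into the activated environment.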
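3. Install cv2, numpy, and other required modules (packages):
(1) Common install and uninstall commands:
sudo apt-get install <package> # uninstall: sudo apt-get remove <package>
pip install <package> # uninstall: pip uninstall <package>
conda install <package> # uninstall: conda uninstall <package>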
(2) Installing specific modules (packages/libraries):
source activate env_name # activate the environment first, then install
conda install scipy matplotlib tensorboard h5py tqdm
pip install timm==0.3.2
pip install numpy scipy matplotlib tensorboard h5py tqdm
# or, one at a time:
pip install numpy # usually already installed as a PyTorch dependency
pip install scipy
pip install matplotlib
pip install tensorboard
pip install h5py
pip install tqdm
conda install scikit-image
python -m pip install opencv-python # provides the cv2 module
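After installing, a quick import check shows which modules are still missing. Note that some import names differ from the package names: scikit-image is imported as skimage and opencv-python as cv2. A minimal sketch; the function name check_imports and the PYTHON override variable are hypothetical:

```shell
# Hypothetical helper: try to import each named module and print the names
# that failed, separated by spaces. Uses the interpreter named in $PYTHON,
# or "python" if that is unset. If the interpreter itself is missing,
# every module is reported as failed.
check_imports() {
    py="${PYTHON:-python}"
    missing=""
    for mod in "$@"; do
        if ! "$py" -c "import $mod" >/dev/null 2>&1; then
            missing="$missing $mod"
        fi
    done
    echo "${missing# }"
}

echo "missing modules: $(check_imports numpy scipy matplotlib tensorboard h5py tqdm skimage cv2 timm)"
```

Run it inside the activated environment so that python resolves to that environment's interpreter.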
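(3) Install apex, a tool that lowers GPU memory usage by changing the numeric format of the data (mixed-precision training).
cd PATH_TO_INSTALL
git clone https://github.com/NVIDIA/apex
cd apex
conda activate env_name # activate the environment
git reset --hard 4ef930c1c884fdca5f472ab2ce7cb9b505d26c1a # pin apex to this commit
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ # build with the C++ and CUDA extensions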
II. Clone a virtual environment on the same machine:
conda create -n new_env_name --clone old_env_name
This copies the environment old_env_name to a new environment named new_env_name.
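To confirm that the clone was created, the output of conda env list can be searched for the new name. A small sketch, assuming the usual listing format in which each environment name is the first column; the helper name env_exists is hypothetical:

```shell
# Hypothetical helper: read `conda env list` output on stdin and succeed
# if an environment with exactly the given name is listed.
env_exists() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

if command -v conda >/dev/null 2>&1; then
    if conda env list | env_exists new_env_name; then
        echo "new_env_name exists"
    else
        echo "new_env_name not found"
    fi
fi
```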
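III. Copy a conda environment (also works across machines):
conda activate env_name # on the source machine: activate the environment
conda env export --file env_name.yml # export the environment to a .yml file
cd path_dir_of_yml # on the target machine: go to the directory containing the .yml file
conda env create -f env_name.yml # recreate the environment from the file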
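The exported .yml file typically ends with a prefix: line that records the environment's path on the source machine. Some people remove that line before moving the file to another machine. A sketch of that cleanup, offered as an optional step rather than something conda requires; the function name strip_prefix_line and the output file name env_name.portable.yml are hypothetical:

```shell
# Hypothetical helper: copy stdin to stdout, dropping the machine-specific
# "prefix:" line that `conda env export` writes at the end of the file.
strip_prefix_line() {
    sed '/^prefix:/d'
}

# Only runs when an exported file is present in the current directory.
if [ -f env_name.yml ]; then
    strip_prefix_line < env_name.yml > env_name.portable.yml
fi
```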
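IV. Copy pip packages (also works across machines):
conda activate env_name # activate the source environment
pip freeze > requirements.txt # export the installed packages to a .txt file
cd path_dir_of_txt # go to the directory containing the .txt file
source activate env_name # activate the newly created environment
pip install -r requirements.txt # install every package listed in the .txt file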
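When pip freeze runs inside a conda environment, packages that were installed by conda can show up as "name @ file:///..." entries that point at build paths on the source machine, and those entries will not install elsewhere. A small sketch that lists such lines so they can be replaced with name==version by hand; the function name local_path_requirements is hypothetical:

```shell
# Hypothetical helper: read a requirements file on stdin and print the
# entries that reference a local file path instead of a version number.
local_path_requirements() {
    grep ' @ file://' || true
}

# Only runs when a requirements file is present in the current directory.
if [ -f requirements.txt ]; then
    local_path_requirements < requirements.txt
fi
```

If many entries are affected, pip list --format=freeze is another way to get plain name==version lines.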
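V. Consolidated steps for copying an environment on the 九天毕昇 (Jiutian Bisheng) platform
(1) Export and import the environment
conda activate env_name # activate the environment
conda env export --file env_name.yml # export the environment to a .yml file
cd path_dir_of_yml # go to the directory containing the .yml file
conda env create -f env_name.yml # recreate the environment from the file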
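(2) Install PyTorch
source activate env_name # activate the environment first, then install
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch
(3) Install any other libraries the environment is missing
source activate env_name # activate the environment first, then install
pip install timm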
(4) Install apex for multi-GPU training
cd PATH_TO_INSTALL
git clone https://github.com/NVIDIA/apex
cd apex
git reset --hard 4ef930c1c884fdca5f472ab2ce7cb9b505d26c1a # pin apex to this commit
source activate env_name # activate the environment
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
If the build fails with an error like this:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/root/share/nlspn_nyu/apex/setup.py", line 152, in <module>
check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/root/share/nlspn_nyu/apex/setup.py", line 106, in check_cuda_torch_binary_vs_bare_metal
"https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. "
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.2.
In some cases, a minor-version mismatch will not cause later errors: https://github.com/NVIDIA/apex/pull/323#discussion_r287021798. You can try commenting out this check (at your own risk).
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /root/.local/conda/envs/lsntorch/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/root/share/nlspn_nyu/apex/setup.py'"'"'; __file__='"'"'/root/share/nlspn_nyu/apex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-lewx3715/install-record.txt --single-version-externally-managed --user --prefix= --compile --install-headers /root/.local/include/python3.6m/apex Check the logs for full command output.
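In that case, run pip list
and conda list
to check the installed versions of torch and torchvision. If the two lists disagree, uninstall the version you do not want:
pip uninstall torch # for a copy installed by conda, use conda uninstall instead
pip uninstall torchvision
Then run the build again:
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./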
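The error above is about the CUDA version PyTorch was built with differing from the CUDA toolkit used for compiling. Before rebuilding, the two can be compared directly. A minimal sketch, assuming python and nvcc resolve to the ones in the activated environment; the helper name cuda_versions_match is hypothetical, and torch.version.cuda prints None for CPU-only builds:

```shell
# Hypothetical helper: succeed when two CUDA version strings share the
# same major.minor part (for example 10.1 and 10.1.243).
cuda_versions_match() {
    a=$(printf '%s\n' "$1" | cut -d. -f1-2)
    b=$(printf '%s\n' "$2" | cut -d. -f1-2)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# CUDA version the installed PyTorch binary was built with (empty if torch is missing).
torch_cuda=$(python -c 'import torch; print(torch.version.cuda)' 2>/dev/null || true)
# CUDA toolkit release reported by nvcc (empty if nvcc is missing).
nvcc_cuda=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1 || true)

if [ -z "$torch_cuda" ] || [ -z "$nvcc_cuda" ]; then
    echo "could not determine both versions (torch: ${torch_cuda:-?}, nvcc: ${nvcc_cuda:-?})"
elif cuda_versions_match "$torch_cuda" "$nvcc_cuda"; then
    echo "PyTorch CUDA $torch_cuda matches nvcc CUDA $nvcc_cuda"
else
    echo "mismatch: PyTorch built with CUDA $torch_cuda, nvcc reports CUDA $nvcc_cuda"
fi
```

If a mismatch is reported, the torch that python imports is probably not the one that was meant to be installed, which is what the pip list and conda list comparison above is looking for.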