Code repository:
https://github.com/jeonsworld/ViT-pytorch
1. Error
Traceback (most recent call last):
  File "train.py", line 16, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/mnt/public/users/lig/anaconda/envs/vit4/lib/python3.6/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
The explanation found online is correct: this is a setuptools version problem. Switching to an older setuptools version fixes it:
pip uninstall setuptools
conda install setuptools==58.0.4
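To confirm that the downgrade took effect, a small version comparison can help. This is only a minimal sketch: the helper names `version_tuple` and `is_at_most` are hypothetical, and 58.0.4 is simply the version used above, not an official cutoff.

```python
import re


def version_tuple(version):
    """Turn a dotted version string such as '58.0.4' into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        # Keep the leading digits of each component; stop at the first
        # component that has none (for example 'post1' or 'dev0').
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_most(installed, known_good="58.0.4"):
    """Return True if `installed` is not newer than `known_good`."""
    return version_tuple(installed) <= version_tuple(known_good)
```

Inside the environment, this could be used as `import setuptools` followed by `is_at_most(setuptools.__version__)`.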
2. Error
  File "train.py", line 17, in <module>
    from apex import amp
ModuleNotFoundError: No module named 'apex'
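Searching suggested this might be a Python version problem. The original environment was Python 2.7, so I created a new Python 3.7 environment (I first set it to 3.6, but later hit an error saying that at least 3.7 is required):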
conda create -n vit4 python=3.7
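Since the wrong interpreter version caused trouble twice here, it may be worth checking it before going further. The snippet below is a rough sketch; `meets_minimum` is a hypothetical helper, and the (3, 7) default just mirrors the requirement mentioned above.

```python
import sys


def meets_minimum(version_info, minimum=(3, 7)):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Check the interpreter that is currently running.
print(meets_minimum(sys.version_info))
```

Run it with the environment activated so that `sys.version_info` reflects the interpreter that will actually run train.py.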
Install apex in that environment:
conda activate vit4
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Or: python setup.py install [--cuda_ext] [--cpp_ext]
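After the install finishes, one quick way to see whether the package is visible to the active interpreter is to look it up without importing it. This is a sketch using the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("apex installed:", is_installed("apex"))
```

This only confirms that the package can be found. It does not verify that the C++ and CUDA extensions were built.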
Running the code then produced another error:
AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'
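I reinstalled torch (I had previously pinned a version), but the error persisted, and now apex could not even be installed. This is most likely a CUDA version mismatch: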
OSError: /mnt/public/users/lig/anaconda/envs/vit6/lib/python3.7/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference
Reinstalling the pinned torch==1.6 resolved the first problem. I then installed version 1.8 instead (my CUDA version is 11.1).
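Because the failures above come down to which CUDA build of torch is installed, it can help to read the CUDA tag from the torch version string. PyTorch wheels for a specific CUDA release typically carry a suffix such as `+cu111`. The sketch below assumes that suffix convention; `cuda_from_torch_version` is a hypothetical helper.

```python
import re


def cuda_from_torch_version(torch_version):
    """Extract the CUDA version from a string like '1.8.0+cu111'.

    Returns None when the string has no '+cuXYZ' suffix (for example a
    CPU-only build or a wheel without a local version tag).
    """
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"
```

With torch installed, `cuda_from_torch_version(torch.__version__)` can be compared with `torch.version.cuda` and with the toolkit version reported by `nvcc --version`.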
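The remaining error still looked like an apex problem. I tried the fix for the fourth problem in the CSDN blog post "apex安装常见的三个报错并成功解决(亲测有效)" (three common apex installation errors and how to fix them) by weixin_59726951, and it worked.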