Goal
Build NVIDIA's official quantization project retinanet on a Tesla T4.
deb & run conflicting
Unable to determine the device handle for GPU 0000:B3:00.0:
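- The device had lost its connection to the host. After a long investigation, the cause turned out to be overheating. The Tesla is entirely passively cooled, and there are some YouTube videos on cooling Tesla cards. I simply attached a laptop exhaust fan, which now looks insufficient: the temperature settles at 84 °C.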
- https://blog.csdn.net/junmuzi/article/details/80707343
- update-grub
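Since the root cause was heat, it helps to watch the temperature while the card is under load. Below is a minimal Python sketch that reads it through nvidia-smi's --query-gpu interface. The function names and the 80 °C warning threshold are illustrative assumptions, not values from NVIDIA.

```python
import subprocess

# nvidia-smi prints one "index, temperature" line per GPU with these options.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn lines such as '0, 84' into a dict {gpu_index: temperature_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps, limit_c=80):
    """Return the indices of GPUs at or above limit_c (threshold is an assumption)."""
    return sorted(index for index, temp in temps.items() if temp >= limit_c)


def read_temperatures():
    """Query nvidia-smi once; raises CalledProcessError if the GPU is unreachable."""
    result = subprocess.run(
        QUERY, check=True, stdout=subprocess.PIPE, universal_newlines=True
    )
    return parse_temperatures(result.stdout)
```

Calling hot_gpus(read_temperatures()) in a loop during a run shows whether the fan keeps up. If nvidia-smi itself fails, the card has probably dropped off the bus again.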
uninstall
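Install apex
- Just follow the official installation procedure.
Install pycuda
- Follow the installation tutorial.
- Set the configure parameters. My environment is Ubuntu 18, which ships Python 3.6 by default; I use Python 3.5 with CUDA 10.1.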
./configure.py --python-exe=/usr/bin/python --cuda-root=/usr/local/cuda --cudadrv-lib-dir=/usr/lib --boost-inc-dir=/usr/include --boost-lib-dir=/usr/lib/x86_64-linux-gnu --boost-python-libname=boost_python-py36 --no-use-shipped-boost
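- Note: building from source here means running python setup.py install and then pip install as well, whereas a PyTorch source build only needs setup.py install. Building and installing this package has quite a few twists and turns.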
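Boost.Python is built for one specific Python version, so the value given to --boost-python-libname should match the interpreter passed to --python-exe. A small sketch for deriving the name, assuming the Ubuntu-style naming boost_python-pyXY used in the command above (other distributions name the library differently); boost_python_libname is a hypothetical helper.

```python
import sys


def boost_python_libname(version_info=None):
    """Return e.g. 'boost_python-py36' for Python 3.6 (Ubuntu-style name, an assumption)."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return "boost_python-py{}{}".format(major, minor)


# Show which interpreter is running and the library name that would match it.
print(sys.executable, boost_python_libname())
```

Run it with the same interpreter that is passed to --python-exe and compare the result with the configure line.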
docker
- unable to evaluate symlinks in Dockerfile path: just follow the documented procedure at https://github.com/NVIDIA/retinanet-examples
- sudo docker build -t retinanet .
Install dali
Install cocoapi
- cocoapi pycocotools/_mask.c: No such file or directory
- sudo pip install cython
- fatal error: pybind11/pybind11.h: No such file or directory
# Use the Python interpreter to find the libs.
if(PythonLibsNew_FIND_REQUIRED)
find_package(PythonInterp 3.5 REQUIRED )
else()
find_package(PythonInterp ${PythonLibsNew_FIND_VERSION})
endif()
- sudo make install
run
- git clone https://github.com/nvidia/retinanet-examples
- docker build -t retinanet:latest retinanet/
- sudo docker run --gpus '"device=1"' --name=nv_retian --ipc=host -it retinanet:latest
- retinanet infer retinanet_rn50fpn.pth --images /home/user/datasets/coco/val2017/ --annotations /home/user/datasets/coco/annotations/instances_val2017.json
undefined symbol: _ZN2cv8fastFreeEPv (cv::fastFree(void*))
- I suspect OpenCV is to blame.
- https://github.com/NVIDIA/retinanet-examples/issues/38
- After installing python-opencv, the error changes.
E: unable to locate package
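- See the reference URL.
- /etc/apt/sources.list.d: alternatively, write a corresponding file under this path, modeled on the ones already there.
glog gflags gcc
- Ubuntu 18 defaults to gcc-7; install gcc-5 with sudo apt install gcc-5
- gflags.cc.o: `stderr@@GLIBC_2.2.5': gflags has to be built as a shared library. The relevant part is how options are passed to cmake; the original links cover both the reasoning and the procedure. Full installation walkthrough: https://blog.csdn.net/Amazingren/article/details/81873514
- Following on from the above, the error 'gflags' has not been declared appears. This happens because gflags was built with google as its namespace.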
cannot find -lopencv_xfeatures2d
- This needs opencv_contrib; see "how to install opencv & opencv_contrib" and this CSDN tutorial.
- No package 'gtk+-3.0' found: sudo apt-get install build-essential libgtk-3-dev
- No package 'gstreamer-base-1.0' found: according to a forum answer, the missing package is libgstreamer-plugins-base1.0-dev.
- missing: JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH: https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04
- Duplicated modules NAMES has been found: the contrib repository had not been switched to the right branch.
- Additional note: contrib downloads some neural networks, such as vgg and face-landmark, as well as the machine-learning library boost.
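docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
- docker and nvidia-docker differ somewhat. docker now supports much of what nvidia-docker used to provide, so the plain docker command works directly. However, the latter did make a few adjustments (especially with the deep integration of the two since Docker 19), so the documented procedure still has to be followed.
- Also, once you start using docker, you soon want to move where docker stores its data. The method is easy to find online: a symlink. After creating the link, remember to run service stop docker and then start it again, otherwise other problems appear. NVIDIA users also need to go through the procedure above once more.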
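The symlink move described above can be sketched in Python as follows. This is only an illustration under stated assumptions: /var/lib/docker is Docker's default data directory on Linux, the target path is hypothetical, the docker service must already be stopped, and the script has to run as root.

```python
import os
import shutil


def relocate_with_symlink(src, dst):
    """Move directory src to dst and leave a symlink at src that points to dst."""
    src = os.path.abspath(src)
    dst = os.path.abspath(dst)
    if os.path.islink(src):
        raise ValueError("%s is already a symlink" % src)
    if os.path.exists(dst):
        raise ValueError("%s already exists" % dst)
    shutil.move(src, dst)   # falls back to copy + delete across filesystems
    os.symlink(dst, src)    # the old location keeps working through the link
    return dst


# Example call (hypothetical target path), to be run between stopping and
# restarting the docker service:
# relocate_with_symlink("/var/lib/docker", "/data/docker")
```

The function refuses to run if the source is already a link or the target exists, so calling it twice by accident does not nest directories.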
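PyTorch issues
When installing the PyTorch binary with pip, the version available at the time of writing (August 21, 2019) is PyTorch 1.1.0 for CUDA 10.0 (I did not plan to install 1.2.0 :-)). In fact, building from source lets PyTorch support CUDA 10.1.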
1. Install Dependencies
pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
2. Get the PyTorch Source
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
pip install scikit-build --user
pip install ninja --user
git submodule update --init
pip install -U setuptools
pip install -r requirements.txt
3. Install PyTorch
export USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1
cd ~/pytorch
python setup.py build
python setup.py install
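4. import torch fails with an error at from torch._C import *
In my case the build directory also contained a torch folder, and I suspect the installation ended up pointing the path there. My guess is that a reboot would also help. Renaming that folder fixed it for me.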
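A likely explanation is that Python puts the current directory ahead of the installed packages when it is started from the source checkout, so the local torch folder, which lacks the compiled torch._C module, gets imported instead. A minimal sketch for spotting such a folder before importing; shadowing_candidates is a hypothetical helper name.

```python
import os


def shadowing_candidates(package, directory="."):
    """List entries in directory that would be imported instead of the installed package."""
    base = os.path.abspath(directory)
    hits = []
    # A folder named like the package, or a same-named .py file, wins the import.
    for name in (package, package + ".py"):
        path = os.path.join(base, name)
        if os.path.exists(path):
            hits.append(path)
    return hits
```

Running shadowing_candidates("torch") from the directory where import torch fails returns a non-empty list when a local folder is in the way. Changing to another directory is then an alternative to renaming the folder.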
The current installation is:
torch.version.cuda '10.1.243'
torch 1.1.0
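The values above can be read back with a short check like the one below. torch.__version__, torch.version.cuda and torch.cuda.is_available() are standard PyTorch attributes; describe_torch is a hypothetical helper that also reports where the package was imported from.

```python
def describe_torch():
    """Report the imported PyTorch build, or the import error if it is not usable."""
    try:
        import torch
    except ImportError as exc:
        return {"installed": False, "error": str(exc)}
    return {
        "installed": True,
        "version": torch.__version__,               # e.g. 1.1.0 for the build above
        "cuda": torch.version.cuda,                  # None for a CPU-only build
        "cuda_available": torch.cuda.is_available(),
        "path": torch.__file__,                      # shows which copy was imported
    }


print(describe_torch())
```

If cuda is None, the build was compiled without CUDA support, whatever toolkit is installed on the machine.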
test
retinanet infer retinanet_rn50fpn.pth --images /datasets/coco/val2017/ --annotations /datasets/coco/annotations/instances_val2017.json
train
retinanet train retinanet_rn50fpn.pth --backbone ResNet50FPN \
--images /datasets/coco/train2017/ --annotations /datasets/coco/annotations/instances_train2017.json \
--val-images /datasets/coco/val2017/ --val-annotations /datasets/coco/annotations/instances_val2017.json
- Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.torch/models/resnet50-19c8e357.pth
docker
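Start docker
sudo docker run --gpus '"device=0"' --name=nv_retian --ipc=host -it -v /home/user/datasets/:/datasets retinanet:latest
Mount a local directory
- -v /host:/docker's: the host path comes first, then the path inside the container
- How to remove <none> images in docker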
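One common way to clear <none> images is to list the dangling image IDs and pass them to docker rmi. The sketch below wraps that in Python; the helper names are hypothetical, and the commands may need sudo, as in the docker run line above.

```python
import subprocess

# Lists only the IDs of dangling (<none>) images, one per line.
LIST_DANGLING = ["docker", "images", "--filter", "dangling=true", "--quiet"]


def removal_command(image_ids_text):
    """Build the 'docker rmi' command for the listed IDs, or None if there are none."""
    ids = image_ids_text.split()
    if not ids:
        return None
    return ["docker", "rmi"] + ids


def remove_dangling_images():
    """List dangling images and remove them; returns the command that was run."""
    listing = subprocess.run(
        LIST_DANGLING, check=True, stdout=subprocess.PIPE, universal_newlines=True
    ).stdout
    command = removal_command(listing)
    if command is not None:
        subprocess.run(command, check=True)
    return command
```

Returning None when nothing is listed avoids calling docker rmi without arguments, which would fail.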
RuntimeError: Failed to export an ONNX attribute, since it's not constant, please try to make things (e.g., kernel size) static if possible #137
- The upsample_nearest2d function needs to be modified. It lives in site-packages/torch/onnx/symbolic.py. If you cannot find the path of Python's third-party libraries, see this post.
- For how to modify the file, see this Q&A on GitHub, that is, this link: pytorch/pytorch@11845cf
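If the site-packages location is unclear, the file can also be located from Python itself. A minimal sketch using importlib from the standard library; locate_in_package is a hypothetical helper, and the torch path in the example is the one named above.

```python
import importlib.util
import os


def locate_in_package(package, relative_path):
    """Return the absolute path of relative_path inside an installed package, or None."""
    spec = importlib.util.find_spec(package)  # looks the package up without importing it
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = os.path.join(root, relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None


# Example: the file that contains upsample_nearest2d in the setup described above.
print(locate_in_package("torch", os.path.join("onnx", "symbolic.py")))
```

The lookup follows the same search order as import, so it points at the copy of torch that the interpreter would actually load.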