Setting up EfficientLO under Docker

AlexZYZ418

Last modified: 2023-11-27 11:37:42


First published: 2023-11-17 09:33:56

Article link: https://blog.csdn.net/m0_63629923/article/details/134385329


0. Prerequisites

The official EfficientLO page specifies the following environment:

python 3.6.8
CUDA 9.0
TensorFlow 1.12.0
numpy 1.16.1
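
To make the list above checkable from inside a container, here is a minimal sketch that compares the versions it finds against the required ones. The names REQUIRED, find_mismatches and installed_versions are my own, not part of EfficientLO, and the CUDA version is left out because it cannot be read from Python alone.

```python
import sys

# Versions listed in the prerequisites above (CUDA is checked separately with nvcc).
REQUIRED = {"python": "3.6.8", "tensorflow": "1.12.0", "numpy": "1.16.1"}


def find_mismatches(found, required=REQUIRED):
    """Return {name: (found, required)} for every version that differs."""
    return {
        name: (found.get(name), wanted)
        for name, wanted in required.items()
        if found.get(name) != wanted
    }


def installed_versions():
    """Collect versions from the running interpreter; missing packages map to None."""
    found = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in ("tensorflow", "numpy"):
        try:
            found[name] = __import__(name).__version__
        except ImportError:
            found[name] = None
    return found
```

Calling find_mismatches(installed_versions()) inside the container should return an empty dict when everything lines up.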

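1. Environment setup

A Docker environment with NVIDIA GPU support has to be started with nvidia-docker run.

See the third point of this article, which is about coping with open-source library version dependencies:

【杂谈】如何应对烦人的开源库版本依赖-做一个心平气和的程序员？

I learned how to write the Dockerfile from this article:

dockercuda9 - 老白网络

However, the Dockerfile build step ran into a small problem.

After entering

docker build -t efflotry .

(efflotry is the image name), the build failed with an error.

Fix:

【Ubuntu docker运行dockerfile时报错】GPG error_尽量不拖延的小王的博客-CSDN博客

Add the following line to the Dockerfile before the statement that fails: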

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC

The final Dockerfile was:

FROM nvidia/cuda:9.0-cudnn7-runtime
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libcudnn7=7.0.5.15-1+cuda9.0 && \
apt-mark hold libcudnn7 && \
rm -rf /var/lib/apt/lists/*
RUN pip install tensorflow-gpu==1.12.0

But the build still failed, with the following error:

error: failed to solve: process "/bin/sh *************" did not complete successfully: exit code: 100

Searching for this error turned up:

linux-qemu: uncaught target signal 11 (Segmentation fault) - 糯米PHP

https://medium.com/@pranjaldoshi96/building-arm-jetson-image-on-x86-linux-machine-using-docker-buildx-752293ce9c90

These suggest building with buildx instead. Compiling QEMU for that then failed with:

ERROR: glib-2.56 gthread-2.0 is required to compile QEMU

Fix (from the following article):

[xv6] xv6 的运行环境搭建 - 知乎
sudo apt install libglib2.0-dev

After that, build the image with buildx:

docker buildx build . --platform linux/amd64 -t effilotry:v1

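All of the above turned out to be wasted effort...

After a long struggle, the following article finally got nvidia-smi working inside Docker:

Ubuntu下 NVIDIA Container Runtime 安装与使用_nvidia-container-runtime 安装_MAVER1CK的博客-CSDN博客

Use NVIDIA_VISIBLE_DEVICES to enable all GPUs: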
docker run --rm --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda nvidia-smi

The container needed for the later steps is then started with:

docker run -it -p 5900:5900 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /home/zyz/EfficientLO_docker:/home/test --name effilotry nvidia/cuda:9.0-cudnn7-runtime

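However, my own machine has the nvidia-525 driver installed, so the CUDA version may not match? I plan to look into this later, since the image I pulled is the CUDA 9.0 one.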
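
On the driver question: NVIDIA drivers are generally backward compatible with older CUDA runtimes, so a newer driver should not by itself rule out a CUDA 9.0 image. Below is a minimal sketch of that comparison. The 384.81 minimum is quoted from memory of NVIDIA's CUDA release notes and should be verified there, and the 525.147.05 string in the usage note is only an illustrative 525-series version, not the one read from my machine.

```python
# Assumed minimum Linux driver for CUDA 9.0, recalled from NVIDIA's CUDA
# release notes; verify against the official table before relying on it.
MIN_DRIVER_FOR_CUDA_9_0 = "384.81"


def parse_version(text):
    """Turn a dotted version such as '525.147.05' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, minimum=MIN_DRIVER_FOR_CUDA_9_0):
    """True when the installed driver is at least the required minimum."""
    return parse_version(driver_version) >= parse_version(minimum)
```

For example, driver_supports("525.147.05") evaluates to True under this assumption. It says nothing about whether the CUDA 9.0 toolkit ships native kernels for a newer GPU, which is a separate question.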

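Next, install Python first.

The steps below are copied from a guide on installing Python 3.6.8.

I. Download the Python 3.6.8 source archive (the download directory is arbitrary; the recommended install path is /usr/local/python3)

1. Create the install path
    mkdir -p /usr/local/python3
2. Download the Python source
    # working directory: /home/worker/${name}
    # if the official server is slow, use https://registry.npmmirror.com/-/binary/python/3.6.8/Python-3.6.8.tgz
    wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tgz
3. Extract the source archive
    # working directory: /home/worker/${name}
    tar -zxvf Python-3.6.8.tgz
II. Install Python 3.6.8

1. Configure the build
    # working directory: /home/worker/${name}/Python-3.6.8/
    ./configure --prefix=/usr/local/python3
2. Build and install Python 3.6.8
    # working directory: /home/worker/${name}/Python-3.6.8
    make && make install
3. Create the symbolic links
    ln -s /usr/local/python3/bin/python3 /usr/bin/python3
    ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
4. Verify the installation
    python3 -V    # prints "Python 3.6.8" on success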

Installing TensorFlow

Install pip first, following:

linux下提示：pip未找到命令（bash: pip: command not found）_-bash: pip: command not found-CSDN博客

Then install the wheel from tensorflow-gpu · PyPI:

pip install tensorflow_gpu-1.12.0-cp36-cp36m-manylinux1_x86_64.whl

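I give up... better to configure everything from a plain Ubuntu 18 again:

爆详细Ubuntu18.04,CUDA9.0,OpenCV3.1,Tensorflow完全配置指南-CSDN博客

I then wanted to take a shortcut and use an image that someone else had already configured: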

docker pull pytorch/pytorch:0.4.1-cuda9-cudnn7-runtime

This person's image can also be used (please let me know if this infringes anything):

docker pull muyeby/py3.6.8cu90torch1.0.0:latest

Thank heavens, in the end it was this image on Docker Hub that saved me:

docker pull pengjl929/cu90-py36-tf112-snt129:v0.1

Pull it as shown, then run the following command:

docker run -it -p 5900:5900 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /home/zyz/EfficientLO_docker:/home/test --name effilotry pengjl929/cu90-py36-tf112-snt129:v0.1

Later I tried this command instead (the final version; this is the one I use):

docker run -t -i --gpus all -p 5900:5900 -v /home/zyz/EfficientLO_docker:/home/test --name effilotry pengjl929/cu90-py36-tf112-snt129:v0.1

Use the following commands to verify the TensorFlow and CUDA versions:

# check the TensorFlow version
pip list |grep tenso

# check the CUDA version
nvcc -V
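
If the CUDA check needs to be scripted rather than read by eye, a small sketch like the one below can pull the release number out of the nvcc -V text. The sample string follows the usual nvcc layout for CUDA 9.0 but is illustrative, not captured from this container, and cuda_release is a name I made up.

```python
import re


def cuda_release(nvcc_output):
    """Extract the 'release X.Y' number from `nvcc -V` text, or None if absent."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Illustrative sample in the usual nvcc layout, not captured from this container.
sample = "Cuda compilation tools, release 9.0, V9.0.176"
print(cuda_release(sample))
```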

Verify that TensorFlow runs:

import tensorflow as tf 
 
sess = tf.Session() 
a = tf.constant(1) 
b = tf.constant(1) 
print(sess.run(a+b))

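Loading the GPU takes a while partway through. I mistook the pause for an error and spent some time on it, but after waiting a moment it went through.

Finally, install numpy 1.16.1:

pip install numpy==1.16.1

Test the GPU.

First list the available devices with these two lines: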

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Output:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2141397179551110845
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17703738136983886232
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 15381415839974514119
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7270442599
locality {
  bus_id: 1
  links {
  }
}
incarnation: 5021056047332816420
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9"
]

This shows that the GPU is detected.
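
The listing mixes CPU, XLA and physical GPU entries, so here is a minimal sketch for picking out only the physical GPUs. It relies on the name and device_type fields visible in the output above; the SimpleNamespace objects are stand-ins so the snippet runs without TensorFlow, and physical_gpu_names is a hypothetical helper name.

```python
from types import SimpleNamespace


def physical_gpu_names(devices):
    """Return names of devices whose type is exactly 'GPU' (XLA_GPU is skipped)."""
    return [dev.name for dev in devices if dev.device_type == "GPU"]


# Stand-ins mirroring the name/device_type fields printed above; inside the
# container, pass device_lib.list_local_devices() instead.
devices = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:XLA_GPU:0", device_type="XLA_GPU"),
    SimpleNamespace(name="/device:XLA_CPU:0", device_type="XLA_CPU"),
    SimpleNamespace(name="/device:GPU:0", device_type="GPU"),
]
print(physical_gpu_names(devices))
```

An empty result would mean TensorFlow sees no usable GPU, even if an XLA_GPU entry is present.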

Then test whether computation actually runs on it:

import tensorflow as tf
 
with tf.device('/cpu:0'):
    a = tf.constant([1.0,2.0,3.0],shape=[3],name='a')
    b = tf.constant([1.0,2.0,3.0],shape=[3],name='b')
with tf.device('/gpu:0'):
    c = a+b
   
# Note: allow_soft_placement=True lets TensorFlow pick the device itself; without it an error is raised,
# because not every operation can run on the GPU and pinning one that cannot to the GPU fails.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,log_device_placement=True))
#sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(tf.global_variables_initializer())
print(sess.run(c))

2023-11-16 16:39:44.495512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 4060 Laptop GPU major: 8 minor: 9 memoryClockRate(GHz): 1.89
pciBusID: 0000:01:00.0
totalMemory: 7.73GiB freeMemory: 6.71GiB
2023-11-16 16:39:44.495520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-11-16 16:39:44.813628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-11-16 16:39:44.813650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2023-11-16 16:39:44.813654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2023-11-16 16:39:44.813722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6453 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-11-16 16:39:44.814869: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9

add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2023-11-16 16:39:44.815850: I tensorflow/core/common_runtime/placer.cc:927] add: (Add)/job:localhost/replica:0/task:0/device:GPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0
2023-11-16 16:39:44.815863: I tensorflow/core/common_runtime/placer.cc:927] init: (NoOp)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2023-11-16 16:39:44.815868: I tensorflow/core/common_runtime/placer.cc:927] a: (Const)/job:localhost/replica:0/task:0/device:CPU:0
b: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2023-11-16 16:39:44.815871: I tensorflow/core/common_runtime/placer.cc:927] b: (Const)/job:localhost/replica:0/task:0/device:CPU:0
[2. 4. 6.]

Output like the above more or less means everything is OK.
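
Rather than scanning that log by eye, the placement lines can be collected into a mapping from op to device. This is a rough sketch that assumes the un-timestamped "name: (Type): device" line format seen in the output above; PLACEMENT and op_placements are names of my own.

```python
import re

# Matches lines such as "add: (Add): /job:localhost/.../device:GPU:0".
PLACEMENT = re.compile(r"^(?P<op>[^:\s]+): \((?P<kind>\w+)\): (?P<device>/job:\S+)$")


def op_placements(log_text):
    """Map each op name to the device reported in log_device_placement output."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT.match(line.strip())
        if match:
            placements[match.group("op")] = match.group("device")
    return placements


log = """\
add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:CPU:0
b: (Const): /job:localhost/replica:0/task:0/device:CPU:0
"""
for op, device in op_placements(log).items():
    print(op, "->", device)
```

Here the add op lands on GPU:0 while the two constants stay on CPU:0, which matches the tf.device blocks in the test script.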

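The environment is now fully configured; next comes running the code.

2. Code setup

2.1 Code file structure

Whatever the case, compile the ops under tf_ops first: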

cd ./tf_ops/2d_conv_random_k
sh fused_conv.sh
cd ../2d_conv_select_k
sh fused_conv.sh
cd ..

2.2 Training

Before training, first download the data from KITTI:

The KITTI Vision Benchmark Suite https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_velodyne.zip

Downloading with Xunlei (迅雷) can be faster.

After downloading, organize the data as required:

data_root
├── 00
│ ├── velodyne
│ ├── calib.txt
├── 01
├── ...
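
Before launching training it can save time to confirm that the folder really follows this layout. The sketch below checks each sequence for a velodyne directory and a calib.txt file; missing_entries is a hypothetical helper, and the temporary directory only stands in for a real data_root.

```python
import tempfile
from pathlib import Path


def missing_entries(data_root, sequences):
    """List expected paths (velodyne/ and calib.txt per sequence) that are absent."""
    root = Path(data_root)
    missing = []
    for seq in sequences:
        seq_dir = root / "{:02d}".format(seq)  # sequence folders are named 00, 01, ...
        if not (seq_dir / "velodyne").is_dir():
            missing.append(str(seq_dir / "velodyne"))
        if not (seq_dir / "calib.txt").is_file():
            missing.append(str(seq_dir / "calib.txt"))
    return missing


# Demo on a throwaway directory: sequence 00 is complete, sequence 01 is absent.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "00" / "velodyne").mkdir(parents=True)
    (Path(tmp) / "00" / "calib.txt").write_text("")
    demo_missing = missing_entries(tmp, [0, 1])
print(demo_missing)
```

For the real data, pass the data_root used in command_train.sh together with the sequence numbers from the train, validation and test lists.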

Open the file command_train.sh in the working directory.

First settle the following parameters:

mode(train)

GPU

model(path to PWCLONet model)

data_root

log_dir

train_list(sequences for training)

val_list(sequences for validation)

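The training results and the best model are stored in log_dir.

There is one big pitfall: the GPU in command_train.sh must be changed to the index of your own GPU. My laptop has only one GPU, so nothing ran at first. The data path data_root has to be changed as well.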

python main.py \
    --mode train \
    --gpu 0 \
    --model pwclo_model \
    --data_root /home/test/kittidata/data_root \
    --checkpoint_path ./pretrained_model/pretrained_model.ckpt \
    --log_dir Efficient-LOnet_log_ \
    --result_dir result \
    --train_list 0 1 2 3 4 5 6 \
    --val_list 7 8 9 10 \
    --test_list 0 1 2 3 4 5 6 7 8 9 10 \
    --num_H_input 64 \
    --num_W_input 1800 \
    --max_epoch 1000 \
    --learning_rate 0.001 \
    --batch_size 8 \
    > Efficient-LOnet_log.txt 2>&1 &

Then run:

sh command_train.sh

At this point the GPU was loaded successfully:

2023-11-16 16:37:31.599521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-11-16 16:37:31.674593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-11-16 16:37:31.674679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 4060 Laptop GPU major: 8 minor: 9 memoryClockRate(GHz): 1.89

pciBusID: 0000:01:00.0
totalMemory: 7.73GiB freeMemory: 6.84GiB
2023-11-16 16:37:31.674687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-11-16 16:37:32.067483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-11-16 16:37:32.067517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2023-11-16 16:37:32.067522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2023-11-16 16:37:32.067607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6560 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9)
pid: 46820

But then an error was raised:

**** EPOCH 000 ****
Traceback (most recent call last):
  File "main.py", line 605, in <module>
    main(MODE)
  File "main.py", line 224, in main
    train_one_epoch(sess, ops, train_writer, train_list = TRAIN_LIST)
TypeError: train_one_epoch() got an unexpected keyword argument 'train_list'

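Following the error message, I found that the author made a mistake when defining the function.

In main.py, this line:

def train_one_epoch(sess, ops, train_writer):

is missing a parameter and should be changed to:

def train_one_epoch(sess, ops, train_writer, train_list):

Strictly speaking the function body never uses train_list, so the parameter only needs to exist for the call to succeed.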
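
The failure can be reproduced without TensorFlow. In this small sketch both functions have placeholder bodies rather than the real training loop, and the train_list=None default is my own addition so that existing positional calls would keep working.

```python
def train_one_epoch_old(sess, ops, train_writer):
    """Signature as quoted from main.py; the body here is a placeholder."""
    return "trained"


def train_one_epoch_fixed(sess, ops, train_writer, train_list=None):
    """Same placeholder body; the extra parameter only has to be accepted."""
    return "trained"


# Calling the old signature with the keyword used in main() raises TypeError.
try:
    train_one_epoch_old(None, None, None, train_list=[0, 1, 2])
    old_error = None
except TypeError as exc:
    old_error = str(exc)

print(old_error)
print(train_one_epoch_fixed(None, None, None, train_list=[0, 1, 2]))
```

Dropping the train_list keyword at the call site would work just as well, since the function ignores the value either way.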

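While running, my laptop turned out not to have enough GPU memory and hit a resource exhausted error, so this probably has to run on a server.

On 2023-11-27 I ran it on a server with an RTX 4090, but memory still blew up, so I started reading the code to find where the memory problem comes from.

Following the error message, I started the analysis from the feed_dict variable of train_one_epoch in main.py.

The tensor reported in the error has shape [8, 3600, 6, 64].

Judging from the variables, however, the problem does not lie in the variables inside ops or feed_dict.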
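
A quick size estimate for the reported tensor supports that reading. The sketch below assumes float32 elements (4 bytes each), which I have not confirmed from the code; tensor_bytes is a hypothetical helper.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor; 4 bytes per element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element


shape = [8, 3600, 6, 64]
size = tensor_bytes(shape)
print("{} bytes = {:.1f} MiB".format(size, size / 2**20))
```

Under that assumption the tensor is only about 42 MiB, far below the memory of either GPU. TensorFlow's out-of-memory message names the tensor it was allocating when memory ran out, which is not necessarily the one that consumed most of it, so the real cause is more likely the accumulated intermediate tensors of the whole graph at batch size 8.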