0. Prerequisites
The official EfficientLO page lists the following configuration:
python 3.6.8
CUDA 9.0
TensorFlow 1.12.0
numpy 1.16.1
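One way to confirm that an environment matches this list is to compare the versions it reports against the pinned ones. The sketch below is only an illustration: the REQUIRED table restates the Python, TensorFlow and numpy entries from the list above, and the helper names (matches, installed_version, report) are made up for this example.

```python
import importlib
import sys

# Pinned versions restated from the list above.
REQUIRED = {"python": "3.6.8", "tensorflow": "1.12.0", "numpy": "1.16.1"}


def matches(found, wanted):
    """True when `found` equals `wanted` or extends it (e.g. '1.12.0' vs '1.12')."""
    found_parts = found.split(".")
    wanted_parts = wanted.split(".")
    return found_parts[: len(wanted_parts)] == wanted_parts


def installed_version(name):
    """Return the version string of `name`, or None if it cannot be imported."""
    if name == "python":
        return ".".join(str(part) for part in sys.version_info[:3])
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def report(required=REQUIRED):
    """Map each requirement to a (found_version, ok_flag) pair."""
    result = {}
    for name, wanted in required.items():
        found = installed_version(name)
        result[name] = (found, found is not None and matches(found, wanted))
    return result


if __name__ == "__main__":
    for name, (found, ok) in report().items():
        print("{:12} wanted {:8} found {:10} {}".format(
            name, REQUIRED[name], str(found), "OK" if ok else "MISMATCH"))
```

CUDA is left out on purpose, since it is not a Python package; nvcc -V covers that check.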
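1. Environment setup
A Docker environment with NVIDIA GPU support has to be started with
nvidia-docker run
See the third point of this article:
【杂谈】如何应对烦人的开源库版本依赖-做一个心平气和的程序员?
That article is where I learned how to set up the Dockerfile.
The docker build step, however, ran into a small problem.
After entering
docker build -t efflotry .
(efflotry is the image name), the build failed with an error.
Solution:
【Ubuntu docker运行dockerfile时报错】GPG error_尽量不拖延的小王的博客-CSDN博客
Add the following to the Dockerfile, before the line that triggers the error: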
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
The final Dockerfile was:
FROM nvidia/cuda:9.0-cudnn7-runtime
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
RUN apt-get update && \
apt-get install -y --no-install-recommends \
libcudnn7=7.0.5.15-1+cuda9.0 && \
apt-mark hold libcudnn7 && \
rm -rf /var/lib/apt/lists/*
RUN pip install tensorflow-gpu==1.12.0
But the build still failed, with the following error:
error: failed to solve: process "/bin/sh *************" did not complete successfully: exit code: 100
Searching for this error turned up:
linux-qemu: uncaught target signal 11 (Segmentation fault) - 糯米PHP
It suggests working around the problem with buildx. Compiling QEMU for that then failed with:
ERROR: glib-2.56 gthread-2.0 is required to compile QEMU
Solution:
sudo apt install libglib2.0-dev
After that, build the image with the buildx command:
docker buildx build . --platform linux/amd64 -t effilotry:v1
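All of the above turned out to be wasted effort...
After a lot of fiddling, the following article finally got nvidia-smi working inside Docker:
Ubuntu下 NVIDIA Container Runtime 安装与使用_nvidia-container-runtime 安装_MAVER1CK的博客-CSDN博客
Use NVIDIA_VISIBLE_DEVICES to enable all GPUs:
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda nvidia-smi
The container needed later on is started with: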
docker run -it -p 5900:5900 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /home/zyz/EfficientLO_docker:/home/test --name effilotry nvidia/cuda:9.0-cudnn7-runtime
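However, my machine runs the nvidia-525 driver, so the CUDA version may not match? I will look into this later, since the image I pulled is the CUDA 9.0 one.
Next, install Python:
I. Download the Python 3.6.8 source archive (any download directory works; the recommended install path is /usr/local/python3)
1. Create the install path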
mkdir -p /usr/local/python3
2. Download the Python source
# working directory: /home/worker/${name}
# if the official source is slow, https://registry.npmmirror.com/-/binary/python/3.6.8/Python-3.6.8.tgz can be used instead
wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tgz
3. Extract the source archive
# working directory: /home/worker/${name}
tar -zxvf Python-3.6.8.tgz
II. Install Python 3.6.8
1. Configure the build
# working directory: /home/worker/${name}/Python-3.6.8/
./configure --prefix=/usr/local/python3
2. Build and install Python 3.6.8
# working directory: /home/worker/${name}/Python-3.6.8
make && make install
3. Create symlinks
ln -s /usr/local/python3/bin/python3 /usr/bin/python3
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
4. Verify the installation
python3 -V
If it prints the following, the installation succeeded:
Python 3.6.8
Install TensorFlow
Install pip first:
linux下提示:pip未找到命令(bash: pip: command not found)_-bash: pip: command not found-CSDN博客
pip install tensorflow_gpu-1.12.0-cp36-cp36m-manylinux1_x86_64.whl
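I was at a loss by this point; better to set things up from a plain Ubuntu 18 after all...
I wanted to take a shortcut and use an image that someone else had already configured: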
docker pull pytorch/pytorch:0.4.1-cuda9-cudnn7-runtime
This person's image is another option (please let me know if this infringes anything):
docker pull muyeby/py3.6.8cu90torch1.0.0:latest
Thank heavens, in the end it was this person's image on Docker Hub that saved me:
docker pull pengjl929/cu90-py36-tf112-snt129:v0.1
Pull it as shown, then run the following command.
docker run -it -p 5900:5900 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /home/zyz/EfficientLO_docker:/home/test --name effilotry pengjl929/cu90-py36-tf112-snt129:v0.1
Later I tried this command instead (the final version; this is the one I use):
docker run -t -i --gpus all -p 5900:5900 -v /home/zyz/EfficientLO_docker:/home/test --name effilotry pengjl929/cu90-py36-tf112-snt129:v0.1
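The two commands above differ only in how the GPU is exposed: the older form uses --runtime=nvidia together with NVIDIA_VISIBLE_DEVICES=all, while the final one uses Docker's --gpus all option. As a rough sketch, the argument list can be assembled in Python so that both variants stay in one place; the function name docker_run_args and its parameters are invented for this example.

```python
import shlex


def docker_run_args(image, name, host_dir, container_dir, port=5900, use_gpus_flag=True):
    """Build the argv for the `docker run` commands used above."""
    args = ["docker", "run", "-t", "-i"]
    if use_gpus_flag:
        # Final variant: Docker's built-in --gpus option.
        args += ["--gpus", "all"]
    else:
        # Earlier variant: the nvidia runtime plus an environment variable.
        args += ["--runtime=nvidia", "-e", "NVIDIA_VISIBLE_DEVICES=all"]
    args += ["-p", "{0}:{0}".format(port)]
    args += ["-v", "{}:{}".format(host_dir, container_dir)]
    args += ["--name", name, image]
    return args


if __name__ == "__main__":
    argv = docker_run_args(
        image="pengjl929/cu90-py36-tf112-snt129:v0.1",
        name="effilotry",
        host_dir="/home/zyz/EfficientLO_docker",
        container_dir="/home/test",
    )
    # Print a copy-pasteable command; subprocess.run(argv) would launch it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing use_gpus_flag=False reproduces the flags of the earlier --runtime=nvidia form.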
Use the following commands to check the TensorFlow and CUDA versions:
# check the TensorFlow version
pip list | grep tenso
# check the CUDA version
nvcc -V
Verify that TensorFlow works:
import tensorflow as tf
sess = tf.Session()
a = tf.constant(1)
b = tf.constant(1)
print(sess.run(a+b))
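Loading the GPU takes a while at this step. I thought it had failed and wasted some time on it, but it only needed a short wait.
Finally, install numpy 1.16.1: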
pip install numpy==1.16.1
Test the GPU:
First use these two lines to look up the GPU name:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
Output:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2141397179551110845
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17703738136983886232
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 15381415839974514119
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7270442599
locality {
bus_id: 1
links {
}
}
incarnation: 5021056047332816420
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9"
]
This shows that the GPU is being detected.
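The list also contains CPU and XLA entries, so it can help to filter it down to the physical GPU. A minimal sketch, assuming each entry exposes the name and device_type fields printed above; the Device namedtuple is only a stand-in so that the example runs without TensorFlow.

```python
from collections import namedtuple

# Stand-in record with the two fields used here; the real entries returned by
# device_lib.list_local_devices() carry the same attribute names.
Device = namedtuple("Device", ["name", "device_type"])


def gpu_device_names(devices):
    """Return the names of physical GPU devices, skipping CPU and XLA entries."""
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    sample = [
        Device("/device:CPU:0", "CPU"),
        Device("/device:XLA_GPU:0", "XLA_GPU"),
        Device("/device:XLA_CPU:0", "XLA_CPU"),
        Device("/device:GPU:0", "GPU"),
    ]
    print(gpu_device_names(sample))
```

With TensorFlow installed, gpu_device_names(device_lib.list_local_devices()) should work the same way; an empty list would mean no usable GPU was found.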
Then run a test to check that computation actually runs on the GPU:
import tensorflow as tf
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
    b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:0'):
    c = a + b
# Note: allow_soft_placement=True lets TensorFlow pick the device itself; without it, an error is raised.
# Not every op can be placed on the GPU, and forcing such an op onto the GPU fails.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(tf.global_variables_initializer())
print(sess.run(c))
2023-11-16 16:39:44.495512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 4060 Laptop GPU major: 8 minor: 9 memoryClockRate(GHz): 1.89
pciBusID: 0000:01:00.0
totalMemory: 7.73GiB freeMemory: 6.71GiB
2023-11-16 16:39:44.495520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-11-16 16:39:44.813628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-11-16 16:39:44.813650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2023-11-16 16:39:44.813654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2023-11-16 16:39:44.813722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6453 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-11-16 16:39:44.814869: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9
add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2023-11-16 16:39:44.815850: I tensorflow/core/common_runtime/placer.cc:927] add: (Add)/job:localhost/replica:0/task:0/device:GPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0
2023-11-16 16:39:44.815863: I tensorflow/core/common_runtime/placer.cc:927] init: (NoOp)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2023-11-16 16:39:44.815868: I tensorflow/core/common_runtime/placer.cc:927] a: (Const)/job:localhost/replica:0/task:0/device:CPU:0
b: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2023-11-16 16:39:44.815871: I tensorflow/core/common_runtime/placer.cc:927] b: (Const)/job:localhost/replica:0/task:0/device:CPU:0
[2. 4. 6.]
Output like the above roughly means everything is OK.
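Rather than reading the placement log by eye, the lines of the form "a: (Const): /job:.../device:CPU:0" can be parsed to see where each op landed. This is a small sketch whose regular expression is written against the log format shown above; the name parse_placements is made up for this example.

```python
import re

# Matches lines such as "add: (Add): /job:localhost/replica:0/task:0/device:GPU:0".
PLACEMENT = re.compile(r"^(?P<op>[\w/]+): \((?P<kind>\w+)\): (?P<device>/job:\S+)$")


def parse_placements(log_text):
    """Map each op name to the device type ('GPU' or 'CPU') it was placed on."""
    placements = {}
    for line in log_text.splitlines():
        match = PLACEMENT.match(line.strip())
        if match:
            # The device string ends in e.g. "device:GPU:0"; keep the type part.
            placements[match.group("op")] = match.group("device").split(":")[-2]
    return placements


if __name__ == "__main__":
    sample = "\n".join([
        "add: (Add): /job:localhost/replica:0/task:0/device:GPU:0",
        "init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0",
        "a: (Const): /job:localhost/replica:0/task:0/device:CPU:0",
        "b: (Const): /job:localhost/replica:0/task:0/device:CPU:0",
    ])
    print(parse_placements(sample))
```

For the run above, the point of interest is that add is mapped to GPU while the constants a and b stay on the CPU, which is what the tf.device blocks asked for.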
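At this point the environment is fully set up; next, run the code.
2. Code setup
2.1 Code file structure
Before anything else, compile the ops under tf_ops.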
cd ./tf_ops/2d_conv_random_k
sh fused_conv.sh
cd ../2d_conv_select_k
sh fused_conv.sh
cd ..
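2.2 Training
Before training, first download the data from KITTI:
The KITTI Vision Benchmark Suite: https://s3.eu-central-1.amazonaws.com/avg-kitti/data_odometry_velodyne.zip
Downloading with Xunlei (迅雷) can be faster.
After downloading, organize the data in the required layout: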
data_root
├── 00
│ ├── velodyne
│ ├── calib.txt
├── 01
├── ...
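Before launching a long training run, it may be worth checking that every sequence folder matches this layout. A minimal sketch, assuming sequence folders are named with two digits (00, 01, ...) as in the tree above; the function name missing_entries is invented for this example.

```python
from pathlib import Path


def missing_entries(data_root, sequences):
    """List expected paths (velodyne/ and calib.txt per sequence) that are absent."""
    root = Path(data_root)
    missing = []
    for seq in sequences:
        seq_dir = root / "{:02d}".format(seq)
        if not (seq_dir / "velodyne").is_dir():
            missing.append(str(seq_dir / "velodyne"))
        if not (seq_dir / "calib.txt").is_file():
            missing.append(str(seq_dir / "calib.txt"))
    return missing


if __name__ == "__main__":
    # Replace "data_root" with the real dataset path; 0..10 are the
    # sequence numbers used in the lists of command_train.sh.
    problems = missing_entries("data_root", range(11))
    if problems:
        print("Missing:")
        for path in problems:
            print("  " + path)
    else:
        print("Layout looks complete.")
```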
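Open the command_train.sh file in the working directory.
The following parameters need to be set first:
mode (train)
GPU
model (path to PWCLONet model)
data_root
log_dir
train_list (sequences for training)
val_list (sequences for validation)
The training results and the best model are saved in log_dir.
One big pitfall: the GPU value in command_train.sh has to be changed to the index of your own GPU. My laptop has only one GPU, so training would not start at first. The data path data_root needs to be changed as well.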
python main.py \
--mode train \
--gpu 0 \
--model pwclo_model \
--data_root /home/test/kittidata/data_root \
--checkpoint_path ./pretrained_model/pretrained_model.ckpt \
--log_dir Efficient-LOnet_log_ \
--result_dir result \
--train_list 0 1 2 3 4 5 6 \
--val_list 7 8 9 10 \
--test_list 0 1 2 3 4 5 6 7 8 9 10 \
--num_H_input 64 \
--num_W_input 1800 \
--max_epoch 1000 \
--learning_rate 0.001 \
--batch_size 8 \
> Efficient-LOnet_log.txt 2>&1 &
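Flags such as --train_list 0 1 2 3 4 5 6 pass several space-separated values to one option. The sketch below shows how such flags are commonly read with argparse; it is a hypothetical parser written for illustration, not the one in main.py, and the overlap helper is an extra check invented here.

```python
import argparse


def build_parser():
    """Hypothetical parser covering a few of the flags in command_train.sh."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", default="train")
    parser.add_argument("--gpu", type=int, default=0)
    parser.add_argument("--data_root", default="")
    # nargs="+" collects space-separated values such as "0 1 2" into a list.
    parser.add_argument("--train_list", type=int, nargs="+", default=[])
    parser.add_argument("--val_list", type=int, nargs="+", default=[])
    parser.add_argument("--batch_size", type=int, default=8)
    return parser


def overlap(args):
    """Sequences that appear in both the training and validation lists."""
    return sorted(set(args.train_list) & set(args.val_list))


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--gpu", "0", "--train_list", "0", "1", "2", "3", "4", "5", "6",
         "--val_list", "7", "8", "9", "10"])
    print(args.train_list, args.val_list, overlap(args))
```

An empty overlap list means the training and validation sequences are disjoint, as they are in the command above.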
Then run:
sh command_train.sh
At this point the GPU is loaded successfully:
2023-11-16 16:37:31.599521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-11-16 16:37:31.674593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-11-16 16:37:31.674679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 4060 Laptop GPU major: 8 minor: 9 memoryClockRate(GHz): 1.89
pciBusID: 0000:01:00.0
totalMemory: 7.73GiB freeMemory: 6.84GiB
2023-11-16 16:37:31.674687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2023-11-16 16:37:32.067483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-11-16 16:37:32.067517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2023-11-16 16:37:32.067522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2023-11-16 16:37:32.067607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6560 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 4060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9)
pid: 46820
But then an error is raised:
**** EPOCH 000 ****
Traceback (most recent call last):
File "main.py", line 605, in <module>
main(MODE)
File "main.py", line 224, in main
train_one_epoch(sess, ops, train_writer, train_list = TRAIN_LIST)
TypeError: train_one_epoch() got an unexpected keyword argument 'train_list'
Going by the error message, it turned out that the author's function definition was wrong.
In main.py, this line:
def train_one_epoch(sess, ops, train_writer):
is missing one parameter and should be changed to:
def train_one_epoch(sess, ops, train_writer, train_list):
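That said, whether train_list is added here makes no difference to the training itself, because the function body never uses it; removing train_list = TRAIN_LIST from the call would work just as well.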
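The error can be reproduced in isolation, which makes the cause easier to see: a keyword argument is passed to a function whose signature does not declare it. The two stub functions below only mimic the signatures; their bodies are placeholders and not the repository's code.

```python
def train_one_epoch_old(sess, ops, train_writer):
    """Signature as originally defined: no train_list parameter."""
    return "trained"


def train_one_epoch_fixed(sess, ops, train_writer, train_list):
    """Patched signature: accepts train_list even though the body ignores it."""
    return "trained"


def call_with_train_list(func):
    """Call func with the same keyword argument main() uses and report the outcome."""
    try:
        return func(None, None, None, train_list=[0, 1, 2])
    except TypeError as exc:
        return "TypeError: {}".format(exc)


if __name__ == "__main__":
    print(call_with_train_list(train_one_epoch_old))
    print(call_with_train_list(train_one_epoch_fixed))
```

The first call reports the unexpected keyword argument, while the second returns normally.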
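During training, my laptop ran out of GPU memory and hit a resource exhausted error, so this probably has to run on a server.
On 2023.11.27 I ran it on a server with an RTX 4090, but memory still blew up, so I started reading the code to find where the memory problem comes from.
Following the error message, I started the analysis from the feed_dict variable of train_one_epoch in main.py.
The tensor that triggered the error has shape [8,3600,6,64].
Judging by the variables, though, the problem does not come from the variables in ops or feed_dict.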
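A quick size estimate helps put the [8,3600,6,64] tensor in perspective. The sketch below assumes 4 bytes per element (float32), which the log does not confirm; the helper name tensor_bytes is made up for this example.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element


if __name__ == "__main__":
    shape = [8, 3600, 6, 64]
    size = tensor_bytes(shape)
    print("{} -> {} bytes ({:.2f} MiB)".format(shape, size, size / 2 ** 20))
```

Under that assumption the tensor takes 44,236,800 bytes, roughly 42 MiB, which is small compared with the memory of either GPU. An out-of-memory message names the allocation that failed, not necessarily the largest consumer, so this tensor is probably just the one being allocated when memory had already run out. The leading 8 presumably corresponds to --batch_size 8, in which case the footprint scales linearly with the batch size.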