LARC DL Notes (2): Training on Your Own Images


This is bill


Category: Machine Learning. Tags: scripts, deep-learning, caffe

Article link: https://blog.csdn.net/Scythe666/article/details/76611084


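Following my read-through of the official CS231n notes (贺完结！CS231n官方笔记),
last time I got Caffe's bundled examples, mnist and cifar10, running.
Those still rely on the scripts shipped with Caffe, though, so this time I set out to train on my own images.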

0. Goal

Prepare three classes of food images (for data-security reasons I use the public food101 dataset).

Each class has 1000 images that have not been resized.

The task now is:

Classify these three classes of food.

Working through this small task should be enough to get comfortable with Caffe.

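A few small questions:

(1) In the list file, the number after each image path just has to differ between classes, since it only denotes the class, right?

Answer: In ML, classes are always represented by numbers, and the numbers must be consecutive. The softmax loss requires this, because the label is used as an index into the network's output vector, so a gap in the numbering breaks the computation.

They must also start from 0: with only three classes the labels are 0, 1, 2.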

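(2) For train, val and test, say I have three classes of food: what should the folder structure look like? Or does the folder layout not matter, as long as the txt files give the paths?

It does not matter; specifying the paths in the txt files is enough.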

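(3) If the split is train 25%, val 25%, test 50%, does each of the three classes follow that same split? For example, is apple pie itself divided into train 25%, val 25%, test 50%?

It does not matter; the classes do not have to be split in strictly equal proportions.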
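
The label rule from question (1) can be checked before building the database. This is a minimal sketch, assuming a list file in the "path label" format used in this post. The helper name check_labels is hypothetical and not part of Caffe.

```shell
#!/usr/bin/env sh
# check_labels LISTFILE (hypothetical helper, not part of Caffe)
# Succeeds only if the labels in the last column are exactly 0, 1, ..., K-1.
check_labels() {
    awk '
        { seen[$NF] = 1; n_lines++ }
        END {
            k = 0
            for (label in seen) k++          # number of distinct labels
            for (i = 0; i < k; i++) {
                if (!(i in seen)) {
                    printf "missing label %d\n", i
                    exit 1
                }
            }
            printf "%d classes, %d images, labels 0..%d\n", k, n_lines, k - 1
        }
    ' "$1"
}

# Example: check_labels train.txt
```

With K distinct labels, any gap (for example 0 and 2 but no 1) makes the function report the missing label and return a non-zero status.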

1. createList

fileList=`ls`
for file in $fileList;do
echo $file
done

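Along the way, the shell script kept failing to run on the server; editing it directly on the server with vim fixed the problem. Solution recommended by Zhang Jialiang: https://stackoverflow.com/questions/5491634/shell-script-error-expecting-do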
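
One common cause of this kind of failure is Windows-style (CRLF) line endings in a script edited on another machine. The post does not say that this was the cause here, so treat it as an assumption. Under that assumption, a small sketch for detecting and stripping the carriage returns; the helper names has_cr and strip_cr are hypothetical.

```shell
#!/usr/bin/env sh
# strip_cr FILE: print FILE with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}

# has_cr FILE: succeed if FILE contains at least one carriage return.
has_cr() {
    [ $(tr -cd '\r' < "$1" | wc -c) -gt 0 ]
}

# Example:
# has_cr create_list.sh && strip_cr create_list.sh > create_list.unix.sh
```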

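Should test.txt carry labels?

Answer: If you want to compute accuracy, it certainly needs labels, in the same format as train.txt. For the details, read the cpp that does the data reading; if the data is read from a txt file, that should be the cpp of a data layer.

Sometimes the data reading is written by hand, when the data is more complex. In detection, for example, an image does not have a single label but many bounding boxes, and you replace the data layer cpp.

The shell script below generates train.txt, val.txt and test.txt. It is not general, and the image paths written to the txt files should not be absolute paths: they should be paths that work together with the create_lmdb script used later.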

#!/usr/bin/env sh
# by Bill

DATA=`pwd`
echo "Create train.txt..."
fileList0=`ls $DATA/train/apple_pie`
fileList1=`ls $DATA/train/baby_back_ribs`
fileList2=`ls $DATA/train/caesar_salad`
for file in $fileList0;do
echo apple_pie/$file | sed "s/$/ 0/">>$DATA/train.txt
done
for file in $fileList1;do
echo baby_back_ribs/$file | sed "s/$/ 1/">>$DATA/train.txt
done
for file in $fileList2;do
echo caesar_salad/$file | sed "s/$/ 2/">>$DATA/train.txt
done

echo "Create val.txt..."
fileList0=`ls $DATA/val/apple_pie`
fileList1=`ls $DATA/val/baby_back_ribs`
fileList2=`ls $DATA/val/caesar_salad`
for file in $fileList0;do
echo apple_pie/$file | sed "s/$/ 0/">>$DATA/val.txt
done
for file in $fileList1;do
echo baby_back_ribs/$file | sed "s/$/ 1/">>$DATA/val.txt
done
for file in $fileList2;do
echo caesar_salad/$file | sed "s/$/ 2/">>$DATA/val.txt
done

echo "Create test.txt..."
fileList0=`ls $DATA/test/apple_pie`
fileList1=`ls $DATA/test/baby_back_ribs`
fileList2=`ls $DATA/test/caesar_salad`
for file in $fileList0;do
echo apple_pie/$file | sed "s/$/ 0/">>$DATA/test.txt
done
for file in $fileList1;do
echo baby_back_ribs/$file | sed "s/$/ 1/">>$DATA/test.txt
done
for file in $fileList2;do
echo caesar_salad/$file | sed "s/$/ 2/">>$DATA/test.txt
done
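
Since the script above is not general, here is a sketch of a looped version. It assumes one sub-directory per class under each split directory and assigns labels in alphabetical order of the directory names, which for this post gives apple_pie 0, baby_back_ribs 1, caesar_salad 2. The helper name make_list is hypothetical. Unlike the script above, it truncates the output file first, so a rerun does not append duplicate lines.

```shell
#!/usr/bin/env sh
# make_list SPLIT_DIR OUT_FILE (hypothetical helper)
# Writes one "class_dir/file label" line per image. Labels are assigned
# 0, 1, 2, ... in the glob (alphabetical) order of the class directories.
make_list() {
    split_dir=$1
    out_file=$2
    label=0
    : > "$out_file"                       # truncate so reruns do not append
    for class_dir in "$split_dir"/*/; do
        [ -d "$class_dir" ] || continue
        class_name=$(basename "$class_dir")
        for img in "$class_dir"*; do
            [ -f "$img" ] || continue
            echo "$class_name/$(basename "$img") $label" >> "$out_file"
        done
        label=$((label + 1))
    done
}

# Example (paths follow the layout used in this post):
# for split in train val test; do make_list "$PWD/$split" "$PWD/$split.txt"; done
```

The paths written are relative to the split directory, so they match a root folder such as TRAIN_DATA_ROOT or VAL_DATA_ROOT.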

2. Converting the images to LMDB

Modified from examples/imagenet/create_imagenet.sh

#!/usr/bin/env sh
# Create the imagenet lmdb inputs
# By Bill
set -e

DBNAME=.
ListPath=.
TOOLS=/home/hwang/hwang/caffe-master/build/tools

TRAIN_DATA_ROOT=/home/hwang/hwang/dataset/whFoodTrainTest/train/
VAL_DATA_ROOT=/home/hwang/hwang/dataset/whFoodTrainTest/val/

# Set RESIZE=true to resize the images to 256x256. Leave as false if images have
# already been resized using another tool.
RESIZE=true
if $RESIZE; then
  RESIZE_HEIGHT=256
  RESIZE_WIDTH=256
else
  RESIZE_HEIGHT=0
  RESIZE_WIDTH=0
fi

if [ ! -d "$TRAIN_DATA_ROOT" ]; then
  echo "Error: TRAIN_DATA_ROOT is not a path to a directory: $TRAIN_DATA_ROOT"
  echo "Set the TRAIN_DATA_ROOT variable in create_imagenet.sh to the path" \
       "where the ImageNet training data is stored."
  exit 1
fi

if [ ! -d "$VAL_DATA_ROOT" ]; then
  echo "Error: VAL_DATA_ROOT is not a path to a directory: $VAL_DATA_ROOT"
  echo "Set the VAL_DATA_ROOT variable in create_imagenet.sh to the path" \
       "where the ImageNet validation data is stored."
  exit 1
fi

echo "Creating train lmdb..."

GLOG_logtostderr=1 $TOOLS/convert_imageset \
    --resize_height=$RESIZE_HEIGHT \
    --resize_width=$RESIZE_WIDTH \
    --shuffle \
    $TRAIN_DATA_ROOT \
    $ListPath/train.txt \
    $DBNAME/whAlexNet_train_lmdb

echo "Creating val lmdb..."

GLOG_logtostderr=1 $TOOLS/convert_imageset \
    --resize_height=$RESIZE_HEIGHT \
    --resize_width=$RESIZE_WIDTH \
    --shuffle \
    $VAL_DATA_ROOT \
    $ListPath/val.txt \
    $DBNAME/whAlexNet_val_lmdb

echo "Done."

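Under the hood this calls Caffe's convert_imageset [FLAGS] ROOTFOLDER/ LISTFILE DB_NAME [2]
It takes four arguments:

FLAGS: the group of option flags

ROOTFOLDER/: the absolute path of the directory holding the images, starting from the Linux root directory

LISTFILE: the image list file, usually a txt file with one image per line (a path relative to ROOTFOLDER, followed by its label)

DB_NAME: the directory where the generated db is stored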

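The FLAGS group contains the following:

-gray: whether to open the images as grayscale. The program opens images with the imread() function from the OpenCV library. Default is false

-shuffle: whether to randomly shuffle the image order. Default is false

-backend: the db format to convert to, either leveldb or lmdb. Default is lmdb

-resize_width/resize_height: change the image size. All images must have the same size at run time, so they need to be resized. The program scales images with the resize() function from the OpenCV library. Default is 0, meaning no change

-check_size: check whether all data items have the same size. Default is false, meaning no check

-encoded: whether to store the encoded original image in the final data. Default is false

-encode_type: goes with the previous flag and chooses the format to encode the images in: 'png', 'jpg', ...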

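3. Computing and saving the mean

Subtracting the mean from the images before training generally improves training speed and accuracy, so this step is usually performed.

Caffe provides a program for computing the mean, compute_image_mean.cpp, which we can use directly.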

# sudo build/tools/compute_image_mean examples/myfile/img_train_lmdb examples/myfile/mean.binaryproto

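compute_image_mean takes two arguments: the first is the location of the training LMDB, and the second sets the name and save path of the mean file.
After a successful run, a mean file named mean.binaryproto is generated under examples/myfile/.

The model needs the mean subtracted from every image, so we have to obtain the mean of the training images, which tools/compute_image_mean.cpp does. This cpp is also a good example for getting familiar with how several components are used together, such as protocol buffers, leveldb, and logging.
The shell code below is modified from examples/imagenet/make_imagenet_mean.sh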

#!/usr/bin/env sh
# Compute the mean image from the imagenet training lmdb
# By Bill

DBNAME=.
ListPath=.
TOOLS=/home/hwang/hwang/caffe-master/build/tools

$TOOLS/compute_image_mean $DBNAME/whAlexNet_train_lmdb \
  $ListPath/whAlexNet_mean.binaryproto

echo "Done."


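4. Defining the network

The AlexNet model is defined in models/bvlc_alexnet/train_val.prototxt. Note that the paths of the training and test datasets in that file must be changed to where they are actually stored on the server.
The training parameters are defined in models/bvlc_alexnet/solver.prototxt

The main change is to the file paths in each data layer.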

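If you look closely at the train and val parts of train_val.prototxt, you will see that they are largely the same apart from the data source and the last layer. During training, the softmax-loss layer is used to compute the loss function and to initialize backpropagation, while during validation an accuracy layer is used to measure accuracy.

There is also the solver definition, solver.prototxt. Copy it over and change the path on the first line to our own: net: "examples/mydata/train_val.prototxt". Reading through it, we see that training runs in batches of 256 for a total of 450,000 iterations (about 90 epochs); every 1,000 iterations the learned network is tested on the validation data; the initial learning rate is 0.01 and is decreased every 100,000 iterations (about 20 epochs); information is displayed every 20 iterations; the weight_decay is 0.0005; and every 10,000 iterations a snapshot of the current state is saved.
The above is from the tutorial. In practice it would take a very long time, so we change it slightly: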
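test_iter: 1000 is the number of test batches; we have only 10 photos, so setting it to 10 is enough.
test_interval: 1000 means testing once every 1000 iterations; we change it to once every 500.
base_lr: 0.01 is the base learning rate; because the dataset is small, 0.01 would make the loss drop too fast, so change it to 0.001
lr_policy: "step" is the learning-rate schedule
gamma: 0.1 is the factor by which the learning rate changes
stepsize: 100000 means the learning rate is decreased every 100000 iterations
display: 20 means information is displayed every 20 iterations
max_iter: 450000 is the maximum number of iterations
momentum: 0.9 is a learning parameter; leave it unchanged
weight_decay: 0.0005 is a learning parameter; leave it unchanged
snapshot: 10000 means a snapshot is saved every 10000 iterations; here it is changed to 2000
solver_mode: GPU is a line added at the end, meaning training runs on the GPU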

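An explanation pasted from another blog:

net: "examples/my_simple_image/cifar/cifar10_quick_train_test.prototxt"   # path to the network file
test_iter: 20        # number of iterations run for each test
test_interval: 10    # test once every this many iterations
base_lr: 0.001       # learning rate, lowered by an order of magnitude here because there is little data
momentum: 0.9
weight_decay: 0.004
lr_policy: "fixed"   # use a fixed learning rate
display: 1           # display information every this many iterations; set to 1 here to follow progress closely
max_iter: 4000       # maximum number of iterations
snapshot: 1000       # save a snapshot every this many iterations
snapshot_prefix: "examples/my_simple_image/cifar/cifar10_quick"     # snapshot path and prefix
solver_mode: CPU     # CPU or GPU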

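For hyperparameter settings, see: https://zhuanlan.zhihu.com/p/27905191

Xiong Wei says:

(1) base_lr: 0.01
gamma: 0.1
stepsize: 100000
These three parameters are the most important

(2) In a conv layer, if these two numbers are set to 0, that layer's parameters stop being updated (lr_mult multiplies the learning rate and decay_mult multiplies the weight decay), so you can freeze some layers and train only the others
param {
lr_mult: 1
decay_mult: 1
}

(3) 25% for train, 25% for val, 50% for test? This split looks questionable

(4) How to set the number of iterations: total dataset size / batch size gives the iterations needed for one epoch. About 10 epochs is usually enough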
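
The rule in (4) is plain arithmetic, sketched below with the numbers from the second configuration in this post (1500 training images, batch size 16). The helper names iters_per_epoch and max_iter_for are hypothetical, and rounding the last partial batch up is my own choice.

```shell
#!/usr/bin/env sh
# iters_per_epoch DATASET_SIZE BATCH_SIZE: ceil(DATASET_SIZE / BATCH_SIZE)
iters_per_epoch() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# max_iter_for EPOCHS DATASET_SIZE BATCH_SIZE
max_iter_for() {
    echo $(( $1 * $(iters_per_epoch "$2" "$3") ))
}

iters_per_epoch 1500 16    # prints 94
max_iter_for 10 1500 16    # prints 940
```

By this rule, a max_iter of 10000 on 1500 images with batch size 16 corresponds to more than 100 epochs.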

For my own training data, I took a cue from [7]:

Example: caffe/examples/mnist/lenet_solver.prototxt
# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: CPU

My own settings: the training data has 3 classes with 20 food images each (60 training images in total); val has 10*3 = 30 images; test has 10*3 = 30 images

batch_size is set to 16

data_param {
    source: "/home/hwang/hwang/dataset/whFoodTrainTest/whAlexNet_train_lmdb"
    batch_size: 16
    backend: LMDB
  }

solver

net: "./train_val.prototxt"
test_iter: 4
test_interval: 500
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 1000
momentum: 0.9
weight_decay: 0.0005
snapshot: 2000
snapshot_prefix: "snapshot_prefix/caffe_alexnet_train"
solver_mode: GPU

My own settings 2
1500 (3*500) training images

Batch size of 16
The solver is as follows:

net: "./train_val.prototxt"
test_iter: 32
test_interval: 500
base_lr: 0.001
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 10000
momentum: 0.9
weight_decay: 0.0005
snapshot: 2000
snapshot_prefix: "snapshot_prefix/caffe_alexnet_train"
solver_mode: GPU

5. Starting training

#!/usr/bin/env sh
#By Bill
set -e

/home/hwang/hwang/caffe-master/build/tools/caffe train \
    --solver=./solver.prototxt $@

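One pass over all the training data is one epoch.

Training on the 1500 food images gave a very low loss, but the accuracy would not improve. According to Xiong Wei, the cause is that there is too little data, so the model overfit.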

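6. Follow-up

I changed the solver as shown below: a smaller lr (because the loss was jumping around a lot) and a smaller stepsize so that the decay actually takes effect. The final accuracy became 66.7%.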

net: "./train_val.prototxt"
test_iter: 32
test_interval: 500
base_lr: 0.0005
lr_policy: "step"
gamma: 0.1
stepsize: 5000
display: 20
max_iter: 10000
momentum: 0.9
weight_decay: 0.0005
snapshot: 2000
snapshot_prefix: "snapshot_prefix/caffe_alexnet_train"
solver_mode: GPU
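
The effect of the smaller stepsize can be checked numerically. A minimal sketch, assuming Caffe's "step" rule lr = base_lr * gamma ^ floor(iter / stepsize); the helper name step_lr is hypothetical.

```shell
#!/usr/bin/env sh
# step_lr BASE_LR GAMMA STEPSIZE ITER
# Caffe "step" policy: lr = base_lr * gamma ^ floor(iter / stepsize)
step_lr() {
    awk -v base="$1" -v gamma="$2" -v step="$3" -v it="$4" \
        'BEGIN { printf "%g\n", base * gamma ^ int(it / step) }'
}

step_lr 0.001  0.1 100000 9999   # earlier solver: still 0.001 at the last iteration
step_lr 0.0005 0.1 5000   4999   # revised solver: 0.0005
step_lr 0.0005 0.1 5000   5000   # revised solver: 5e-05 after the first decay
```

With stepsize: 100000 and max_iter: 10000 the learning rate never decays, while with stepsize: 5000 it drops once, halfway through training.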