Deep Learning Frameworks (1): A Practical Caffe Workflow

Article link: https://blog.csdn.net/huohu123456789/article/details/90666147

Data generation: .txt, .jpg
File processing: .lmdb, .binaryproto
Network definition: .prototxt
Model training: .caffemodel, .solverstate
Command-line usage

Data generation

Dataset
Processing script

Generating the .txt list file

DATA=data/train
WORK=data
echo "Create train.txt..."
rm -rf $DATA/train.txt
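# Quote the patterns so the shell passes them to find unexpanded.
# Each output line is "<filename> <label>": cat = 0, dog = 1.
find $DATA -name 'cat.*.jpg' | cut -d '/' -f3 | sed "s/$/ 0/">>$DATA/train.txt
find $DATA -name 'dog.*.jpg' | cut -d '/' -f3 | sed "s/$/ 1/">>$DATA/tmp.txt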
cat $DATA/tmp.txt>>$DATA/train.txt
rm -rf $DATA/tmp.txt
mv $DATA/train.txt $WORK/
echo "Done.."
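
The same labelling rule can also be sketched in Python, which makes the file-name-to-label mapping explicit. This is only an illustration: the `LABELS` table and the `label_lines` helper are hypothetical names, not part of Caffe, and the sketch assumes the images are named `cat.N.jpg` and `dog.N.jpg` as in the script above.

```python
from fnmatch import fnmatch

# Hypothetical mapping from file-name pattern to class label (cat = 0, dog = 1).
LABELS = {"cat.*.jpg": 0, "dog.*.jpg": 1}


def label_lines(filenames):
    """Return "<filename> <label>" lines for every name matching a known pattern."""
    lines = []
    for name in sorted(filenames):
        for pattern, label in LABELS.items():
            if fnmatch(name, pattern):
                lines.append(f"{name} {label}")
                break  # a file gets at most one label
    return lines


# Files that match no pattern (such as notes.txt) are skipped.
print("\n".join(label_lines(["dog.7.jpg", "cat.0.jpg", "notes.txt"])))
```

Writing the returned lines to data/train.txt would give a list file in the same "<filename> <label>" form that the shell script produces.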

File processing

Processing script

Generating the LMDB

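#!/usr/bin/env sh
# Convert the test images listed in data/test.txt into an LMDB.
DATA=data/test
WORK=data
echo "Create catdog_test_lmdb..."
rm -rf $WORK/catdog_test_lmdb
# Arguments: image root folder, list file, output database.
# Each file name in test.txt is resolved relative to the root folder.
./build/tools/convert_imageset --shuffle --resize_height=128 --resize_width=128 \
$DATA/ \
$WORK/test.txt \
$WORK/catdog_test_lmdb
echo "Done.."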

#!/usr/bin/env sh
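# These three paths match the ones in the LMDB generation script and can be copied over directly.
EXAMPLE=data
DATA=data
TOOLS=./build/tools
# Arguments: the training LMDB folder, then the name of the mean file to generate (with the .binaryproto suffix).
$TOOLS/compute_image_mean $EXAMPLE/catdog_train_lmdb $DATA/catdog_train_mean.binaryproto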
echo "Done."
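
To make clear what the mean file holds, here is a minimal sketch of the underlying idea, assuming the tool averages all training images value by value so that the result has the same shape as one image. The `mean_image` helper is a hypothetical name for illustration; it works on plain Python lists and does not read an LMDB or write a .binaryproto file.

```python
def mean_image(images):
    """Element-wise mean of equally sized images given as flat lists of pixel values."""
    if not images:
        raise ValueError("need at least one image")
    size = len(images[0])
    if any(len(img) != size for img in images):
        raise ValueError("all images must have the same number of values")
    count = len(images)
    # Average position by position across all images.
    return [sum(img[i] for img in images) / count for i in range(size)]


# Two tiny 2x2 single-channel "images", flattened row by row.
print(mean_image([[0, 64, 128, 255], [100, 64, 0, 255]]))
```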

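Network definition

catdognet_train_test.prototxt

This is the network structure file. It describes the network's inputs and outputs and how the layers are connected; in the narrow sense, designing a network means writing the contents of this file.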

name: "CatDogNet"
layer {
  name: "catdog"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "/data/catdog_train_lmdb"
    batch_size: 16
    backend: LMDB
  }
}
layer {
  name: "catdog"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "/data/catdog_test_lmdb"
    batch_size: 16
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}

......
......
......

layer {
  name: "drop2"
  type: "Dropout"
  bottom: "ip2"
  top: "ip2"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "ip3"
  type: "InnerProduct"
  bottom: "ip2"
  top: "ip3"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip3"
  bottom: "label"
  top: "loss"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip3"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
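
As a quick check on the conv1 settings above, the spatial size of a convolution output can be worked out from kernel_size, pad and stride. The sketch below uses the usual formula, floor((input + 2 * pad - kernel) / stride) + 1; the `conv_output_size` helper is a hypothetical name, and the input size of 128 is an assumption taken from the resize flags in the LMDB script.

```python
def conv_output_size(input_size, kernel_size, pad=0, stride=1):
    """Output width (or height) of a convolution along one spatial dimension."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


# conv1: 128-pixel input, kernel_size 3, pad 1, stride 1
print(conv_output_size(128, kernel_size=3, pad=1, stride=1))
```

With kernel_size 3, pad 1 and stride 1 the output keeps the input's width and height, so under these assumptions conv1 changes only the number of channels (num_output: 32).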

catdognet_deploy.prototxt

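This file is written from train_val.prototxt (here, catdognet_train_test.prototxt). Its structure is largely the same, with the training-only parameters removed; it is the file used at test time.

catdognet_solver.prototxt

This file configures the rules used during training. Its parameters are also key to the quality of the resulting model.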

# The train/test net protocol buffer definition
net: "/media/huo/U/SF/project/dogcat/catdognet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# With the test batch size of 16 used here, 50 test iterations
# cover 800 testing images.
test_iter: 50
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.0001
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
# With "fixed" the learning rate stays at base_lr; gamma and power are not used by this policy.
lr_policy: "fixed"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 40000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "./dogcat/catdognet"
# solver mode: CPU or GPU
solver_mode: GPU
#solver_mode: CPU

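Parameter explanation (reposted from "Caffe: solver.prototxt configuration parameters and their meaning")

Training samples
Total: 121368
batch_size: 256
Processing every sample once (one epoch) takes 121368/256 ≈ 474.1, rounded up to 475 iterations.
test_interval is therefore set to 475, so that testing runs only after one full pass over the training data. This value should be at least 475.
To train for 100 epochs, the maximum number of iterations is 475 × 100 = 47500.
Test samples
Likewise, with 1000 test samples and a batch_size of 25, one complete test pass takes 1000/25 = 40 iterations, so test_iter is 40. This value should be at least 40.
Learning rate
The learning rate is set to decrease gradually as the iteration count grows. With 47500 iterations in total split into 5 learning-rate stages, stepsize is set to 47500/5 = 9500, so the learning rate drops once every 9500 iterations.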
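
The arithmetic above can be written down as a small sketch, which is handy when the dataset or batch size changes. The `iterations_per_epoch` helper is a hypothetical name; the numbers are the ones quoted in the explanation.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Iterations needed to see every sample once, rounding the last partial batch up."""
    return math.ceil(num_samples / batch_size)


test_interval = iterations_per_epoch(121368, 256)  # one epoch of training data
max_iter = test_interval * 100                     # 100 epochs
test_iter = iterations_per_epoch(1000, 25)         # one full pass over the test set
stepsize = max_iter // 5                           # five learning-rate stages

print(test_interval, max_iter, test_iter, stepsize)
```
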
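Parameter meanings
net: "examples/AAA/train_val.prototxt" # train/test network definition file
test_iter: 40 # iterations needed for one complete test pass
test_interval: 475 # test every 475 training iterations
base_lr: 0.01 # base learning rate
lr_policy: "step" # learning rate schedule
gamma: 0.1 # factor applied to the learning rate at each step
stepsize: 9500 # number of iterations between learning rate drops
display: 20 # print to screen every 20 iterations
max_iter: 47500 # maximum number of iterations
momentum: 0.9 # momentum
weight_decay: 0.0005 # weight decay
snapshot: 5000 # save a snapshot every 5000 iterations
snapshot_prefix: "models/A1/caffenet_train" # prefix for saved snapshot files
solver_mode: GPU # run on the GPU (or CPU)
stepsize should not be too small. If it is, the learning rate becomes tiny late in training and the model cannot converge fully.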
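
To see why a small stepsize is a problem, it helps to compute the learning rate that the "step" policy produces. The sketch below uses the rule base_lr * gamma ^ floor(iter / stepsize) with the values from the listing above; the `step_lr` helper is a hypothetical name for illustration.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=9500):
    """Learning rate under the "step" policy at a given iteration."""
    return base_lr * gamma ** (iteration // stepsize)


# With stepsize 9500 the rate drops by a factor of 10 every 9500 iterations.
for it in (0, 9500, 19000, 47499):
    print(it, step_lr(it))

# With a much smaller stepsize the rate is effectively zero long before training ends.
print(step_lr(47499, stepsize=1000))
```

With stepsize 9500 the final stage still trains at about 1e-6, whereas with stepsize 1000 the rate has been multiplied by 0.1 forty-seven times by the last iteration.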

Model training

#!/usr/bin/env sh

./build/tools/caffe train --solver=./dogcat/catdognet_solver1.prototxt

./build/tools/caffe train --solver=./dogcat/catdognet_solver2.prototxt --snapshot=./dogcat/catdognet_iter_100000.solverstate

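During training, Caffe writes .caffemodel and .solverstate files. The .caffemodel file holds the trained model weights and can be used to read the parameters back; the .solverstate file holds the intermediate solver state.
If training is interrupted, for example by a power failure, it can be resumed from the .solverstate file. The resume script is the same as the training script, with the snapshot argument added.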
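
When resuming, the most recent .solverstate file has to be picked out first. A minimal sketch of that step follows, assuming snapshot files are named `<prefix>_iter_<N>.solverstate` as in the resume command above; the `latest_solverstate` helper is a hypothetical name and not part of Caffe.

```python
import re


def latest_solverstate(filenames, prefix="catdognet"):
    """Return the .solverstate name with the highest iteration number, or None."""
    pattern = re.compile(re.escape(prefix) + r"_iter_(\d+)\.solverstate$")
    best_name, best_iter = None, -1
    for name in filenames:
        match = pattern.search(name)
        if match and int(match.group(1)) > best_iter:
            best_name, best_iter = name, int(match.group(1))
    return best_name


files = [
    "catdognet_iter_5000.caffemodel",
    "catdognet_iter_5000.solverstate",
    "catdognet_iter_10000.caffemodel",
    "catdognet_iter_10000.solverstate",
]
print(latest_solverstate(files))
```

The returned name is what would be passed to --snapshot in the resume command.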

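Command-line usage

C++ command-line interface (reposted from "Command-line parsing")

The caffe program is invoked from the command line in the following form:

caffe <command> <args>
Example:
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt

There are four choices for <command>:

train: train or fine-tune a model
test: test a model
device_query: show GPU information
time: show execution time

The <args> parameters are:

-solver: required for training. A protocol buffer text file that holds the solver configuration. For example: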

./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt
-gpu: optional. Selects which GPU to run on by its device ID; '-gpu all' runs on every available GPU. For example, to run on the GPU with ID 2:

./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt -gpu 2
-snapshot: optional. Resumes training from a snapshot. Snapshots are configured in the solver file and saved as .solverstate files. For example:

./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt -snapshot examples/mnist/lenet_iter_5000.solverstate
-weights: optional. Fine-tunes the model from pretrained weights. It needs a .caffemodel file and cannot be used together with -snapshot. For example:

./build/tools/caffe train -solver examples/finetuning_on_flickr_style/solver.prototxt -weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
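-iterations: optional. The number of iterations to run. If it is not given, 50 iterations are run by default.
-model: optional. The model definition, in a protocol buffer text file. It can also be specified in the solver configuration file.
-sighup_effect: optional. The action to take when the program receives a hangup signal (SIGHUP). It can be set to snapshot, stop or none; the default is snapshot.
-sigint_effect: optional. The action to take when the program receives a keyboard interrupt (Ctrl+C, SIGINT). It can be set to snapshot, stop or none; the default is stop.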
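
The argument rules above can be captured in a small helper that assembles a `caffe train` command and rejects the one combination the text forbids. This is only a sketch: `build_train_command` and its `caffe_bin` default are hypothetical names chosen for illustration, and only the flags described above are used.

```python
def build_train_command(solver, gpu=None, snapshot=None, weights=None,
                        caffe_bin="./build/tools/caffe"):
    """Assemble the argument list for `caffe train` from the options described above."""
    if snapshot and weights:
        # Resuming from a snapshot and fine-tuning from weights are mutually exclusive.
        raise ValueError("-snapshot and -weights cannot be used together")
    cmd = [caffe_bin, "train", "-solver", solver]
    if gpu is not None:
        cmd += ["-gpu", str(gpu)]
    if snapshot:
        cmd += ["-snapshot", snapshot]
    if weights:
        cmd += ["-weights", weights]
    return cmd


print(" ".join(build_train_command("examples/mnist/lenet_solver.prototxt", gpu="0,1")))
```

The returned list can be passed to `subprocess.run` as it stands, which avoids shell quoting issues with paths.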

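Besides caffe.cpp, the tools folder also contains convert_imageset.cpp, train_net.cpp, test_net.cpp and others.

The test command is used in the testing phase to output the final results. In the model definition file we can choose whether accuracy or loss is output. To validate an already trained model on the validation set, we can write: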
./build/tools/caffe test -model examples/mnist/lenet_train_test.prototxt -weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 100

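This example is fairly long. Besides the test command, it uses four arguments: -model, -weights, -gpu and -iterations. It feeds the trained weights (-weights) into the test model (-model) and runs 100 test iterations (-iterations) on the GPU with ID 0 (-gpu).
The time command displays the program's running time on screen. For example: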
./build/tools/caffe time -model examples/mnist/lenet_train_test.prototxt -iterations 10

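This example displays the time taken by 10 iterations of the LeNet model. It includes the forward and backward time of each iteration, as well as the average forward and backward time of each layer.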

./build/tools/caffe time -model examples/mnist/lenet_train_test.prototxt -gpu 0

This example displays the time taken by 50 iterations (the default) of the LeNet model on the GPU.

./build/tools/caffe time -model examples/mnist/lenet_train_test.prototxt -weights examples/mnist/lenet_iter_10000.caffemodel -gpu 0 -iterations 10

This one times 10 iterations of the LeNet model with the given weights on the first GPU (ID 0).
The device_query command is used to diagnose GPU information.

./build/tools/caffe device_query -gpu 0

Finally, here are two examples involving GPUs:

./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt -gpu 0,1
./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt -gpu all

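These two examples run on two or more GPUs in parallel, which is much faster. If you have only one GPU, or none, leave out the -gpu argument; adding it only slows things down.

Finally, Linux has its own time command, which can be combined with caffe. The final command for running the MNIST example (on one GPU) is then:

$ sudo time ./build/tools/caffe train -solver examples/mnist/lenet_solver.prototxt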