- How to do it:
This post summarizes how to resume training in Caffe using snapshots.
First, you need to generate snapshot files. You do this by setting the snapshot interval in the solver prototxt file. The name of the solver file differs from model to model; it usually looks like cifar10_quick_solver.prototxt.
# snapshot intermediate results
snapshot: 500
This means that a snapshot is taken every 500 iterations, NOT only at the 500th iteration.
Each snapshot consists of two files, model_iter_xxx.caffemodel and model_iter_xxx.solverstate (for example, cifar10_quick_iter_3000.solverstate). The filename prefix can be customized with the snapshot_prefix field in the solver prototxt file.
Once you have a snapshot, you can tell the training script to use it. For cifar10, add the following option to train_quick.sh:
--snapshot=cifar10_quick_iter_3000.solverstate
This resumes training from iteration 3000. A note on this can be found in the ImageNet example at http://caffe.berkeleyvision.org/gathered/examples/imagenet.html.
Despite the fact that you only specified the cifar10_quick_iter_3000.solverstate file, to get it actually running, you ALSO NEED the cifar10_quick_iter_3000.caffemodel file in the directory.
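A small pre-flight check can catch a missing weights file before training starts. The sketch below is only an illustration: check_snapshot_pair is a hypothetical helper name, and it assumes the default snapshot naming in which the two files differ only in their .solverstate and .caffemodel suffixes.

```shell
# Hypothetical helper (not part of Caffe): succeed only if the .caffemodel
# that belongs to a given .solverstate file sits in the same directory.
# Assumes the two files differ only in their suffix.
check_snapshot_pair() {
    state="$1"
    case "$state" in
        *.solverstate) ;;
        *) echo "not a .solverstate file: $state" >&2; return 2 ;;
    esac
    # Derive the weights file name by swapping the suffix.
    model="${state%.solverstate}.caffemodel"
    if [ ! -f "$state" ]; then
        echo "missing solver state: $state" >&2
        return 1
    fi
    if [ ! -f "$model" ]; then
        echo "missing weights: $model" >&2
        return 1
    fi
    return 0
}

# Example call (path taken from the cifar10 example in this post):
# check_snapshot_pair examples/cifar10/cifar10_quick_iter_3000.solverstate
```

The function returns 0 when both files are present, 1 when either one is missing, and 2 when the argument does not end in .solverstate.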
THERE IS ONE TRICK HERE: the snapshot and solver options have to be part of THE SAME COMMAND LINE, so don't miss the "\" line continuation after the solver option:
$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_quick_solver.prototxt \
--snapshot=examples/cifar10/cifar10_quick_iter_3000.solverstate
OTHERWISE, it WILL NOT start from the snapshot and it won’t tell you what the problem is.
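To avoid typing the iteration number by hand, the most recent snapshot can be looked up first. This is a minimal sketch, assuming snapshots are named PREFIX_iter_N.solverstate as above; latest_solverstate is a hypothetical helper name, not a Caffe tool.

```shell
# Hypothetical helper (not part of Caffe): print the .solverstate file with
# the highest iteration number for a given snapshot prefix, for example
# examples/cifar10/cifar10_quick. Returns non-zero if none is found.
latest_solverstate() {
    prefix="$1"
    best=""
    best_iter=-1
    for f in "${prefix}"_iter_*.solverstate; do
        [ -f "$f" ] || continue            # glob matched nothing
        iter="${f##*_iter_}"               # strip everything up to _iter_
        iter="${iter%.solverstate}"        # strip the suffix
        case "$iter" in
            ''|*[!0-9]*) continue ;;       # skip non-numeric names
        esac
        # Compare numerically so that 3000 beats 500.
        if [ "$iter" -gt "$best_iter" ]; then
            best_iter="$iter"
            best="$f"
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Example use, with both options in one command:
# state=$(latest_solverstate examples/cifar10/cifar10_quick) &&
# $TOOLS/caffe train \
#     --solver=examples/cifar10/cifar10_quick_solver.prototxt \
#     --snapshot="$state"
```

The comparison is numeric on purpose: a plain alphabetical sort of the file names would rank iter_500 after iter_3000.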
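- My own summary (verified by training LeNet on MNIST):
For the resume procedure described above:
1) Iteration count: the procedure works in both cases. (1) The initial run specified a number of iterations and was interrupted manually partway through, and you want to continue training. (2) The initial run finished its specified number of iterations, and you set a new iteration count in the solver and continue training.
2) Learning rate: if you do not change the learning-rate policy and learning rate used by the solver in the initial run, the resumed model continues from the learning-rate state it had at the snapshot. If you change the policy or the learning-rate value, the resumed model's learning rate changes accordingly.
In short, when resuming, if nothing in the solver has changed except the iteration count, the model continues training with all of its original state; if you changed other parameters, the new values are loaded into the training.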
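For case (2) above, one way to raise the iteration count without touching the original solver file is to resume from a modified copy. The following is a sketch under assumptions: set_max_iter is a hypothetical helper name, the solver is assumed to contain a single max_iter line, and the output path is a placeholder.

```shell
# Hypothetical helper (not part of Caffe): copy a solver file, replacing
# its max_iter value. Usage: set_max_iter SRC DST NEW_MAX_ITER
set_max_iter() {
    src="$1"
    dst="$2"
    new_iter="$3"
    case "$new_iter" in
        ''|*[!0-9]*) echo "max_iter must be a non-negative integer" >&2; return 2 ;;
    esac
    # Only the max_iter line is rewritten (a trailing comment on that line
    # is dropped); every other setting is copied unchanged.
    sed "s/^[[:space:]]*max_iter:.*/max_iter: ${new_iter}/" "$src" > "$dst"
}

# Example use (the /tmp path and the value 8000 are placeholders):
# set_max_iter examples/cifar10/cifar10_quick_solver.prototxt /tmp/resume_solver.prototxt 8000
# $TOOLS/caffe train \
#     --solver=/tmp/resume_solver.prototxt \
#     --snapshot=examples/cifar10/cifar10_quick_iter_3000.solverstate
```

Because only max_iter differs between the two solver files, this matches the situation in which everything else continues from the snapshot state.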