Principles for setting the SGD solver's learning rate α and momentum μ

SGD

Stochastic gradient descent (type: "SGD") updates the weights $W$ by a linear combination of the negative gradient $\nabla L(W)$ and the previous weight update $V_t$. The learning rate $\alpha$ is the weight of the negative gradient. The momentum $\mu$ is the weight of the previous update.

Formally, we have the following formulas to compute the update value $V_{t+1}$ and the updated weights $W_{t+1}$ at iteration $t+1$, given the previous weight update $V_t$ and current weights $W_t$:

$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$
$$W_{t+1} = W_t + V_{t+1}$$
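In code, one iteration of this update could look like the following minimal NumPy sketch (the function and variable names here are illustrative, not Caffe's internals):

import numpy as np

def sgd_momentum_step(W, V, grad, lr=0.01, momentum=0.9):
    # V_{t+1} = mu * V_t - alpha * grad L(W_t)
    V_next = momentum * V - lr * grad
    # W_{t+1} = W_t + V_{t+1}
    W_next = W + V_next
    return W_next, V_next

# toy usage: a single parameter vector, a made-up gradient, and a zero initial update
W = np.zeros(3)
V = np.zeros_like(W)
grad = np.array([0.5, -0.2, 0.1])
W, V = sgd_momentum_step(W, V, grad)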

The learning “hyperparameters” ($\alpha$ and $\mu$) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks [1].

[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (NIPS), 2012.

Rules of thumb for setting the learning rate α and momentum μ

A good strategy for deep learning with SGD is to initialize the learning rate $\alpha$ to a value around $\alpha \approx 0.01 = 10^{-2}$, and drop it by a constant factor (e.g., 10) throughout training whenever the loss begins to reach an apparent “plateau”, repeating this several times. Generally, you probably want to use a momentum $\mu = 0.9$ or a similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both more stable and faster.
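Before the concrete Caffe recipe below, here is a hedged sketch of that plateau-based drop in plain Python (the helper name and the simple moving-average plateau test are assumptions for illustration, not part of Caffe):

def maybe_drop_lr(lr, loss_history, factor=0.1, window=1000, tol=1e-3):
    # Call this once per evaluation interval, not every iteration.
    # Drops lr by `factor` when the average loss over the last `window`
    # iterations has stopped improving versus the window before it.
    if len(loss_history) < 2 * window:
        return lr
    recent = sum(loss_history[-window:]) / window
    previous = sum(loss_history[-2 * window:-window]) / window
    if previous - recent < tol:   # apparent plateau
        lr *= factor              # e.g., 0.01 -> 0.001
    return lr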

This was the strategy used by Krizhevsky et al. [2] in their famously winning CNN entry to the ILSVRC-2012 competition, and Caffe makes this strategy easy to implement in a SolverParameter, as in our reproduction of [2] at ./examples/imagenet/alexnet_solver.prototxt.

To use a learning rate policy like this, you can put the following lines somewhere in your solver prototxt file:

base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2

lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                  # by a factor of gamma every stepsize iterations

gamma: 0.1        # drop the learning rate by a factor of 10
                  # (i.e., multiply it by a factor of gamma = 0.1)

stepsize: 100000  # drop the learning rate every 100K iterations

max_iter: 350000  # train for 350K iterations total

momentum: 0.9

Under the above settings, we’ll always use momentum $\mu = 0.9$. We’ll begin training at a base_lr of $\alpha = 0.01 = 10^{-2}$ for the first 100,000 iterations, then multiply the learning rate by gamma ($\gamma$) and train at $\alpha' = \alpha\gamma = (0.01)(0.1) = 0.001 = 10^{-3}$ for iterations 100K–200K, then at $\alpha'' = 10^{-4}$ for iterations 200K–300K, and finally train until iteration 350K (since we have max_iter: 350000) at $\alpha''' = 10^{-5}$.
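For reference, the "step" policy computes the rate as base_lr multiplied by gamma raised to floor(iter / stepsize); a small Python sketch reproduces the schedule walked through above:

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    # Caffe "step" policy: lr = base_lr * gamma ^ floor(iteration / stepsize)
    return base_lr * gamma ** (iteration // stepsize)

for it in (0, 100000, 200000, 300000):
    print(it, step_lr(it))   # approximately 1e-2, 1e-3, 1e-4, 1e-5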

Note that the momentum setting $\mu$ effectively multiplies the size of your updates by a factor of $\frac{1}{1-\mu}$ after many iterations of training, so if you increase $\mu$, it may be a good idea to decrease $\alpha$ accordingly (and vice versa).

For example, with $\mu = 0.9$, we have an effective update size multiplier of $\frac{1}{1-0.9} = 10$. If we increased the momentum to $\mu = 0.99$, we’ve increased our update size multiplier to 100, so we should drop $\alpha$ (base_lr) by a factor of 10.
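A quick check of that arithmetic in plain Python (just the $\frac{1}{1-\mu}$ formula from the text):

def effective_multiplier(momentum):
    # long-run scaling of the update size under momentum
    return 1.0 / (1.0 - momentum)

print(effective_multiplier(0.9))    # ~10
print(effective_multiplier(0.99))   # ~100, so drop base_lr by about 10x to compensate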

Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even to work at all!) in every situation. If learning diverges (e.g., you start to see very large, NaN, or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.
