Course 2 - week 2 - optimization algorithms

optimization algorithms

1 - mini-batch gradient descent

Vectorization allows you to efficiently compute on m examples. But if m is large, training can still be very slow. When we implement gradient descent on the whole training set, we have to process the entire training set before we take one little step of gradient descent, and then process the entire training set again before we take another step.

What we can do is split the giant training set into many baby subsets, called mini-batches.

If m = 5,000,000 and each mini-batch contains 1,000 examples:

$$X^{\{1\}} = (x^{(1)}, x^{(2)}, \cdots, x^{(1000)}),\; \cdots,\; X^{\{5000\}} = (x^{(4999001)}, \cdots, x^{(5000000)})$$

Similarly, do the same thing for $Y$: split up the labels into mini-batches accordingly.

So mini-batch $t$ is comprised of $X^{\{t\}}$ and $Y^{\{t\}}$.
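As a concrete illustration, here is a minimal numpy-style sketch of this split. It assumes the course convention that $X$ has shape $(n_x, m)$ and $Y$ has shape $(1, m)$, one column per example; the function name `make_mini_batches` is just for illustration.

```python
def make_mini_batches(X, Y, batch_size=1000):
    """Split (X, Y) into mini-batches of `batch_size` columns each."""
    m = X.shape[1]                       # number of training examples
    mini_batches = []
    for t in range(0, m, batch_size):
        X_t = X[:, t:t + batch_size]     # X^{t}, shape (n_x, batch_size)
        Y_t = Y[:, t:t + batch_size]     # Y^{t}, shape (1, batch_size)
        mini_batches.append((X_t, Y_t))  # last batch may be smaller
    return mini_batches
```

In practice the examples are usually shuffled before being split, so that each mini-batch is a representative sample of the training set.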

Note:

  • $x^{(i)}$ is the $i^{th}$ training example
  • $z^{[l]}$ refers to the $z$ value of layer $l$
  • $X^{\{t\}}$ and $Y^{\{t\}}$ denote the $t^{th}$ mini-batch



Let's see how mini-batch gradient descent works.



A single pass through the training set like this is called one epoch of training. With batch gradient descent, one epoch allows you to take only one gradient descent step; with mini-batch gradient descent, a single pass through the training set, that is, one epoch, allows you to take 5,000 gradient descent steps, as sketched below.
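As a sketch, one epoch of mini-batch gradient descent might look like the following, where `forward_prop`, `backward_prop`, and `update_parameters` are hypothetical placeholders for your usual vectorized network routines:

```python
def train_one_epoch(parameters, mini_batches, alpha):
    """One pass over the training set = one gradient step per mini-batch."""
    # forward_prop, backward_prop, update_parameters are hypothetical helpers
    for X_t, Y_t in mini_batches:    # 5000 iterations when m = 5,000,000
        cache = forward_prop(X_t, parameters)     # vectorized over 1000 examples
        grads = backward_prop(X_t, Y_t, cache, parameters)
        parameters = update_parameters(parameters, grads, alpha)
    return parameters
```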

When we have a large training set, mini-batch gradient descent runs much faster than batch gradient descent.

2 - understanding mini-batch gradient descent

We will learn the details of how to implement mini-batch gradient descent and gain a better understanding of what it's doing and why it works.



One of the parameters we need to choose is the size of the mini-batch. Let m be the training set size.

  • On one extreme, if the mini-batch size is m, we just end up with batch gradient descent.
  • At the other extreme, we could set the mini-batch size to 1; this is called stochastic gradient descent. Here every example is its own mini-batch, and we look at just one training example to take each step of gradient descent.

Stochastic gradient descent won't ever converge; it'll always just oscillate and wander around the region of the minimum.



If we use batch gradient descent, we process the entire huge training set on every iteration; the main disadvantage is that it takes too long per iteration to take a single step of gradient descent.

If we go to the opposite extreme and use stochastic gradient descent, the huge disadvantage is that we lose the speedup from vectorization, since we are processing a single training example at a time.

So what works best in practice is something in between: a mini-batch size that is not too big and not too small. This gives us the fastest learning, with two advantages:

  • We get a speedup from vectorization.
  • We can make progress without needing to wait until we scan the entire training set; each epoch allows us to take many gradient descent steps.

How do we choose the mini-batch size?

  • If we have a small training set, just use batch gradient descent.
    • less than 2,000 examples
  • If we have a larger training set, use a typical mini-batch size:
    • 64, 128, 256, or 512 (powers of 2 tend to run faster given how memory is laid out)

3 - exponentially weighted averages

We will talk about a few optimization algorithms that are faster than gradient descent; they all build on an idea called exponentially weighted averages.



Suppose we want to compute the trend, a local or moving average, of daily temperatures over the last year: $\theta_1 = 40°F,\ \theta_2 = 49°F,\ \cdots,\ \theta_{180} = 56°F,\ \cdots$ Here is what we can do:

$$V_0 = 0$$
$$V_1 = 0.9 V_0 + 0.1\theta_1$$
$$V_2 = 0.9 V_1 + 0.1\theta_2$$
$$\cdots$$
$$V_t = 0.9 V_{t-1} + 0.1\theta_t$$

In this way, we get an exponentially weighted average of the daily temperature, shown as the red line in the plot. In general:

$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$

When we compute this, we can think of $V_t$ as approximately averaging over the last $\frac{1}{1-\beta}$ days' temperature:

  • $\beta = 0.9$: averaging over roughly the last 10 days' temperature
  • $\beta = 0.98$: averaging over roughly the last 50 days, the green line, which adapts slowly to temperature changes
  • $\beta = 0.5$: averaging over roughly the last 2 days, the yellow line, which adapts quickly to temperature changes but is noisier



4 - understanding exponentially weighted averages

We talked about exponentially weighted averages; they will turn out to be a key component of several optimization algorithms.
Here is the key equation for implementing an exponentially weighted average:

$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$

Let's look a bit deeper to understand how this computes averages of the daily temperature. Take $\beta = 0.9$:

$$v_{100} = 0.9 v_{99} + 0.1\theta_{100}$$
$$v_{99} = 0.9 v_{98} + 0.1\theta_{99}$$
$$v_{98} = 0.9 v_{97} + 0.1\theta_{98}$$
$$\cdots$$

Expanding:

$$v_{100} = 0.1\theta_{100} + 0.9\,(0.1\theta_{99} + 0.9\,(0.1\theta_{98} + 0.9 v_{97}))$$

$$v_{100} = 0.1\theta_{100} + 0.1 \times 0.9\,\theta_{99} + 0.1 \times 0.9^2\,\theta_{98} + 0.1 \times 0.9^3\,\theta_{97} + \cdots$$



So it's really taking the daily temperatures, multiplying them by an exponentially decaying weight, and summing up; that is what $v_{100}$ is.

How many days of temperature is this averaging over?

Setting $\epsilon = 1-\beta$, we can use the fact that

$$(1-\epsilon)^{\frac{1}{\epsilon}} \approx \frac{1}{e} \approx 0.35$$

i.e.

$$\beta^{\frac{1}{1-\beta}} \approx \frac{1}{e} \approx 0.35$$

In other words, when $\beta = 0.9$, after 10 days ($\frac{1}{1-0.9} = 10$) the weight has decayed to less than $\frac{1}{3}$ of the weight of the current day. And if $\beta = 0.98$, then $0.98^{50} \approx \frac{1}{e}$. So we get the rule of thumb that we are averaging over roughly the last $\frac{1}{1-\beta}$ days.
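A quick numeric check of this rule of thumb:

```python
import math

# weight remaining on a reading 1/(1-beta) days back, compared with 1/e
print(0.9 ** 10)    # 0.3487  (beta = 0.9,  10 days)
print(0.98 ** 50)   # 0.3642  (beta = 0.98, 50 days)
print(1 / math.e)   # 0.3679
```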

Implementing exponentially weighted averages:



One advantage of the exponentially weighted average is that it takes very little memory: we just need to keep a single real number $V_\theta$ in memory and keep overwriting it. For this reason it takes basically one line of code, plus the storage for a single real number, to compute the exponentially weighted average.
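A minimal sketch of that implementation, assuming the readings arrive one at a time as a Python iterable:

```python
def ewa(thetas, beta=0.9):
    """Exponentially weighted average, keeping one number in memory."""
    v = 0.0                                 # the single value we overwrite
    trend = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta   # V := beta*V + (1-beta)*theta_t
        trend.append(v)
    return trend
```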

It's not the most accurate way to compute an average: if you computed a moving window, explicitly summing the last 10 days' temperatures and dividing by 10, that would usually give you a better estimate. But the disadvantage is that you then need to explicitly keep all the recent temperatures around, which requires more memory and is more complicated to implement.

So when we need to compute averages of a lot of variables, this is a very efficient way to do so, from both the computation and the memory point of view.

5 - bias correction in exponentially weighted averages

We have learned how to implement exponentially weighted averages. There's one technical detail called bias correction that can make your computation of these averages more accurate.

$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$



With $\beta = 0.98$:

$$V_0 = 0$$
$$V_1 = 0.98 V_0 + 0.02\theta_1 = 0.02\theta_1$$
$$V_2 = 0.98 V_1 + 0.02\theta_2 = 0.0196\theta_1 + 0.02\theta_2$$

So $V_1$ and $V_2$ are not very good estimates of the temperature on the first and second days. There is a way to modify the estimate to make it much more accurate, especially during the initial phase:

$$V_t = \frac{\beta V_{t-1} + (1-\beta)\theta_t}{1-\beta^t}$$

Notice that as $t$ becomes large, $\beta^t$ approaches 0, which is why, once $t$ is large enough, the bias correction makes almost no difference. But during the initial phase, while the estimate is still warming up, bias correction gives a better estimate.
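The same sketch as before with bias correction added; only the division by $1-\beta^t$ is new:

```python
def ewa_corrected(thetas, beta=0.98):
    """Exponentially weighted average with bias correction."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):  # t is 1-based
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))    # bias-corrected estimate
    return corrected
```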

In machine learning, most implementations of the exponentially weighted average don't bother to implement bias correction.

6 - gradient descent with momentum

Momentum almost always works faster than standard gradient descent. The basic idea is to compute an exponentially weighted average of the gradients, and then use that average to update the weights instead.



These up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. On the vertical axis we want the steps to be a bit smaller, because we don't want those oscillations; on the horizontal axis we want faster learning. Here is what we can do if we implement gradient descent with momentum.

Momentum:

on iteration t:
1. compute $dW$, $db$ on the current mini-batch
2. $V_{dW} = \beta V_{dW} + (1-\beta)\, dW$
3. $V_{db} = \beta V_{db} + (1-\beta)\, db$
4. $W = W - \alpha V_{dW}$, $\quad b = b - \alpha V_{db}$

What this does is smooth out the steps of gradient descent: averaging the oscillating gradients in the vertical direction brings them close to 0, because positive and negative values cancel out, whereas in the horizontal direction the average stays pretty big. So the algorithm takes a more direct path toward the minimum.


Let's look in detail at how to implement it:

There are two hyperparameters, $\alpha$ and $\beta$; the most common value for $\beta$ is 0.9, which roughly averages over the last 10 gradients.

Initialize $V_{dW} = 0$, $V_{db} = 0$.

on iteration t:
1. compute $dW$, $db$ on the current mini-batch
2. $V_{dW} = \beta V_{dW} + (1-\beta)\, dW$
3. $V_{db} = \beta V_{db} + (1-\beta)\, db$
4. $W = W - \alpha V_{dW}$, $\quad b = b - \alpha V_{db}$
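As a sketch, one momentum step might look like this in numpy-style code, with `dW` and `db` coming from backprop on the current mini-batch and `v_dW`, `v_db` initialized to zeros of the same shapes:

```python
def momentum_step(W, b, dW, db, v_dW, v_db, beta=0.9, alpha=0.01):
    """One gradient descent step with momentum."""
    v_dW = beta * v_dW + (1 - beta) * dW   # moving average of dW
    v_db = beta * v_db + (1 - beta) * db   # moving average of db
    W = W - alpha * v_dW                   # step with the smoothed gradients,
    b = b - alpha * v_db                   # not the raw ones
    return W, b, v_dW, v_db
```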

7 - RMSprop

We have seen how using momentum can speed up gradient descent. There is another algorithm called RMSprop, which stands for root mean square prop, that can also speed up gradient descent.



To provide intuition for this algorithm, let's say the vertical axis is the parameter $b$ and the horizontal axis is the parameter $W$. We want to slow down the learning in the $b$ direction, and speed up, or at least not slow down, the learning in the $W$ direction.

Here is what the RMSprop algorithm does to accomplish this:

on iteration t:

1. compute $dW$, $db$ on the current mini-batch
2. $S_{dW} = \beta S_{dW} + (1-\beta)\, dW^2$, $\; S_{db} = \beta S_{db} + (1-\beta)\, db^2$ (the squares are element-wise)
3. $W = W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \epsilon}$, $\; b = b - \alpha \dfrac{db}{\sqrt{S_{db}} + \epsilon}$

Let's gain some intuition about how this works. In the $W$ direction we want learning to go pretty fast, whereas in the $b$ direction we want to damp the oscillations. So what we are hoping is that $S_{dW}$ will be relatively small while $S_{db}$ will be relatively large. And indeed, the derivatives are much larger in the $b$ direction than in the $W$ direction: $db^2$ will be relatively large, so $S_{db}$ will be relatively large, whereas $dW$ will be smaller, so $S_{dW}$ will be relatively small. The effect is that the updates in the $b$ direction are divided by a larger number, which damps out the oscillations, while the updates in the $W$ direction are barely slowed. You can therefore use a larger learning rate to get faster learning and speed up your learning algorithm.
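A corresponding sketch of one RMSprop step; the squares and the division are element-wise, `eps` guards against dividing by zero, and the default `beta` here is just illustrative:

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, beta=0.9, alpha=0.01, eps=1e-8):
    """One RMSprop step; s_dW and s_db start at zeros."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # element-wise square
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)  # large S => damped step
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```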

8 - Adam optimization algorithm

The Adam optimization algorithm basically takes momentum and RMSprop and puts them together.

Adam optimization algorithm

Initialize $V_{dW} = 0$, $V_{db} = 0$, $S_{dW} = 0$, $S_{db} = 0$.

on iteration t:
1. compute $dW$, $db$ on the current mini-batch
2. $V_{dW} = \beta_1 V_{dW} + (1-\beta_1)\, dW$, $\; V_{db} = \beta_1 V_{db} + (1-\beta_1)\, db$
3. $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\, dW^2$, $\; S_{db} = \beta_2 S_{db} + (1-\beta_2)\, db^2$
4. $V_{dW}^{\text{corrected}} = \dfrac{V_{dW}}{1-\beta_1^t}$, $\; V_{db}^{\text{corrected}} = \dfrac{V_{db}}{1-\beta_1^t}$
5. $S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1-\beta_2^t}$, $\; S_{db}^{\text{corrected}} = \dfrac{S_{db}}{1-\beta_2^t}$
6. $W = W - \alpha \dfrac{V_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \epsilon}$, $\; b = b - \alpha \dfrac{V_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \epsilon}$

This is a commonly used learning algorithm that has proven very effective across many different neural network architectures; a short code sketch follows the hyperparameter list below.

Hyperparameter choices:

  • $\alpha$: needs to be tuned
  • $\beta_1 = 0.9$: the weighted average of $dW$
  • $\beta_2 = 0.999$: the weighted average of $dW^2$
  • $\epsilon = 10^{-8}$
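Putting the pieces together, here is a sketch of one Adam update for a single parameter tensor, using the defaults above ($\alpha$ is passed in explicitly since it needs tuning):

```python
import numpy as np

def adam_step(param, grad, v, s, t, alpha,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; v and s start at zeros, t is the 1-based step count."""
    v = beta1 * v + (1 - beta1) * grad        # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop term
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    param = param - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return param, v, s
```

The same function is applied to $W$ (with $dW$) and to $b$ (with $db$), each keeping its own $v$, $s$ state.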

Where does the term "Adam" come from? "Adam" stands for Adaptive Moment Estimation.

9 - learning rate decay

One of the things that may speed up a learning algorithm is to slowly reduce the learning rate over time; we call this learning rate decay.

Suppose we are implementing mini-batch gradient descent with a reasonably small mini-batch, maybe just 64 examples. As we iterate, the steps will be a little bit noisy, and the algorithm might end up wandering around the minimum without ever really converging, because we are using a fixed value of $\alpha$ and there is noise in the different mini-batches.



But if we slowly reduce the learning rate, then during the initial phase learning is still relatively fast, and as $\alpha$ gets smaller the steps become smaller too. So we end up oscillating in a tight region around the minimum rather than wandering far away. The intuition behind slowly reducing $\alpha$ is that during the initial steps of learning we can afford much bigger steps, but as learning approaches convergence we need a slower learning rate to take smaller steps. One common schedule is:

$$\alpha = \frac{1}{1 + \text{decay-rate} \times \text{epoch-num}}\, \alpha_0$$

For example, with $\alpha_0 = 0.2$ and decay-rate $= 1$, the learning rate over the first few epochs is 0.1, 0.067, 0.05, 0.04, $\cdots$



There are a few other schedules people use:

$$\alpha = 0.95^{\,\text{epoch-num}} \cdot \alpha_0 \qquad \text{(exponential decay)}$$

$$\alpha = \frac{k}{\sqrt{\text{epoch-num}}}\, \alpha_0 \qquad \text{or} \qquad \alpha = \frac{k}{\sqrt{t}}\, \alpha_0$$

or a discrete staircase learning rate.
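A sketch of these schedules in code; `alpha0`, `decay_rate`, and `k` are the constants from the formulas above:

```python
import math

def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num):
    """alpha = 0.95^epoch_num * alpha0"""
    return 0.95 ** epoch_num * alpha0

def lr_sqrt_decay(alpha0, k, epoch_num):
    """alpha = (k / sqrt(epoch_num)) * alpha0"""
    return k / math.sqrt(epoch_num) * alpha0

# alpha0 = 0.2, decay_rate = 1  ->  0.1, 0.0667, 0.05, 0.04, ...
for epoch in range(1, 5):
    print(lr_inverse_decay(0.2, 1, epoch))
```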

Learning rate decay is usually a little lower down the list of things to try tuning.

10 - the problem of local optima

In the first picture, it looks like there are a lot of local optima, and it would be easy for gradient descent, or one of the other algorithms, to get stuck in a local optimum rather than find its way to the global optimum. It turns out that if you plot a figure like this in two dimensions, it's easy to create a plot with a lot of different local optima, and these low-dimensional plots used to guide our intuition. But that intuition is not actually correct. If we train a neural network, most points of zero gradient are not local optima; instead, most points of zero gradient are saddle points. So one of the lessons here is that a lot of our intuition about low-dimensional spaces doesn't transfer to very high-dimensional spaces: if we have, say, 20,000 parameters, we are much more likely to see saddle points than local optima.



It turns out that plateaus can really slow down learning.



A plateau is a region where the derivative is close to zero for a long time. The takeaways:

  1. We are unlikely to get stuck in a bad local optimum as long as we are training a reasonably large neural network, because the cost function J is defined over a very high-dimensional space.
  2. Plateaus can make learning slow, and this is where algorithms like momentum, RMSprop, or Adam can really help, speeding up the rate at which we move along and eventually get off the plateau.