2.1 Mini-batch Gradient Descent


Applying machine learning is a highly empirical, highly iterative process in which you have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly.

It turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire giant training set of, say, 5 million examples. In particular, here’s what you can do.

With $m$ training examples:

$X: (n_x, m), \quad Y: (1, m)$

Let’s say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. And let’s say each of your baby training sets has just 1000 examples. So you take $x^{(1)}$ through $x^{(1000)}$ and call that your first little baby training set, also called a mini-batch, denoted $X^{\{1\}}$. Then you take the next 1000 examples, $x^{(1001)}$ through $x^{(2000)}$, and call that $X^{\{2\}}$, then the next 1000 examples, and so on. Altogether, with 5 million training examples, you would have 5,000 of these mini-batches.

You would also split up your training data for Y accordingly.

$X^{\{t\}}: (n_x, 1000), \quad Y^{\{t\}}: (1, 1000)$

Notation:

  • $x^{(i)}$: the $i$-th training example
  • $z^{[l]}$: the hidden units of layer $l$
  • $X^{\{t\}}, Y^{\{t\}}$: the $t$-th mini-batch
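As a concrete illustration of the shapes above, here is a minimal NumPy sketch of slicing the training set into mini-batches. The helper name `make_mini_batches` and the column shuffling are my own choices for the example, not part of the lecture:

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Split X of shape (n_x, m) and Y of shape (1, m) into mini-batches.

    Returns a list of (X_t, Y_t) pairs, each with up to `batch_size` columns.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]

    # Shuffle the columns so each mini-batch is a random slice of the data.
    perm = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, perm], Y[:, perm]

    mini_batches = []
    for start in range(0, m, batch_size):
        X_t = X_shuffled[:, start:start + batch_size]   # (n_x, 1000)
        Y_t = Y_shuffled[:, start:start + batch_size]   # (1, 1000)
        mini_batches.append((X_t, Y_t))
    return mini_batches

# With m = 5,000,000 examples and batch_size = 1000, this yields 5,000 mini-batches.
```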

To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. The name comes from viewing that as processing your entire batch of training examples at the same time. I know it’s not a great name, but that’s just what it’s called. Mini-batch gradient descent, in contrast, refers to the algorithm we’ll talk about on the next slide, in which you process a single mini-batch $X^{\{t\}}, Y^{\{t\}}$ at a time, rather than processing your entire training set $X, Y$ at the same time.


Pseudocode:

repeat {
for t = 1, …, 5000:

  • Forward prop on $X^{\{t\}}$ (vectorized implementation over the 1000 examples):
    $Z^{[1]} = W^{[1]} X^{\{t\}} + b^{[1]}$
    $A^{[1]} = g^{[1]}(Z^{[1]})$
    $\;\;\vdots$
    $A^{[L]} = g^{[L]}(Z^{[L]})$
  • Compute the cost
    $J^{\{t\}} = \frac{1}{1000} \sum_{i} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \lVert W^{[l]} \rVert_F^2$
  • Back prop to compute gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$)
  • $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \quad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$
}
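To make the pseudocode above concrete, here is a minimal NumPy sketch of one epoch, assuming a 2-layer network (ReLU hidden layer, sigmoid output, cross-entropy loss) with parameters stored in a dict; the regularization term is omitted for brevity, and the helper name `one_epoch` is my own:

```python
import numpy as np

def one_epoch(params, mini_batches, alpha=0.01):
    """One pass over all mini-batches (one epoch) of mini-batch gradient descent.

    `params` holds W1 (n_h, n_x), b1 (n_h, 1), W2 (1, n_h), b2 (1, 1).
    """
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    costs = []

    for X_t, Y_t in mini_batches:            # each X_t: (n_x, 1000), Y_t: (1, 1000)
        m_t = X_t.shape[1]

        # Forward prop on the mini-batch (vectorized over its 1000 examples).
        Z1 = W1 @ X_t + b1
        A1 = np.maximum(0, Z1)               # ReLU
        Z2 = W2 @ A1 + b2
        A2 = 1 / (1 + np.exp(-Z2))           # sigmoid, A2 = y_hat

        # Cost J{t} on this mini-batch (cross-entropy, no regularization here).
        J_t = -np.mean(Y_t * np.log(A2 + 1e-8) + (1 - Y_t) * np.log(1 - A2 + 1e-8))
        costs.append(float(J_t))

        # Back prop to compute gradients of J{t}.
        dZ2 = A2 - Y_t
        dW2 = (dZ2 @ A1.T) / m_t
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m_t
        dZ1 = (W2.T @ dZ2) * (Z1 > 0)
        dW1 = (dZ1 @ X_t.T) / m_t
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m_t

        # One gradient descent step per mini-batch (updates params in place).
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2

    return params, costs
```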

This is one pass through your training set using mini-batch gradient descent. The code written here is also called doing one epoch of training; an epoch is a word that means a single pass through the training set. Whereas with batch gradient descent a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take 5,000 gradient descent steps. Of course, you usually want to take multiple passes through the training set, so you might put another for loop or while loop around this, and you keep taking passes through the training set until hopefully you converge, or approximately converge. When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and it is pretty much what everyone in deep learning uses when training on a large data set.
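Putting this together, the outer loop over epochs might look like the sketch below, reusing the hypothetical `make_mini_batches` and `one_epoch` helpers sketched above; the data sizes, initialization, and hyperparameters are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, m = 20, 10, 5_000        # small sizes for illustration; the lecture uses m = 5,000,000
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

params = {
    "W1": rng.standard_normal((n_h, n_x)) * 0.01, "b1": np.zeros((n_h, 1)),
    "W2": rng.standard_normal((1, n_h)) * 0.01,   "b2": np.zeros((1, 1)),
}

mini_batches = make_mini_batches(X, Y, batch_size=1000)   # 5 mini-batches here
for epoch in range(10):            # multiple passes (epochs) through the training set
    params, costs = one_epoch(params, mini_batches, alpha=0.1)
    print(f"epoch {epoch}: last mini-batch cost = {costs[-1]:.4f}")
```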
