What is an appropriate batch size for SGD?

The "sample size" you're talking about is referred to as batch size B . The batch size parameter is just one of the hyper-parameters you'll be tuning when you train a neural network with mini-batch Stochastic Gradient Descent (SGD) and is data dependent. The most basic method of hyper-parameter search is to do a grid search over the learning rate and batch size to find a pair which makes the network converge.

To understand what the batch size should be, it's important to see the relationship between batch gradient descent, online SGD, and mini-batch SGD. Here's the general formula for the weight update step in mini-batch SGD, which is a generalization of all three types. [2]

$$\theta_{t+1} \leftarrow \theta_{t} - \epsilon(t)\,\frac{1}{B}\sum_{b=0}^{B-1}\frac{\partial L(\theta, m_b)}{\partial \theta}$$

  1. Batch gradient descent: $B = |x|$
  2. Online stochastic gradient descent: $B = 1$
  3. Mini-batch stochastic gradient descent: $B > 1$ but $B < |x|$.

Note that in case 1 (batch gradient descent), the loss function is no longer a random variable and is not a stochastic approximation.
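To make the update rule concrete, here is a minimal numpy sketch of a single update step, assuming (purely for illustration) a linear model with squared-error loss so the per-example gradient has a closed form; the choice of B is all that separates the three cases above.

```python
import numpy as np

def sgd_step(theta, X, y, batch_indices, lr):
    # One update: theta <- theta - lr * (1/B) * sum of per-example gradients.
    # Assumes squared-error loss L = 0.5 * (x.theta - y)^2 for a linear model,
    # whose per-example gradient is (x.theta - y) * x.
    Xb, yb = X[batch_indices], y[batch_indices]
    grad = (Xb @ theta - yb) @ Xb / len(batch_indices)   # average over the batch
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(5)

full_batch = np.arange(len(X))                           # B = |x|: batch gradient descent
single     = rng.integers(len(X), size=1)                # B = 1:  online SGD
mini_batch = rng.choice(len(X), size=32, replace=False)  # 1 < B < |x|: mini-batch SGD

theta = sgd_step(theta, X, y, mini_batch, lr=0.01)
```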

SGD converges faster than normal "batch" gradient descent because it updates the weights after looking at a randomly selected subset of the training set. Let $x$ be our training set and let $m \subseteq x$. The batch size $B$ is just the cardinality of $m$: $B = |m|$.

Batch gradient descent updates the weights $\theta$ using the gradients of the entire dataset $x$, whereas SGD updates the weights using an average of the gradients for a mini-batch $m$. (Using the average rather than the sum prevents the algorithm from taking steps that are too large when the dataset is very large; otherwise you would need to adjust the learning rate based on the size of the dataset.) The expected value of this stochastic approximation of the gradient used in SGD is equal to the deterministic gradient used in batch gradient descent: $\mathbb{E}\left[\nabla L_{\mathrm{SGD}}(\theta, m)\right] = \nabla L(\theta, x)$.
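That unbiasedness claim is easy to check numerically; the sketch below (again assuming the illustrative linear-model, squared-error setup) compares the full-data gradient with the average of many random mini-batch gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + 0.1 * rng.normal(size=2000)
theta = rng.normal(size=4)

def avg_grad(theta, Xs, ys):
    # Average gradient of 0.5 * (x.theta - y)^2 over the given examples.
    return (Xs @ theta - ys) @ Xs / len(ys)

full_gradient = avg_grad(theta, X, y)

# Average the mini-batch gradient estimate over many random batches of size B.
B = 32
estimates = []
for _ in range(5000):
    idx = rng.choice(len(X), size=B, replace=False)
    estimates.append(avg_grad(theta, X[idx], y[idx]))

print(full_gradient)                  # deterministic full-batch gradient
print(np.mean(estimates, axis=0))     # should be very close to the line above
```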

Each randomly selected subset we use for a weight update is called a mini-batch. Each full pass through the entire dataset is called an epoch.

Let's say that we have a data vector $x \in \mathbb{R}^D$, an initial weight vector that parameterizes our neural network, $\theta^0 \in \mathbb{R}^S$, and a loss function $L(\theta, x) : \mathbb{R}^S \times \mathbb{R}^D \to \mathbb{R}$ that we are trying to minimize. If we have $T$ training examples and a batch size of $B$, then we can split those training examples into $C$ mini-batches:

$$C = T / B$$

For simplicity we can assume that $T$ is evenly divisible by $B$. When this is not the case, as it often is not, each mini-batch should be given a weight proportional to its size.
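For example, here is a small sketch of such a split, with the ragged final mini-batch weighted by its size (the concrete numbers are arbitrary):

```python
import numpy as np

T, B = 1003, 64                      # T is deliberately not divisible by B
indices = np.arange(T)
batches = [indices[i:i + B] for i in range(0, T, B)]

C = len(batches)                     # C = ceil(T / B) = 16 mini-batches
print(C, [len(b) for b in batches[-2:]])   # the last batch has only T % B = 43 examples

# Weight each mini-batch's (already averaged) gradient by its relative size
# so that every example contributes equally over the epoch.
weights = [len(b) / B for b in batches]
```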

An iterative algorithm for SGD with $M$ epochs is given below:

$$
\begin{aligned}
& t \leftarrow 0 \\
& \textbf{while } t < M: \\
& \qquad \theta_{t+1} \leftarrow \theta_{t} - \epsilon(t)\,\frac{1}{B}\sum_{b=0}^{B-1}\frac{\partial L(\theta, m_b)}{\partial \theta} \\
& \qquad t \leftarrow t + 1
\end{aligned}
$$
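Here is a runnable numpy sketch of that procedure, once more assuming the illustrative linear-model, squared-error setup (the algorithm itself applies to any differentiable loss). The pseudocode above writes one generic update per iteration of $t$; this sketch loops over all the mini-batches within each epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=1024)

def minibatch_grad(theta, Xb, yb):
    # Average gradient of 0.5 * (x.theta - y)^2 over the mini-batch.
    return (Xb @ theta - yb) @ Xb / len(yb)

def epsilon(t, base_lr=0.1):
    # Learning-rate schedule: a simple 1/(1 + t) decay (an arbitrary choice).
    return base_lr / (1.0 + t)

theta = np.zeros(5)
M, B = 20, 64                        # M epochs, batch size B
T = len(X)

t = 0
while t < M:
    perm = rng.permutation(T)        # shuffle the example order each epoch
    for start in range(0, T, B):
        idx = perm[start:start + B]
        theta = theta - epsilon(t) * minibatch_grad(theta, X[idx], y[idx])
    t += 1

print(np.round(theta, 2))            # should end up close to true_theta
```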

Note: in real life we read these training examples from memory and, due to cache pre-fetching and other memory tricks your computer performs, your algorithm will run faster if the memory accesses are coalesced, i.e. when you read memory in order rather than jumping around randomly. So most SGD implementations shuffle the dataset and then load the examples into memory in the order in which they'll be read.
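A minimal sketch of that pattern, under the same assumed numpy setup: shuffle (and physically reorder) the data once, then read it in contiguous slices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 5))
y = rng.normal(size=1024)

# Reorder the arrays once so later reads are sequential and contiguous,
# instead of gathering randomly chosen rows on every step.
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm].copy(), y[perm].copy()

B = 64
for start in range(0, len(X_shuffled), B):
    Xb = X_shuffled[start:start + B]   # contiguous, cache-friendly slice
    yb = y_shuffled[start:start + B]
    pass                               # compute the gradient and update the weights here
```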

The major parameters for the vanilla (no momentum) SGD described above are:

  1. Learning Rate: $\epsilon$

I like to think of epsilon as a function from the epoch count to a learning rate. This function is called the learning rate schedule.

$$\epsilon(t) : \mathbb{N} \to \mathbb{R}$$

If you want to have the learning rate fixed, just define epsilon as a constant function.
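For instance, here is a small sketch of a step-decay schedule next to a constant one; the base rate, decay factor, and interval are arbitrary illustrative choices.

```python
def step_decay_schedule(t, base_lr=0.1, drop=0.5, every=10):
    # epsilon(t): halve the base learning rate every 10 epochs.
    return base_lr * (drop ** (t // every))

def constant_schedule(t, lr=0.01):
    # A fixed learning rate is just a constant function of the epoch count.
    return lr

print([step_decay_schedule(t) for t in (0, 9, 10, 25)])   # [0.1, 0.1, 0.05, 0.025]
print(constant_schedule(42))                              # 0.01
```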

  2. Batch Size

Batch size determines how many examples you look at before making a weight update. The lower it is, the noisier the training signal will be; the higher it is, the longer it will take to compute the gradient for each step.

Citations & Further Reading:

  1. Introduction to Gradient Based Learning
  2. Practical recommendations for gradient-based training of deep architectures
  3. Efficient Mini-batch Training for Stochastic Optimization