Class2-Week2 Optimization Algorithm

Mini-batch Gradient Descent

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

  • (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
        
  • Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only 1 training example before updating the parameters. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:
[Figure]

**SGD vs GD**
"+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

Note also that implementing SGD requires 3 for-loops in total:

  1. Over the number of iterations
  2. Over the $m$ training examples
  3. Over the layers (to update all parameters, from $(W^{[1]}, b^{[1]})$ to $(W^{[L]}, b^{[L]})$)

In practice, you’ll often get faster results if you use neither the whole training set nor only one training example to perform each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples.
[Figure]

**SGD vs Mini-Batch GD**
"+" denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.
**What you should remember**:
  • The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
  • You have to tune a learning rate hyperparameter $\alpha$.
  • With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms both gradient descent and stochastic gradient descent (particularly when the training set is large).
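To make the mini-batch loop concrete, here is a minimal, self-contained sketch on a toy linear regression problem rather than on the neural network used in these notes. The function name `minibatch_gd_linear`, the toy data and the hyperparameter values are all assumptions made for illustration. Setting `batch_size=1` would give stochastic gradient descent, and setting `batch_size=m` would give batch gradient descent.

```python
import numpy as np

def minibatch_gd_linear(X, Y, batch_size=32, learning_rate=0.1, num_epochs=200, seed=0):
    """Fit y = w.x + b by mini-batch gradient descent on the mean squared error.

    X has shape (n_features, m) and Y has shape (1, m), one example per column.
    This is a hypothetical helper, not part of the original notes.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros((1, n))
    b = 0.0
    for epoch in range(num_epochs):
        perm = rng.permutation(m)               # reshuffle the examples every epoch
        for start in range(0, m, batch_size):   # loop over the mini-batches
            idx = perm[start:start + batch_size]
            Xb, Yb = X[:, idx], Y[:, idx]
            err = w @ Xb + b - Yb                # prediction error, shape (1, batch)
            dw = err @ Xb.T / Xb.shape[1]        # gradient of 0.5 * mean squared error
            db = err.mean()
            w -= learning_rate * dw              # one update per mini-batch
            b -= learning_rate * db
    return w, b

# Noise-free toy data generated from y = 3*x1 - 2*x2 + 1
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2, 200))
Y_demo = np.array([[3.0, -2.0]]) @ X_demo + 1.0
w_fit, b_fit = minibatch_gd_linear(X_demo, Y_demo)
```

The only thing that changes between the three variants is how many columns of `X` go into each update, which is the point of the first bullet above.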

Implement

There are two steps to get mini-batches:

  • Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.
    [Figure]
  • Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size. The last mini-batch might be smaller, but you don’t need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:
    [Figure]
import math
import numpy as np

def randomMiniBatches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
        X -- input data, of shape (input size, number of examples)
        Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
        mini_batch_size -- size of the mini-batches, integer
        seed -- seed for NumPy's random number generator, so that the shuffle is reproducible
    
    Returns:
        mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []
    
    permutation = list(np.random.permutation(m))
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    num_complete_minbatches = math.floor(m / mini_batch_size)    
    for i in range(num_complete_minbatches):
        X_mini_batch = X_shuffled[:, i*(mini_batch_size) : (i+1)*mini_batch_size]
        Y_mini_batch = Y_shuffled[:, i*(mini_batch_size) : (i+1)*mini_batch_size]
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)
    
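    # Handle the last, smaller mini-batch. Slicing from num_complete_minbatches * mini_batch_size
    # avoids relying on the loop variable i, which is undefined when m < mini_batch_size.
    if m % mini_batch_size != 0:
        X_mini_batch = X_shuffled[:, num_complete_minbatches*mini_batch_size : m]
        Y_mini_batch = Y_shuffled[:, num_complete_minbatches*mini_batch_size : m]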
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)
    return mini_batches
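
The partition arithmetic can be checked in isolation. The sketch below only computes the sizes of the mini-batches for a given number of examples. The helper name `mini_batch_sizes` is hypothetical and is not part of the code above.

```python
import math

def mini_batch_sizes(m, mini_batch_size=64):
    """Return the list of mini-batch sizes obtained when partitioning m examples."""
    num_complete = math.floor(m / mini_batch_size)   # number of full mini-batches
    sizes = [mini_batch_size] * num_complete
    remainder = m % mini_batch_size
    if remainder != 0:
        sizes.append(remainder)                      # the smaller final mini-batch
    return sizes

sizes_148 = mini_batch_sizes(148, 64)   # two full mini-batches plus 20 leftover examples
sizes_128 = mini_batch_sizes(128, 64)   # divides evenly, so there is no partial mini-batch
sizes_10 = mini_batch_sizes(10, 64)     # fewer examples than one full mini-batch
```

The last case, where there are fewer examples than one mini-batch, is the edge case to test whenever the final slice is computed from a loop index.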

Understanding mini-batch gradient descent

The figure below compares the paths taken by gradient descent for different values of mini_batch_size:
[Figure]

  • If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
  • If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
  • If the mini-batch size is in between (recommended: 64, 128, 256 or 512), you keep the benefits of vectorization and still make progress without needing to wait until you have processed the entire training set.

Exponentially Weighted Average

  • Formula
    $v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$, where $\theta_{t}$ is the value observed at step $t$ and $v_{0} = 0$.

  • Bias correction (this makes the estimate more accurate, especially during the initial phase)
    $v_{t}^{corrected} = \frac{\beta v_{t-1} + (1-\beta)\theta_{t}}{1-\beta^{t}} = \frac{v_{t}}{1-\beta^{t}}$
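
A small sketch may help show why bias correction matters during the first few steps. It assumes the running average starts at zero and that a new observation arrives at each step. The observation is called `theta` in the code, and both that name and the function name are illustrative choices.

```python
def exponentially_weighted_average(values, beta=0.9, bias_correction=True):
    """Return the running exponentially weighted average after each observation."""
    v = 0.0                                   # the average is initialized to zero
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta     # uncorrected running average
        if bias_correction:
            averages.append(v / (1 - beta ** t))
        else:
            averages.append(v)
    return averages

constant_signal = [10.0] * 5
raw = exponentially_weighted_average(constant_signal, beta=0.9, bias_correction=False)
corrected = exponentially_weighted_average(constant_signal, beta=0.9, bias_correction=True)
```

For a constant signal of 10, the uncorrected average starts near 1 and climbs slowly, while the corrected average is about 10 from the first step, since the factor $1-\beta^{t}$ cancels the effect of the zero initialization.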


Gradient Descent with Momentum

Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps.

“If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So, in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas, on the horizontal direction, all the derivatives are pointing to the right of the horizontal direction, so the average in the horizontal direction will still be pretty big. So that’s why with this algorithm, with a few iterations you find that the gradient descent with momentum ends up eventually just taking steps that are much smaller oscillations in the vertical direction, but are more directed to just moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in this path to the minimum.”

[Figure]
The red arrows show the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient (with respect to the current mini-batch) on each step. Rather than just following the gradient, we let the gradient influence $v$ and then take a step in the direction of $v$.

  • Formula
    $$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}$$

    $$\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}$$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary.

def initializeVelocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
        parameters -- python dictionary containing your parameters.
                        parameters['W' + str(l)] = Wl
                        parameters['b' + str(l)] = bl
    
    Returns:
        v -- python dictionary containing the current velocity.
                        v['dW' + str(l)] = velocity of dWl
                        v['db' + str(l)] = velocity of dbl
    """

    L = len(parameters) // 2
    v = {}
    for l in range(L):
        v["dW"+str(l+1)] = np.zeros((parameters["W"+str(l+1)].shape))
        v["db"+str(l+1)] = np.zeros((parameters["b"+str(l+1)].shape))
    return v

def updateParametersWithMomentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
        parameters -- python dictionary containing your parameters:
                        parameters['W' + str(l)] = Wl
                        parameters['b' + str(l)] = bl
        grads -- python dictionary containing your gradients for each parameters:
                        grads['dW' + str(l)] = dWl
                        grads['db' + str(l)] = dbl
        v -- python dictionary containing the current velocity:
                        v['dW' + str(l)] = ...
                        v['db' + str(l)] = ...
        beta -- the momentum hyperparameter, scalar
        learning_rate -- the learning rate, scalar
    
    Returns:
        parameters -- python dictionary containing your updated parameters 
        v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2

    for l in range(L):
        v["dW"+str(l+1)] = beta * v["dW"+str(l+1)] + (1-beta) * grads["dW"+str(l+1)]
        v["db"+str(l+1)] = beta * v["db"+str(l+1)] + (1-beta) * grads["db"+str(l+1)]

        parameters["W"+str(l+1)] = parameters["W"+str(l+1)] - learning_rate * v["dW"+str(l+1)]
        parameters["b"+str(l+1)] = parameters["b"+str(l+1)] - learning_rate * v["db"+str(l+1)]
    return parameters, v

Note that:

  • The velocity is initialized with zeros. So the algorithm will take a few iterations to “build up” velocity and start to take bigger steps.
  • If $\beta = 0$, then this just becomes standard gradient descent without momentum.
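
Both notes can be checked numerically on a single parameter array. The sketch below uses a hypothetical `momentum_step` helper that applies the same formula as above to one array instead of a dictionary of layers. The numbers are arbitrary.

```python
import numpy as np

def momentum_step(w, dw, v, beta, learning_rate):
    """One momentum update for a single parameter array (illustrative helper)."""
    v = beta * v + (1 - beta) * dw      # exponentially weighted average of gradients
    w = w - learning_rate * v           # step in the direction of the velocity
    return w, v

w0 = np.array([1.0, -2.0])
dw0 = np.array([0.5, 4.0])
v0 = np.zeros_like(w0)                  # velocity initialized with zeros

w_plain = w0 - 0.1 * dw0                                              # standard gradient descent step
w_beta0, _ = momentum_step(w0, dw0, v0, beta=0.0, learning_rate=0.1)
w_beta09, v_beta09 = momentum_step(w0, dw0, v0, beta=0.9, learning_rate=0.1)
```

With `beta=0.0` the step matches plain gradient descent. With `beta=0.9` and zero initial velocity, the first step is only a tenth as large, which is the "build up" effect described in the first note.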

How do you choose $\beta$?

  • The larger the momentum $\beta$ is, the smoother the update, because we take more of the past gradients into account. But if $\beta$ is too big, it could also smooth out the updates too much.
  • Common values for $\beta$ range from 0.8 to 0.999. If you don’t feel inclined to tune this, $\beta = 0.9$ is often a reasonable default.
  • Tuning the optimal $\beta$ for your model might require trying several values to see what works best in terms of reducing the value of the cost function $J$.
**What you should remember**:
  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
  • You have to tune a momentum hyperparameter $\beta$ and a learning rate $\alpha$.

RMSProp (Root Mean Square Prop)

[Figure]

In the figure above (the ellipses are contours of the cost function), we want to move quickly along the horizontal axis toward the minimum, while the oscillations in the vertical direction slow us down. So we want to increase the step size along the horizontal axis (call that parameter $w_1$) and decrease the step size along the vertical axis (call it $w_2$). We can achieve this as follows:

$$\begin{cases} s_{1} = \beta s_{1} + (1-\beta)dw_{1}^{2} \\ s_{2} = \beta s_{2} + (1-\beta)dw_{2}^{2} \\ w_{1} = w_{1} - \alpha \frac{dw_{1}}{\sqrt{s_{1} + \varepsilon}} \\ w_{2} = w_{2} - \alpha \frac{dw_{2}}{\sqrt{s_{2} + \varepsilon}} \end{cases}$$

According to the formula, $s_{1}$ and $s_{2}$ keep an exponentially weighted average of the squares of the derivatives. Because of the continuous oscillations, $dw_{2}$ is large, so $s_{2}$ will be a big number, while $s_{1}$ will be relatively small. Dividing by $\sqrt{s_{1} + \varepsilon}$ therefore makes the update in the horizontal direction relatively larger, and dividing by $\sqrt{s_{2} + \varepsilon}$ makes the update in the vertical direction relatively smaller. With RMSprop, we can thus adjust the step size in each dimension to speed up convergence.
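
The rescaling effect can be seen in a few lines of NumPy. This is a minimal sketch of the update above for a two-element parameter vector. The helper name `rmsprop_step`, the default hyperparameters and the example gradient are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(w, dw, s, learning_rate=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop update for a single parameter array (illustrative helper)."""
    s = beta * s + (1 - beta) * dw ** 2                  # running average of squared gradients
    w = w - learning_rate * dw / np.sqrt(s + epsilon)    # divide each coordinate by its own scale
    return w, s

w = np.array([1.0, 1.0])
dw = np.array([0.1, 10.0])    # small gradient along w1, large gradient along w2
s = np.zeros(2)

w_new, s_new = rmsprop_step(w, dw, s)
step = w - w_new              # how far each coordinate actually moved
```

Although the two gradients differ by a factor of 100, the two coordinates move by almost the same amount, because each gradient is divided by the square root of its own running average.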


Adam (Adaptive Moment Estimation)

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.

How does Adam work?

  1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
  3. It updates parameters in a direction based on combining information from “1” and “2”.

The update rule is, for $l = 1, ..., L$:

$$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$$
where:

  • $t$ counts the number of Adam steps taken so far
  • $L$ is the number of layers
  • $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
  • $\alpha$ is the learning rate
  • $\varepsilon$ is a very small number to avoid dividing by zero

The same update rule applies to $b^{[l]}$.

We usually use “default” values for the hyperparameters $\beta_{1}$, $\beta_{2}$ and $\varepsilon$ in Adam ($\beta_{1}=0.9$, $\beta_{2}=0.999$, $\varepsilon=10^{-8}$).

def initializeAdam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
        parameters -- python dictionary containing your parameters.
                        parameters["W" + str(l)] = Wl
                        parameters["b" + str(l)] = bl
    
    Returns: 
        v -- python dictionary that will contain the exponentially weighted average of the gradient.
                        v["dW" + str(l)] = ...
                        v["db" + str(l)] = ...
        s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                        s["dW" + str(l)] = ...
                        s["db" + str(l)] = ...

    """

    L = len(parameters) // 2
    v = {}
    s = {}
    for l in range(L):
        v["dW"+str(l+1)] = np.zeros((parameters["W"+str(l+1)].shape))
        v["db"+str(l+1)] = np.zeros((parameters["b"+str(l+1)].shape))
        s["dW"+str(l+1)] = np.zeros((parameters["W"+str(l+1)].shape))
        s["db"+str(l+1)] = np.zeros((parameters["b"+str(l+1)].shape))
    return v, s


def updateParametersWithAdam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam
    
    Arguments:
        parameters -- python dictionary containing your parameters:
                        parameters['W' + str(l)] = Wl
                        parameters['b' + str(l)] = bl
        grads -- python dictionary containing your gradients for each parameters:
                        grads['dW' + str(l)] = dWl
                        grads['db' + str(l)] = dbl
        v -- Adam variable, moving average of the first gradient, python dictionary
        s -- Adam variable, moving average of the squared gradient, python dictionary
        t -- number of Adam steps taken so far, starting at 1 (used for bias correction)
        learning_rate -- the learning rate, scalar.
        beta1 -- Exponential decay hyperparameter for the first moment estimates 
        beta2 -- Exponential decay hyperparameter for the second moment estimates 
        epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
        parameters -- python dictionary containing your updated parameters 
        v -- Adam variable, moving average of the first gradient, python dictionary
        s -- Adam variable, moving average of the squared gradient, python dictionary
    """

    L = len(parameters) // 2
    v_correct = {}
    s_correct = {}

    for l in range(L):
        v["dW"+str(l+1)] = beta1 * v["dW"+str(l+1)] + (1-beta1) * grads["dW"+str(l+1)]
        v["db"+str(l+1)] = beta1 * v["db"+str(l+1)] + (1-beta1) * grads["db"+str(l+1)]
        v_correct["dW"+str(l+1)] = v["dW"+str(l+1)] / (1-np.power(beta1, t))
        v_correct["db"+str(l+1)] = v["db"+str(l+1)] / (1-np.power(beta1, t))

        s["dW"+str(l+1)] = beta2 * s["dW"+str(l+1)] + (1-beta2) * np.power(grads["dW"+str(l+1)], 2)
        s["db"+str(l+1)] = beta2 * s["db"+str(l+1)] + (1-beta2) * np.power(grads["db"+str(l+1)], 2)
        s_correct["dW"+str(l+1)] = s["dW"+str(l+1)] / (1-np.power(beta2, t))
        s_correct["db"+str(l+1)] = s["db"+str(l+1)] / (1-np.power(beta2, t))

        parameters["W"+str(l+1)] = parameters["W"+str(l+1)] - learning_rate * v_correct["dW"+str(l+1)] / (np.sqrt(s_correct["dW"+str(l+1)]) + epsilon)
        parameters["b"+str(l+1)] = parameters["b"+str(l+1)] - learning_rate * v_correct["db"+str(l+1)] / (np.sqrt(s_correct["db"+str(l+1)]) + epsilon)
    return parameters, v, s
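
To see what bias correction does on the very first step, here is a single-array sketch of the same update. The helper name `adam_step` and the example gradient are hypothetical. The counter `t` must start at 1, since `t = 0` would make the correction terms divide by zero.

```python
import numpy as np

def adam_step(w, dw, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter array (illustrative helper)."""
    v = beta1 * v + (1 - beta1) * dw          # first moment estimate
    s = beta2 * s + (1 - beta2) * dw ** 2     # second moment estimate
    v_corrected = v / (1 - beta1 ** t)        # bias correction uses beta1
    s_corrected = s / (1 - beta2 ** t)        # bias correction uses beta2
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

w_start = np.array([1.0, 1.0])
grad = np.array([0.5, -2.0])
w_next, v_next, s_next = adam_step(w_start, grad, np.zeros(2), np.zeros(2), t=1)
first_step = w_start - w_next
```

At `t = 1` the corrected moments are the gradient and its square, so every coordinate moves by roughly `learning_rate` in the direction opposite to its gradient, whatever the gradient's magnitude.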

Summary:

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.

Adam, on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • Usually works well even with little tuning of hyperparameters (except $\alpha$)

Learning Rate Decay

One of the things that might help speed up a learning algorithm is to slowly reduce the learning rate over time. We call this learning rate decay.
[Figure]

Suppose we’re implementing mini-batch gradient descent with a reasonably small mini-batch, maybe just 64 or 128 examples. Then as we iterate, our steps will be a little bit noisy. The algorithm will tend toward the minimum, but it won’t exactly converge. It might just end up wandering around, and never really converge, because we are using a fixed value for $\alpha$.

If we were to slowly reduce the learning rate $\alpha$, then during the initial phases, while the learning rate $\alpha$ is still large, we can still have relatively fast learning. But then as $\alpha$ gets smaller, the steps will be slower and smaller. And so we end up oscillating in a tighter region around this minimum, rather than wandering far away, even as training goes on and on.

There are some formulas we can use to decrease the learning rate step by step, where $\alpha_{0}$ is the initial learning rate:

$$\alpha = \frac{1}{1 + decay\_rate \times epoch\_num} \alpha_{0}$$

$$\alpha = 0.95^{epoch\_num} \alpha_{0}$$

$$\alpha = \frac{k}{\sqrt{epoch\_num}} \alpha_{0}$$
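
The three schedules can be written as short functions. This is a sketch that assumes each schedule scales an initial learning rate `alpha0`. The function names and the example values are illustrative. The square-root schedule is only defined for `epoch_num >= 1`.

```python
import math

def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base ** epoch_num * alpha0"""
    return (base ** epoch_num) * alpha0

def inverse_sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1"""
    return k / math.sqrt(epoch_num) * alpha0

# Learning rate over the first four epochs with alpha0 = 0.2 and decay_rate = 1
schedule = [inverse_time_decay(0.2, 1.0, epoch) for epoch in range(4)]
```

With these values the learning rate goes from 0.2 at epoch 0 to 0.1 at epoch 1 and 0.05 at epoch 3, so it shrinks quickly at first and more slowly later on.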


Saddle Point

[Figure]
In this picture, it looks like there are a lot of local optima in all those places. And it’d be easy for gradient descent, or one of the other algorithms, to get stuck in a local optimum rather than find its way to a global optimum. It turns out that if you are plotting a figure like this in two dimensions, then it’s easy to create plots like this with a lot of different local optima. And these very low-dimensional plots used to guide our intuition. But this intuition isn’t actually correct. It turns out that if you create a neural network, most points of zero gradient are not local optima like these. Instead, most points of zero gradient in a cost function are saddle points.

Informally, for a function on a very high-dimensional space, if the gradient is zero, then in each direction it can be either a convex-like function or a concave-like function. And if you are in, say, a $20{,}000$-dimensional space, then for the point to be a local optimum, all $20{,}000$ directions need to bend the same way. And so the chance of that happening is very small, maybe $2^{-20{,}000}$. Instead, you’re much more likely to get some directions where the curve bends up, as well as some directions where the curve bends down, rather than have them all bend upwards. So that’s why in very high-dimensional spaces you’re actually much more likely to run into a saddle point, like the one shown on the right, than a local optimum.

In mathematics, a saddle point or minimax point is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function. (See also: Maxima, minima, and saddle points.)
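
A standard textbook example of a saddle point is $f(x, y) = x^2 - y^2$ at the origin. The sketch below is not taken from the notes. It checks that the gradient there is zero while the curvature has opposite signs in the two directions.

```python
import numpy as np

def f(x, y):
    """Classic saddle surface: bends up along x and down along y."""
    return x ** 2 - y ** 2

def grad_f(x, y):
    """Analytic gradient of f."""
    return np.array([2.0 * x, -2.0 * y])

# The Hessian of f is constant: second derivatives are 2 along x and -2 along y
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

gradient_at_origin = grad_f(0.0, 0.0)
eigenvalues = np.linalg.eigvalsh(hessian)   # eigenvalues of a symmetric matrix, ascending
```

One negative and one positive eigenvalue mean that the origin is a critical point that is neither a minimum nor a maximum. In a space with many dimensions, a critical point needs every eigenvalue to share the same sign to be a local optimum, which is why saddle points are the more likely case.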
