06 - (Course 2) Week 2 Assignment: Optimization Algorithms


1. Gradient Descent

A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all m examples on each step, it is also called batch gradient descent.

Warm-up exercise: Implement the gradient descent update rule. The gradient descent rule is, for l=1,...,L:

$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{1}$$

$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{2}$$

where $L$ is the number of layers and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are $W^{[1]}$ and $b^{[1]}$, so you need to shift l to l+1 when coding.
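For reference, here is a minimal sketch of such an update function (the assignment names it update_parameters_with_gd; this sketch simply follows the dictionary layout and the l → l+1 shift described above, so treat the exact signature as an assumption):

def update_parameters_with_gd(parameters, grads, learning_rate):
    """One gradient-descent step on every layer's W and b."""
    L = len(parameters) // 2  # number of layers in the network

    for l in range(L):
        # Shift l to l+1 because the parameters are named W1, b1, ..., WL, bL
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

    return parameters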

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

  • (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
  • Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation 
        a, caches = forward_propagation(X[:, j], parameters)  # the slice selects the single training example in column j
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

(Figure: SGD vs. (batch) gradient descent)
Note also that implementing SGD requires 3 for-loops in total:

  1. Over the number of iterations
  2. Over the m training examples
  3. Over the layers (to update all parameters, from $(W^{[1]}, b^{[1]})$ to $(W^{[L]}, b^{[L]})$)

(Figure: SGD vs. mini-batch gradient descent)
What you should remember:

  • The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step (the short calculation below makes this concrete).
  • You have to tune a learning rate hyperparameter α.
  • With a well-tuned mini-batch size, mini-batch gradient descent usually outperforms both gradient descent and stochastic gradient descent, particularly when the training set is large.
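To make the first point concrete, here is a quick back-of-the-envelope count of how many parameter updates each variant performs in one pass over the data (the values of m and the batch size are arbitrary examples):

import math

m = 50000              # example training-set size (arbitrary)
mini_batch_size = 64   # example mini-batch size (arbitrary)

updates_batch_gd = 1                                  # one update per pass over all m examples
updates_mini_batch = math.ceil(m / mini_batch_size)   # one update per mini-batch
updates_sgd = m                                       # one update per single example

print(updates_batch_gd, updates_mini_batch, updates_sgd)  # 1 782 50000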

2. Mini-Batch Gradient Descent

Let’s learn how to build mini-batches from the training set (X, Y).

There are two steps:

  • Shuffle: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. Note that the random shuffling is done synchronously between X and Y, so that after the shuffling the i-th column of X is the example corresponding to the i-th label in Y. The shuffling step ensures that examples will be split randomly into different mini-batches.
    (Figure: shuffling X and Y synchronously)
  • Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size (here 64). Note that the number of training examples is not always divisible by mini_batch_size, so the last mini-batch might be smaller, but you don’t need to worry about this. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:
    (Figure: partitioning into mini-batches, with a smaller final batch)
    Exercise: Implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, we give you the following code that selects the indexes for the 1st and 2nd mini-batches:
first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
...

Note that the last mini-batch might end up smaller than mini_batch_size = 64. Let $\lfloor s \rfloor$ denote $s$ rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples $m$ is not a multiple of mini_batch_size = 64, then there will be $\lfloor \frac{m}{\text{mini\_batch\_size}} \rfloor$ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be $m - \text{mini\_batch\_size} \times \lfloor \frac{m}{\text{mini\_batch\_size}} \rfloor$.
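For example, with arbitrary numbers m = 148 and mini_batch_size = 64:

import math

m, mini_batch_size = 148, 64                       # arbitrary example values
num_complete = math.floor(m / mini_batch_size)     # 2 full mini-batches of 64 examples
last_size = m - mini_batch_size * num_complete     # 148 - 128 = 20 examples in the final mini-batch
print(num_complete, last_size)                     # 2 20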

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))  # np.random.permutation(m) returns a shuffled arange(m); cast to a list of column indices
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
    for k in range(0, num_complete_minibatches):
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:,k * mini_batch_size:(k + 1) * mini_batch_size]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        ### START CODE HERE ### (approx. 2 lines)
        mini_batch_X = shuffled_X[:,num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:,num_complete_minibatches * mini_batch_size:]
        ### END CODE HERE ###
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
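As a quick sanity check of this function on toy data (the shapes below are arbitrary), the first mini-batches have the full size and the last one holds the remainder:

import numpy as np

# Toy data: 5 features, 148 examples (arbitrary numbers chosen for illustration)
X_toy = np.random.randn(5, 148)
Y_toy = (np.random.randn(1, 148) > 0).astype(int)

mini_batches = random_mini_batches(X_toy, Y_toy, mini_batch_size=64, seed=0)
for mb_X, mb_Y in mini_batches:
    print(mb_X.shape, mb_Y.shape)   # (5, 64) (1, 64) twice, then (5, 20) (1, 20) for the last batch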

3. Momentum

Because mini-batch gradient descent makes a parameter update after seeing only a subset of examples, the direction of the update has some variance, so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.

Momentum takes the past gradients into account to smooth out the updates. We will store the “direction” of the previous gradients in the variable v. Formally, this will be the exponentially weighted average of the gradients on previous steps. You can also think of v as the “velocity” of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
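As a small illustration of the exponentially weighted average idea (the noisy “gradient” readings below are made up for demonstration), a larger β remembers more of the past and therefore produces a smoother sequence:

import numpy as np

np.random.seed(0)
noisy_gradients = np.random.randn(10) + 1.0   # made-up noisy readings fluctuating around 1.0

for beta in (0.5, 0.9):
    v = 0.0
    smoothed = []
    for g in noisy_gradients:
        v = beta * v + (1 - beta) * g         # exponentially weighted average of past readings
        smoothed.append(round(v, 2))
    print(f"beta={beta}: {smoothed}")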

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)]) # 返回一个与参数相同shape的0数组
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        ### END CODE HERE ###

    return v

Exercise: Now implement the parameter update with momentum. The momentum update rule is, for $l = 1, ..., L$:

$$\begin{cases} v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}} \end{cases} \tag{3}$$

$$\begin{cases} v_{db^{[l]}} = \beta\, v_{db^{[l]}} + (1-\beta)\, db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}} \end{cases} \tag{4}$$

where $L$ is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are $W^{[1]}$ and $b^{[1]}$ (that’s a “one” in the superscript), so you will need to shift l to l+1 when coding.

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters (all the W and b matrices):
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter (computed by backward propagation):
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity (must be initialized before the update):
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):

        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities (exponentially weighted average of the gradients, layer by layer)
        v["dW" + str(l+1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)]
        v["db" + str(l+1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
        ### END CODE HERE ###

    return parameters, v
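A tiny hand-checkable example (parameter and gradient values made up for illustration): starting from zero velocity with β = 0.9, the first velocity is simply 0.1 times the gradient:

import numpy as np

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}   # made-up values
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[0.1]])}
v = initialize_velocity(params)

params, v = update_parameters_with_momentum(params, grads, v, beta=0.9, learning_rate=0.1)
print(v["dW1"])       # 0.1 * dW1 = [[ 0.02 -0.04]]
print(params["W1"])   # W1 - 0.1 * v["dW1"] = [[0.998 2.004]]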

Note that:

  • The velocity is initialized with zeros, so the algorithm will take a few iterations to “build up” velocity and start to take bigger steps.
  • If β = 0, then this just becomes standard gradient descent without momentum.

How do you choose β ?

  • The larger the momentum β is, the smoother the update, because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
  • Common values for β range from 0.8 to 0.999. If you don’t feel inclined to tune this, β = 0.9 is often a reasonable default.
  • Tuning the optimal β for your model might need trying several values to see what works best in terms of reducing the value of the cost function J (a minimal sweep is sketched after this list).
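A minimal sketch of such a sweep, assuming a training helper like the model(...) function used later in this notebook accepts a beta argument, and reusing forward_propagation and compute_cost from the earlier pseudocode (the exact names, signature and defaults here are assumptions, not the assignment's code):

# Hypothetical beta sweep; model(...), forward_propagation and compute_cost are assumed helpers.
for beta in (0.8, 0.9, 0.95, 0.99):
    parameters = model(train_X, train_Y, layers_dims, optimizer="momentum", beta=beta)
    a, _ = forward_propagation(train_X, parameters)
    print(f"beta={beta}: final training cost = {compute_cost(a, train_Y):.4f}")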

4. Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.

How does Adam work?

  1. It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
  3. It updates parameters in a direction based on combining information from “1” and “2”.
  4. The update rule is, for l=1,...,L :

    $$\begin{cases}
    v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1-\beta_1)\, \frac{\partial J}{\partial W^{[l]}} \\
    v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1-(\beta_1)^t} \\
    s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1-\beta_2)\, \left(\frac{\partial J}{\partial W^{[l]}}\right)^2 \\
    s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1-(\beta_2)^t} \\
    W^{[l]} = W^{[l]} - \alpha\, \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon}
    \end{cases}$$

    where:

    • t counts the number of Adam update steps taken so far
    • L is the number of layers
    • β1 and β2 are hyperparameters that control the two exponentially weighted averages.
    • α is the learning rate
    • ε is a very small number to avoid dividing by zero

    As usual, we will store all parameters in the parameters dictionary.
    Initialization of v and s:

    def initialize_adam(parameters):
        """
        Initializes v and s as two python dictionaries with:
                    - keys: "dW1", "db1", ..., "dWL", "dbL" (the same keys as the gradients)
                    - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
        Arguments:
        parameters -- python dictionary containing your parameters.
                        parameters["W" + str(l)] = Wl
                        parameters["b" + str(l)] = bl
    
        Returns: 
        v -- python dictionary that will contain the exponentially weighted average of the gradient.
                        v["dW" + str(l)] = ...
                        v["db" + str(l)] = ...
        s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                        s["dW" + str(l)] = ...
                        s["db" + str(l)] = ...
    
        """
    
        L = len(parameters) // 2 # number of layers in the neural networks
        v = {}
        s = {}
    
        # Initialize v, s. Input: "parameters". Outputs: "v, s".
        for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
            v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l + 1)])
            v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l + 1)])
            s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l + 1)])
            s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l + 1)])
        ### END CODE HERE ###
    
        return v, s
    def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                    beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
        """
        Update parameters using Adam
    
        Arguments:
        parameters -- python dictionary containing your parameters:
                        parameters['W' + str(l)] = Wl
                        parameters['b' + str(l)] = bl
        grads -- python dictionary containing your gradients for each parameters:
                        grads['dW' + str(l)] = dWl
                        grads['db' + str(l)] = dbl
        v -- Adam variable, moving average of the first gradient, python dictionary
        s -- Adam variable, moving average of the squared gradient, python dictionary
        t -- Adam update step counter (used for bias correction)
        learning_rate -- the learning rate, scalar.
        beta1 -- Exponential decay hyperparameter for the first moment estimates 
        beta2 -- Exponential decay hyperparameter for the second moment estimates 
        epsilon -- hyperparameter preventing division by zero in Adam updates
    
        Returns:
        parameters -- python dictionary containing your updated parameters 
        v -- Adam variable, moving average of the first gradient, python dictionary
        s -- Adam variable, moving average of the squared gradient, python dictionary
        """
    
        L = len(parameters) // 2                 # number of layers in the neural networks
        v_corrected = {}                         # Initializing first moment estimate, python dictionary
        s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
        # Perform Adam update on all parameters
        for l in range(L):
            # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
            ### START CODE HERE ### (approx. 2 lines)
            v["dW" + str(l+1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads['dW' + str(l + 1)]
            v["db" + str(l+1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads['db' + str(l + 1)]
            ### END CODE HERE ###
    
            # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
            ### START CODE HERE ### (approx. 2 lines)
            v_corrected["dW" + str(l+1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
            v_corrected["db" + str(l+1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))
            ### END CODE HERE ###
    
            # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
            ### START CODE HERE ### (approx. 2 lines)
            s["dW" + str(l+1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads['dW' + str(l + 1)], 2)
            s["db" + str(l+1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads['db' + str(l + 1)], 2)
            ### END CODE HERE ###
    
            # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
            ### START CODE HERE ### (approx. 2 lines)
            s_corrected["dW" + str(l+1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
            s_corrected["db" + str(l+1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))
            ### END CODE HERE ###
    
            # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
            ### START CODE HERE ### (approx. 2 lines)
            parameters["W" + str(l+1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / np.sqrt(s["dW" + str(l + 1)] + epsilon)
            parameters["b" + str(l+1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / np.sqrt(s["db" + str(l + 1)] + epsilon)
            ### END CODE HERE ###
    
        return parameters, v, s
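A small smoke test on made-up parameters exercises one Adam step; on the first step, bias correction makes each parameter move by roughly learning_rate in the direction opposite the sign of its gradient:

import numpy as np

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}   # made-up values
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[0.1]])}
v, s = initialize_adam(params)

params, v, s = update_parameters_with_adam(params, grads, v, s, t=1, learning_rate=0.01)
print(params["W1"])   # approximately [[0.99 2.01]]
print(params["b1"])   # approximately [[0.49]]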

5. Training the Model with Different Optimization Algorithms

We have already implemented a 3-layer neural network. You will train it with the following optimizers (a sketch of how the training loop dispatches to these functions is shown after this list):

  • Mini-batch Gradient Descent: it will call your function:
    • update_parameters_with_gd()
  • Mini-batch Momentum: it will call your functions:
    • initialize_velocity() and update_parameters_with_momentum()
  • Mini-batch Adam: it will call your functions:
    • initialize_adam() and update_parameters_with_adam()
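The model function itself is provided by the assignment; the following is only a rough sketch of how such a training loop could dispatch to the three optimizers, reusing the helper names from the pseudocode above (the exact signature, defaults, and the initial seed value are assumptions):

def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64,
          beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000):
    """Sketch of a mini-batch training loop with pluggable optimizers."""
    parameters = initialize_parameters(layers_dims)
    seed = 10
    t = 0                                    # Adam step counter (for bias correction)

    # Initialize the optimizer state
    if optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    for i in range(num_epochs):
        seed = seed + 1                      # reshuffle the dataset differently at each epoch
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch_X, minibatch_Y in minibatches:
            a, caches = forward_propagation(minibatch_X, parameters)
            cost = compute_cost(a, minibatch_Y)
            grads = backward_propagation(a, caches, parameters)

            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1                    # bias correction uses the running step count
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                               learning_rate, beta1, beta2, epsilon)

    return parameters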

5.1 - Mini-batch Gradient Descent

    # train 3-layer model
    layers_dims = [train_X.shape[0], 5, 2, 1]
    parameters = model(train_X, train_Y, layers_dims, optimizer = "gd")
    
    # Predict
    predictions = predict(train_X, train_Y, parameters)
    
    # Plot decision boundary
    plt.title("Model with Gradient Descent optimization")
    axes = plt.gca()
    axes.set_xlim([-1.5,2.5])
    axes.set_ylim([-1,1.5])
    plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)


    # Define the random minibatches. We increment the seed to reshuffle the dataset differently after each epoch
            # seed = seed + 1  # re-shuffles and re-partitions the dataset after each epoch
            minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

If the data is not reshuffled after each epoch, the resulting cost curve does not oscillate, as shown:
(Figure: cost curve without per-epoch reshuffling)

5.2 - Mini-batch Gradient Descent with Momentum

(Figure: results with momentum)

5.3 - Mini-batch with Adam

(Figure: results with Adam)
You can see that, for the same number of epochs, Adam learns much better!

5.4 - Summary

Momentum usually helps, but given the small learning rate (with a small learning rate the parameter updates are slow to begin with) and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.

Adam, on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • Usually works well even with little tuning of hyperparameters (except α)