Andrew Ng's Deep Learning, Course 2, Week 2: Notes and Programming Assignment

Latest recommended article published on 2024-05-23 20:49:54

Giraffeee_


Category: Andrew Ng Deep Learning. Tags: deep learning, machine learning, python

Article link: https://blog.csdn.net/m0_52370089/article/details/129858347


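Notes

Mini-batch

The mini-batch size is one of a neural network's hyperparameters.

When the dataset is very large, training slows down because one full pass over the training set takes a long time. Instead, we can:

- shuffle the training set randomly and split it into equally sized mini-batches (the last mini-batch may have fewer examples than the others, which is fine);

- run one forward prop and one backward prop per mini-batch, i.e. update the parameters once for every mini-batch processed;

- count one complete pass over the training set as one epoch, and train for multiple epochs until the cost is acceptably low (ideally converging toward the global minimum).

Note: the mini-batch size is usually a power of 2, such as 64, 128, or 512, and a mini-batch must fit in CPU/GPU memory.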

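Mini-batch GD vs. SGD vs. GD

- When mini-batch size = training set size, the method is called (batch) gradient descent, i.e. plain gradient descent.

Drawback: when the training set is large (many examples), a single parameter update takes a long time, which slows training.

Example code: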

# Batch Gradient Descent
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

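- When mini-batch size = 1, the method is called stochastic gradient descent (SGD).

Drawbacks: it may never settle at the minimum and instead keep oscillating around it, and it loses the speed-up that vectorization provides.

Example code: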

# Stochastic Gradient Descent
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

The following figures compare the methods:

(1) SGD vs. GD

"+" denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).

(2) SGD vs. mini-batch GD

"+" denotes a minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization. With a well-tuned mini-batch size, it usually outperforms both gradient descent and stochastic gradient descent (particularly when the training set is large).
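The three regimes differ only in the mini-batch size, which a small self-contained experiment can make concrete. The sketch below is an illustration rather than course code: it fits a linear least-squares model instead of a neural network, stores examples as rows (shape `(m, n)`) rather than as columns, and the function name `train_linear`, the synthetic data, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

def train_linear(X, y, batch_size, num_epochs=20, learning_rate=0.05, seed=0):
    """Fit y ~ X @ w by gradient descent with the given mini-batch size.

    X has shape (m, n), y has shape (m,). Returns the weights and the
    full-training-set mean squared error recorded after each epoch.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    history = []
    for _ in range(num_epochs):
        order = rng.permutation(m)               # reshuffle once per epoch
        for start in range(0, m, batch_size):    # the last batch may be smaller
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= learning_rate * grad            # one update per mini-batch
        history.append(float(np.mean((X @ w - y) ** 2)))
    return w, history

# Synthetic data (hypothetical): y = 3*x0 - 2*x1 plus a little noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(512, 2))
y = X @ np.array([3.0, -2.0]) + 0.01 * data_rng.normal(size=512)

for name, size in [("batch GD", 512), ("mini-batch GD", 64), ("SGD", 1)]:
    w, history = train_linear(X, y, batch_size=size)
    updates = int(np.ceil(512 / size))
    print(f"{name}: {updates} updates per epoch, final MSE {history[-1]:.6f}")
```

With `batch_size=512` every epoch performs a single update, with `batch_size=1` it performs 512 non-vectorized updates, and 64 sits in between.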

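Exponentially weighted averages

An exponentially weighted average computes a local (running) average that describes the trend of a series of values.

Main steps:

$V_\theta$ = 0

Repeat{

Get next $\theta_t$

$V_\theta := \beta V_\theta + (1-\beta)\theta_t$

}

Taking Andrew Ng's temperature example, V is the local average and θ is the temperature on a given day.

Figure: scatter plot of temperature against day

The red line shows the temperature trend for β = 0.9

The green line shows the temperature trend for β = 0.98

The yellow line shows the temperature trend for β = 0.5

$V_\theta$ can be viewed as roughly the average over the last $\frac{1}{1-\beta}$ days: the red line is about a 10-day average, the green line a 50-day average, and the yellow line a 2-day average. The green line is smoother than the red one, but it is shifted to the right, i.e. it lags behind. The yellow line fluctuates more than the red one: it tracks the temperature trend more promptly and adapts faster to changes, but it is also noisier.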
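The recurrence can be sketched in a few lines of plain Python. The function name `exponentially_weighted_average` and the temperature readings are made up for illustration.

```python
def exponentially_weighted_average(thetas, beta):
    """Return the running averages v_1..v_T for the observations in `thetas`."""
    v = 0.0                      # V is initialised to 0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

temperatures = [10.0, 12.0, 11.0, 15.0, 14.0]   # hypothetical readings
for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)
    print(f"beta={beta} (~{window:.0f}-day average):",
          exponentially_weighted_average(temperatures, beta))
```

Only the previous value of `v` is kept between steps, whatever the effective window length.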

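Analysis: the exponentially weighted average (assuming β = 0.9)

$v_{100}=0.9v_{99}+0.1\theta_{100}$

$v_{99}=0.9v_{98}+0.1\theta_{99}$

$v_{98}=0.9v_{97}+0.1\theta_{98}$

...

$v_{100}=0.1\theta_{100}+0.1*0.9\theta_{99}+0.1*0.9^2\theta_{98}+...+0.1*0.9^{99}\theta_1$

$=0.1*\sum^{100}_{i=1}0.9^{(100-i)}*\theta_i$

The expression above is an exponentially weighted average: each past value is weighted by a factor that decays exponentially with its age.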
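As a quick numerical check of this expansion, the sketch below compares the recursive form with the unrolled sum on random data. The data are arbitrary and `theta[i-1]` stands in for $\theta_i$.

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 30.0, size=100)   # theta[i-1] plays the role of theta_i

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0.
v = 0.0
for value in theta:
    v = beta * v + (1 - beta) * value

# Unrolled form: v_100 = (1 - beta) * sum_i beta^(100 - i) * theta_i.
i = np.arange(1, 101)
weights = (1 - beta) * beta ** (100 - i)
v_unrolled = float(np.sum(weights * theta))

print(v, v_unrolled)      # the two values agree up to rounding
print(weights.sum())      # equals 1 - beta**100, slightly below 1
```

The weights sum to $1-\beta^{100}$ rather than exactly 1, which is the gap that bias correction compensates for in the early steps.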

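Analysis: why β = 0.9 averages about 10 days of temperature

ε = 1 - β = 1 - 0.9 = 0.1

Since $(1-\epsilon)^{\frac{1}{\epsilon}}\approx\frac{1}{e}$, we have $(0.9)^{10}\approx 0.35\approx\frac{1}{e}$ (e is the base of the natural logarithm, e ≈ 2.71828).

Once a weight has decayed to about $\frac{1}{e}$ of the peak weight, we say the average covers roughly $\frac{1}{\epsilon}=\frac{1}{1-\beta}$ days.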
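The approximation is easy to check numerically. In the sketch below, the helper name `weight_after_window` is an assumption introduced for the example.

```python
import math

def weight_after_window(beta):
    """Relative weight of an observation that is 1 / (1 - beta) steps old."""
    window = 1 / (1 - beta)
    return beta ** window

for beta in (0.5, 0.9, 0.98):
    print(f"beta={beta}: relative weight after 1/(1-beta) steps = "
          f"{weight_after_window(beta):.3f}")
print(f"1/e = {1 / math.e:.3f}")
```

The match is loose for small β (a 2-step window) and tightens as β approaches 1, which is why the "1/(1-β) days" reading is a rule of thumb rather than an exact statement.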

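- Advantage: to compute a plain local average over 10 days you have to store the last 10 temperatures, whereas an exponentially weighted average only needs the previous average, which saves a lot of memory.

- Drawback: it is less accurate than computing the average directly.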

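Bias correction

Bias correction improves the accuracy of the exponentially weighted average in the early steps.

Because the average starts from 0, its early values are biased toward zero and carry a large error. Bias correction reduces that error.

Formula: $v_t^{corrected}=\frac{v_t}{1-\beta^t}$

As t grows, $1-\beta^t$ approaches 1, so bias correction has less and less effect on later averages.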
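A constant input makes the effect of the correction easy to see. The sketch below is illustrative, and the function name `ewa_with_bias_correction` is an assumption.

```python
def ewa_with_bias_correction(thetas, beta):
    """Return (raw, corrected) running averages for `thetas`."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(thetas, start=1):   # t counts from 1
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        corrected.append(v / (1 - beta ** t))
    return raw, corrected

raw, corrected = ewa_with_bias_correction([20.0] * 5, beta=0.9)
print(raw)        # starts far below 20 because v was initialised to 0
print(corrected)  # stays at 20 (up to rounding) from the first step
```

For a constant input c the raw average is $(1-\beta^t)c$, so dividing by $1-\beta^t$ recovers c exactly.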

Gradient Descent with Momentum

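Basic idea: compute an exponentially weighted average of the gradients and use that average to update the parameters.

set VdW = 0, Vdb = 0 (VdW has the same shape as dW, and Vdb has the same shape as db)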

on iteration t:

Compute dW, db on current mini-batch

VdW = βVdW + (1-β)dW

Vdb = βVdb + (1-β)db

W = W - αVdW

b = b - αVdb

The red arrows show the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient (with respect to the current mini-batch) on each step. Rather than just following the gradient, we let the gradient influence v and then take a step in the direction of v.

Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in the variable v. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of v as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
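A minimal sketch of the update on a single parameter array, tried on a toy quadratic cost that is much steeper in one direction than the other. The function names, the cost, and the hyperparameter values are illustrative assumptions, not part of the course code.

```python
import numpy as np

def momentum_step(w, v, grad, beta=0.9, learning_rate=0.1):
    """One momentum update; returns the new (w, v)."""
    v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
    w = w - learning_rate * v          # step along v instead of the raw gradient
    return w, v

def grad_J(w):
    """Gradient of the toy cost J(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

w = np.array([5.0, 1.0])
v = np.zeros_like(w)                   # velocity starts at zero
for _ in range(200):
    w, v = momentum_step(w, v, grad_J(w))
print(w)   # both coordinates end up close to 0
```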

RMSprop(Root Mean Square prop)

set SdW = 0, Sdb = 0 (SdW has the same shape as dW, and Sdb has the same shape as db)

on iteration t:

Compute dW, db on current mini-batch

SdW = βSdW + (1-β)dW**2

Sdb = βSdb + (1-β)db**2

$W = W-\alpha\frac{dW}{\sqrt{SdW}}$

$b=b-\alpha\frac{db}{\sqrt{Sdb}}$

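RMSprop also damps oscillations. The difference from Momentum is that here each gradient component is handled element-wise as a scalar: it is squared, averaged, and square-rooted, so components with large gradient values are scaled down, which reduces the oscillation. Momentum instead treats the gradient as a vector, so components along the oscillating direction partly cancel across steps, which also reduces the oscillation.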
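A minimal sketch of one RMSprop step on a single array. The small `epsilon` in the denominator does not appear in the formulas above; it is added here as an assumption (borrowed from the Adam update) to avoid dividing by zero, and the function name and values are illustrative.

```python
import numpy as np

def rmsprop_step(w, s, grad, beta=0.9, learning_rate=0.01, epsilon=1e-8):
    """One RMSprop update; returns the new (w, s)."""
    s = beta * s + (1 - beta) * grad ** 2            # element-wise square
    w = w - learning_rate * grad / (np.sqrt(s) + epsilon)
    return w, s

grad = np.array([0.1, 10.0])   # gentle direction vs. steep direction
w, s = rmsprop_step(np.zeros(2), np.zeros(2), grad)
print(w)   # both coordinates move by the same amount despite the 100x gap
```

Dividing by the root of the running squared gradient equalizes the step size across coordinates, which is what shrinks the update along steep, oscillating directions.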

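Adam Optimization Algorithm

Adam = Adaptive Moment Estimation, an important optimization algorithm that combines Momentum and RMSprop.

set VdW = 0, Vdb = 0 (VdW has the same shape as dW, and Vdb has the same shape as db)

set SdW = 0, Sdb = 0 (SdW has the same shape as dW, and Sdb has the same shape as db)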

on iteration t:

Compute dW, db on current mini-batch

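$V_{dW}=\beta_1V_{dW}+(1-\beta_1)dW,\quad V_{db}=\beta_1V_{db}+(1-\beta_1)db$

$S_{dW}=\beta_2S_{dW}+(1-\beta_2)(dW)^2,\quad S_{db}=\beta_2S_{db}+(1-\beta_2)(db)^2$

Apply bias correction to the four quantities above:

$V^{corrected}_{dW}=\frac{V_{dW}}{1-\beta_1^t},\quad V^{corrected}_{db}=\frac{V_{db}}{1-\beta_1^t}$

$S^{corrected}_{dW}=\frac{S_{dW}}{1-\beta_2^t},\quad S^{corrected}_{db}=\frac{S_{db}}{1-\beta_2^t}$

$W=W-\alpha\frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\epsilon}$

$b=b-\alpha\frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\epsilon}$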

Choosing the hyperparameters

Typical values:

β1 = 0.9

β2 = 0.999

ε = 1e-8 (i.e. $10^{-8}$)

α: has to be tuned to find the best value
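Putting the pieces together for a single parameter array gives the sketch below. The function name `adam_step` and the learning rate of 0.001 are assumptions for illustration (α has to be tuned, as noted above). The other defaults follow the typical values listed.

```python
import numpy as np

def adam_step(w, v, s, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single array; t is the 1-based step count."""
    v = beta1 * v + (1 - beta1) * grad           # Momentum-style average
    s = beta2 * s + (1 - beta2) * grad ** 2      # RMSprop-style average
    v_corrected = v / (1 - beta1 ** t)           # bias correction
    s_corrected = s / (1 - beta2 ** t)
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
grad = np.array([0.5, -4.0])
w, v, s = adam_step(w, v, s, grad, t=1)
print(w)   # each coordinate moved by about learning_rate, against its gradient
```

On the first step the bias-corrected averages equal `grad` and `grad ** 2`, so the update has magnitude close to the learning rate regardless of the gradient's scale.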

Learning rate decay

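With a fixed α, the parameters may fail to converge late in training because α is too large, and instead keep fluctuating around the minimum.

So α is reduced on a schedule during training. Common choices are:

$\alpha=\frac{1}{1+\text{decay\_rate}\times\text{epoch\_num}}\alpha_0$ (α0 is the initial value of α), or

$\alpha=0.95^{\text{epoch\_num}}\alpha_0$, or

$\alpha=\frac{k}{\sqrt{\text{epoch\_num}}}\alpha_0$, or

$\alpha=\frac{k}{\sqrt{t}}\alpha_0$ (t is the mini-batch number), or

a discrete staircase, where each step's value is half of the previous one.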
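These schedules translate directly into small functions. In the sketch below the function names, the `epochs_per_step` parameter of the staircase, and the sample values are assumptions made for illustration.

```python
import math

def inverse_time_decay(alpha0, epoch_num, decay_rate):
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    return (base ** epoch_num) * alpha0

def inverse_sqrt_decay(alpha0, epoch_num, k):
    return k / math.sqrt(epoch_num) * alpha0      # requires epoch_num >= 1

def staircase_decay(alpha0, epoch_num, epochs_per_step):
    # Halve the learning rate once every `epochs_per_step` epochs.
    return alpha0 * 0.5 ** (epoch_num // epochs_per_step)

for epoch in (1, 2, 4, 8, 16):
    print(epoch,
          round(inverse_time_decay(0.2, epoch, decay_rate=1.0), 4),
          round(exponential_decay(0.2, epoch), 4),
          round(inverse_sqrt_decay(0.2, epoch, k=1.0), 4),
          round(staircase_decay(0.2, epoch, epochs_per_step=4), 4))
```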

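Programming assignment

This week's programming assignment needs the following files:

Week 2 programming assignment materials, extraction code: 8y19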

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *

plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


# Warm-up exercise: Implement the gradient descent update rule.
def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """

    L = len(parameters) // 2  # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        # START CODE HERE #
        # Layers are numbered from 1, so the key written must match the key read.
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads['db' + str(l+1)]
        # END CODE HERE

    return parameters


# test for update_parameters_with_gd
print("========== test for update_parameters_with_gd ==========")
parameters, grads, learning_rate = update_parameters_with_gd_test_case()
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

"""
Comparison of GD and SGD:
In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is 
large, SGD can be faster. But the parameters will "oscillate" toward the minimum rather than converge smoothly.
"""
# GD:
# X = data_input
# Y = labels
# parameters = initialize_parameters(layers_dims)
# for i in range(0, num_iterations):
#     # Forward propagation
#     a, caches = forward_propagation(X, parameters)
#     # Compute cost.
#     cost = compute_cost(a, Y)
#     # Backward propagation.
#     grads = backward_propagation(a, caches, parameters)
#     # Update parameters.
#     parameters = update_parameters(parameters, grads)

# SGD:
# X = data_input
# Y = labels
# parameters = initialize_parameters(layers_dims)
# for i in range(0, num_iterations):
#     for j in range(0, m):
#         # Forward propagation
#         a, caches = forward_propagation(X[:,j], parameters)
#         # Compute cost
#         cost = compute_cost(a, Y[:,j])
#         # Backward propagation
#         grads = backward_propagation(a, caches, parameters)
#         # Update parameters.
#         parameters = update_parameters(parameters, grads)

"""
Two steps of building mini-batches from training set (X, Y):
    - Shuffle: Create a shuffled version of the training set (X, Y).The shuffling step ensures that examples will be 
      split randomly into different mini-batches. Note that the random shuffling is done synchronously between X and Y. 
    - Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size.Note that the number of 
      training examples is not always divisible by mini_batch_size. The last mini batch might be smaller, but you don't 
      need to worry about this. 
"""
# Exercise: Implement random_mini_batches.
def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)  # To make your "random" minibatches the same as ours
    m = X.shape[1]  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(
        m / mini_batch_size)  # number of mini batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        # START CODE HERE
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k+1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k+1) * mini_batch_size]
        # END CODE HERE
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        # START CODE HERE
        end = m - mini_batch_size * num_complete_minibatches
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        # END CODE HERE
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches


# test for random_mini_batches
print("========== test for random_mini_batches ==========")
X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)
print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))


"""
Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of 
the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. 
Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous 
gradients in the variable v. Formally, this will be the exponentially weighted average of the gradient on previous 
steps. You can also think of v as the "velocity" of a ball rolling downhill, building up speed (and momentum) according 
to the direction of the gradient/slope of the hill.
"""
# Exercise: Initialize the velocity. The velocity, v, is a python dictionary that needs to be initialized with arrays of
# zeros. Its keys are the same as those in the grads dictionary.
def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """

    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        # START CODE HERE
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        # END CODE HERE

    return v


# test for initialize_velocity
print("========== test for initialize_velocity ==========")
parameters = initialize_velocity_test_case()
v = initialize_velocity(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))


# Exercise: Now, implement the parameters update with momentum.
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2  # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):
        # START CODE HERE
        # compute velocities
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1-beta) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1-beta) * grads['db' + str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]
        # END CODE HERE

    return parameters, v


# test for update_parameters_with_momentum
print("========== test for update_parameters_with_momentum ==========")
parameters, grads, v = update_parameters_with_momentum_test_case()
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
"""
Note that:
    - The velocity is initialized with zeros. So the algorithm will take a few iterations to "build up" velocity and 
      start to take bigger steps.
    - If β = 0, then this just becomes standard gradient descent without momentum.
    
How do you choose β ?
    - The larger the momentum β is, the smoother the update because the more we take the past gradients into account. 
      But if β is too big, it could also smooth out the updates too much.
    - Common values for β range from 0.8 to 0.999. If you don't feel inclined to tune this, β = 0.9 is often a 
      reasonable default.
    - Tuning the optimal β for your model might need trying several values to see what works best in term of reducing 
      the value of the cost function J.
"""


# Exercise: Initialize the Adam variables v, s which keep track of the past information
def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl

    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """

    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}
    s = {}

    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        # START CODE HERE #
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])

        s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        # END CODE HERE #

    return v, s


# test for initialize_adam
print("========== test for initialize_adam ==========")
parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))


# Exercise: Now, implement the parameters update with Adam.
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam step counter (number of updates taken so far, starting at 1), used for bias correction
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """

    L = len(parameters) // 2  # number of layers in the neural networks
    v_corrected = {}  # Initializing first moment estimate, python dictionary
    s_corrected = {}  # Initializing second moment estimate, python dictionary

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        # START CODE HERE #
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads['dW' + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads['db' + str(l + 1)]
        # END CODE HERE #

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        # START CODE HERE #
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1-np.power(beta1, t))
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1-np.power(beta1, t))
        # END CODE HERE #

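        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        # START CODE HERE #
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1-beta2) * np.square(grads['dW' + str(l+1)])
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1-beta2) * np.square(grads['db' + str(l+1)])
        # END CODE HERE #

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        # START CODE HERE #
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1-np.power(beta2, t))
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1-np.power(beta2, t))
        # END CODE HERE #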

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        # START CODE HERE #
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)
        # END CODE HERE #
    return parameters, v, s


# test for update_parameters_with_adam
print("========== test for update_parameters_with_adam ==========")
parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t=2)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))


# Model with different optimization algorithms
train_X, train_Y = load_dataset()


def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters
    """

    L = len(layers_dims)  # number of layers in the neural networks
    costs = []  # to keep track of the cost
    t = 0  # initializing the counter required for Adam update
    seed = 10  # For grading purposes, so that your "random" minibatches are the same as ours

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass  # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):

        # Define the random mini-batches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epochs (once per epoch, for every optimizer)
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters


# Mini_batch Gradient descent
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

# Mini_batch gradient descent with momentum
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

# Mini-batch with Adam mode
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)