Besides the familiar gradient descent, there are several general-purpose optimization algorithms that usually perform better than plain gradient descent. This post only records the Momentum and Adam algorithms encountered while working through Andrew Ng's deep learning assignments, and only with brief code. For the underlying principles, see 深度学习优化算法解析(Momentum, RMSProp, Adam), which explains the three optimization algorithms from Ng's course in more detail.
Momentum
Initialization
import numpy as np


def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the
          corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
        parameters['W' + str(l)] = Wl
        parameters['b' + str(l)] = bl
    Returns:
    v -- python dictionary containing the current velocity.
        v['dW' + str(l)] = velocity of dWl
        v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v = {}
    # Initialize each velocity to a zero array with the shape of the
    # corresponding parameter.
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v['dW' + str(l + 1)] = np.zeros(np.shape(parameters['W' + str(l + 1)]))
        v['db' + str(l + 1)] = np.zeros(np.shape(parameters['b' + str(l + 1)]))
        ### END CODE HERE ###
    return v
Parameter update
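The function below implements the standard momentum update for every layer l, with momentum coefficient β and learning rate α (a brief recap; see the reference above for the derivation):

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta\, v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}, &
v_{db^{[l]}} &= \beta\, v_{db^{[l]}} + (1-\beta)\, db^{[l]} \\
W^{[l]} &= W^{[l]} - \alpha\, v_{dW^{[l]}}, &
b^{[l]} &= b^{[l]} - \alpha\, v_{db^{[l]}}
\end{aligned}
$$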
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    Arguments:
    parameters -- python dictionary containing your parameters:
        parameters['W' + str(l)] = Wl
        parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
        grads['dW' + str(l)] = dWl
        grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
        v['dW' + str(l)] = ...
        v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # Momentum update for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v['dW' + str(l + 1)] = (beta * v['dW' + str(l + 1)]
                                + (1 - beta) * grads['dW' + str(l + 1)])
        v['db' + str(l + 1)] = (beta * v['db' + str(l + 1)]
                                + (1 - beta) * grads['db' + str(l + 1)])
        # update parameters
        parameters['W' + str(l + 1)] -= learning_rate * v['dW' + str(l + 1)]
        parameters['b' + str(l + 1)] -= learning_rate * v['db' + str(l + 1)]
        ### END CODE HERE ###
    return parameters, v
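As a quick sanity check, here is a minimal usage sketch with a hypothetical one-layer parameter dictionary and dummy gradients (shapes and values are made up for illustration; in a real model the gradients would come from backpropagation on each mini-batch):

# Hypothetical single-layer network: W1 of shape (2, 3), b1 of shape (2, 1).
parameters = {'W1': np.ones((2, 3)), 'b1': np.zeros((2, 1))}
grads = {'dW1': 0.1 * np.ones((2, 3)), 'db1': 0.1 * np.ones((2, 1))}

v = initialize_velocity(parameters)      # all-zero velocities
parameters, v = update_parameters_with_momentum(
    parameters, grads, v, beta=0.9, learning_rate=0.01)
# After one step: v['dW1'] == 0.01 everywhere, so W1 moves from 1.0 to 0.9999.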
Adam
Initialization
def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the
          corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
        parameters["W" + str(l)] = Wl
        parameters["b" + str(l)] = bl
    Returns:
    v -- python dictionary that will contain the exponentially weighted
         average of the gradient.
        v["dW" + str(l)] = ...
        v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted
         average of the squared gradient.
        s["dW" + str(l)] = ...
        s["db" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v = {}
    s = {}
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        v['dW' + str(l + 1)] = np.zeros(np.shape(parameters['W' + str(l + 1)]))
        v['db' + str(l + 1)] = np.zeros(np.shape(parameters['b' + str(l + 1)]))
        s['dW' + str(l + 1)] = np.zeros(np.shape(parameters['W' + str(l + 1)]))
        s['db' + str(l + 1)] = np.zeros(np.shape(parameters['b' + str(l + 1)]))
        ### END CODE HERE ###
    return v, s
Parameter update
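The function below applies the Adam update layer by layer: it keeps exponentially weighted averages of the gradients (v) and of the squared gradients (s), corrects both for their zero initialization using the step counter t, and then scales the step by the corrected second moment:

$$
\begin{aligned}
v_{dW^{[l]}} &= \beta_1\, v_{dW^{[l]}} + (1-\beta_1)\, dW^{[l]}, &
v^{\mathrm{corrected}}_{dW^{[l]}} &= \frac{v_{dW^{[l]}}}{1-\beta_1^{t}} \\
s_{dW^{[l]}} &= \beta_2\, s_{dW^{[l]}} + (1-\beta_2)\, \bigl(dW^{[l]}\bigr)^2, &
s^{\mathrm{corrected}}_{dW^{[l]}} &= \frac{s_{dW^{[l]}}}{1-\beta_2^{t}} \\
W^{[l]} &= W^{[l]} - \alpha\, \frac{v^{\mathrm{corrected}}_{dW^{[l]}}}{\sqrt{s^{\mathrm{corrected}}_{dW^{[l]}}} + \varepsilon}
\end{aligned}
$$

The same update is applied to b^{[l]} using db^{[l]}.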
def update_parameters_with_adam(parameters, grads, v, s, t,
                                learning_rate=0.01, beta1=0.9,
                                beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam
    Arguments:
    parameters -- python dictionary containing your parameters:
        parameters['W' + str(l)] = Wl
        parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
        grads['dW' + str(l)] = dWl
        grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient,
         python dictionary
    s -- Adam variable, moving average of the squared gradient,
         python dictionary
    t -- number of update steps taken so far, used for bias correction
    learning_rate -- the learning rate, scalar
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient,
         python dictionary
    s -- Adam variable, moving average of the squared gradient,
         python dictionary
    """
    L = len(parameters) // 2  # number of layers in the neural network
    v_corrected = {}          # bias-corrected first moment estimate
    s_corrected = {}          # bias-corrected second moment estimate
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients
        ### START CODE HERE ### (approx. 2 lines)
        v['dW' + str(l + 1)] = (beta1 * v['dW' + str(l + 1)]
                                + (1 - beta1) * grads['dW' + str(l + 1)])
        v['db' + str(l + 1)] = (beta1 * v['db' + str(l + 1)]
                                + (1 - beta1) * grads['db' + str(l + 1)])
        ### END CODE HERE ###
        # Bias-corrected first moment estimate
        ### START CODE HERE ### (approx. 2 lines)
        v_corrected['dW' + str(l + 1)] = v['dW' + str(l + 1)] / (1 - beta1 ** t)
        v_corrected['db' + str(l + 1)] = v['db' + str(l + 1)] / (1 - beta1 ** t)
        ### END CODE HERE ###
        # Moving average of the squared gradients
        ### START CODE HERE ### (approx. 2 lines)
        s['dW' + str(l + 1)] = (beta2 * s['dW' + str(l + 1)]
                                + (1 - beta2) * np.square(grads['dW' + str(l + 1)]))
        s['db' + str(l + 1)] = (beta2 * s['db' + str(l + 1)]
                                + (1 - beta2) * np.square(grads['db' + str(l + 1)]))
        ### END CODE HERE ###
        # Bias-corrected second moment estimate
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected['dW' + str(l + 1)] = s['dW' + str(l + 1)] / (1 - beta2 ** t)
        s_corrected['db' + str(l + 1)] = s['db' + str(l + 1)] / (1 - beta2 ** t)
        ### END CODE HERE ###
        # Update parameters; the bias-corrected s goes in the denominator
        ### START CODE HERE ### (approx. 2 lines)
        parameters['W' + str(l + 1)] -= (learning_rate * v_corrected['dW' + str(l + 1)]
                                         / (np.sqrt(s_corrected['dW' + str(l + 1)]) + epsilon))
        parameters['b' + str(l + 1)] -= (learning_rate * v_corrected['db' + str(l + 1)]
                                         / (np.sqrt(s_corrected['db' + str(l + 1)]) + epsilon))
        ### END CODE HERE ###
    return parameters, v, s
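Here is a minimal sketch of how initialize_adam and update_parameters_with_adam might be driven from a training loop, again with a hypothetical one-layer parameter dictionary and fixed dummy gradients for illustration (normally the gradients would be recomputed on each mini-batch). The step counter t must be 1 on the first update so that the bias-correction factors 1 - beta1**t and 1 - beta2**t are non-zero:

# Hypothetical single-layer example; shapes and values are made up.
parameters = {'W1': np.ones((2, 3)), 'b1': np.zeros((2, 1))}
grads = {'dW1': 0.1 * np.ones((2, 3)), 'db1': 0.1 * np.ones((2, 1))}

v, s = initialize_adam(parameters)
t = 0
for i in range(10):                      # e.g. one update per mini-batch
    t = t + 1                            # bias correction requires t >= 1
    parameters, v, s = update_parameters_with_adam(
        parameters, grads, v, s, t,
        learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)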