Gradient Optimization (SGD, Adam, Momentum, Mini-batch)

JZJQuest

Last modified 2024-07-09 17:02:22

First published 2024-07-09 14:52:31

Article link: https://blog.csdn.net/m0_70087562/article/details/140295579

Contents

I. Stochastic Gradient Descent (SGD)
II. Momentum Update
- Mini-batches
III. Adam

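I. Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) is an optimization algorithm commonly used to train many kinds of machine learning models, neural networks in particular. Instead of using the whole dataset for every update, SGD computes the gradient from a single randomly chosen sample or from a small batch of samples. Each update is therefore much cheaper, which usually speeds up training considerably.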

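The basic steps of the algorithm are listed below, followed by a simple Python example for a linear regression problem.

Steps:

1. Initialize the parameters: choose a small learning rate α and initialize the model parameters (for linear regression, the weight w and the bias b).
2. Choose a loss function: for linear regression, the usual choice is the mean squared error (MSE).
3. Sample randomly: pick one sample or one batch of samples at random from the dataset.
4. Compute the gradient: use the selected samples to compute the gradient of the loss with respect to the model parameters.
5. Update the parameters: use the gradient and the learning rate to update the model parameters.

Repeat steps 3-5 until a stopping condition is met (for example, the maximum number of iterations is reached or the loss falls below a threshold).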
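
For the linear-regression case with the MSE loss from step 2, steps 4 and 5 can be written out explicitly. This is a sketch under assumptions introduced here for illustration: a scalar feature $x_i$, the prediction $\hat{y}_i = w x_i + b$, and $B$ for the set of sampled indices (a single index for per-sample SGD).

$\begin{aligned} J(w, b) &= \frac{1}{|B|} \sum_{i \in B} (\hat{y}_i - y_i)^2 \\ \frac{\partial J}{\partial w} &= \frac{2}{|B|} \sum_{i \in B} (\hat{y}_i - y_i)\, x_i \\ \frac{\partial J}{\partial b} &= \frac{2}{|B|} \sum_{i \in B} (\hat{y}_i - y_i) \\ w &\leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b} \end{aligned}$

The factor 2 and the error term $\hat{y}_i - y_i$ are the same quantities that appear as `2 * xi.T.dot(error)` and `2 * error` in the Python example.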

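Advantage: because each update uses a noisy estimate of the gradient, SGD can leave the neighborhood of a poor local minimum (or a saddle point) and reach a better one. This is not guaranteed, and the same noise can also move the iterate from a better minimum to a worse one.
In the figure, for example, the iterate can move from the minimum to the left of the red point to the better minimum on its right.
[Figure not included: a red point with one minimum on its left and a better minimum on its right]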

Python example

Here we fit a simple linear regression model, looking for the optimal weight w and bias b with an SGD-style training loop.

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data: y = 4 + 3x + Gaussian noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Initialize the parameters
w = np.random.randn(1, 1)
b = np.zeros((1, 1))
learning_rate = 0.01
n_epochs = 1000
m = len(X)

# Loss history (one entry every 100 epochs)
losses = []

# Train the model
for epoch in range(n_epochs):
    # Shuffle the data
    indices = np.random.permutation(m)
    X_shuffled = X[indices]
    y_shuffled = y[indices]

    # Gradient accumulators
    dw = 0
    db = 0

    # Visit every sample and accumulate its gradient
    for i in range(m):
        xi = X_shuffled[i:i+1]
        yi = y_shuffled[i:i+1]
        predictions = xi.dot(w) + b
        error = predictions - yi
        dw += 2 * xi.T.dot(error)
        db += 2 * error

    # Update the parameters once per epoch with the averaged gradient
    w -= learning_rate * dw / m
    b -= learning_rate * db / m

    # Record the loss (sum of squared errors over the whole dataset)
    if epoch % 100 == 0:
        total_error = np.sum((y - (X.dot(w) + b)) ** 2)
        losses.append(total_error)

# Print the results
print("Trained weight w:", w)
print("Trained bias b:", b)
print("Last recorded loss:", losses[-1])

# Plot the recorded losses against the epoch at which each was taken
plt.plot(range(0, n_epochs, 100), losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('SGD Loss Over Epochs')
plt.show()

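Result
After running the code above, you will see the trained weight w, the trained bias b, and the last recorded loss value. Because the script fixes the random seed with np.random.seed(0), repeated runs produce the same numbers; with a different seed or no seed, the results will vary slightly from run to run.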

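Note: although the inner loop visits one sample at a time, this example accumulates the gradients of all samples and updates the parameters only once per epoch. Each update is therefore a full-batch gradient descent step, and shuffling the data does not change the result. For true SGD, the parameters must be updated inside the inner loop, using a single sample or a small fixed-size batch for each update rather than the whole dataset.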
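
The change described in the note can be sketched as follows. This is a minimal illustration rather than the author's code: the function name `sgd_linear_regression`, its default arguments, and the use of `np.random.default_rng` are assumptions made for the example. The key difference is that the parameter update sits inside the batch loop.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, n_epochs=50, batch_size=1, seed=0):
    """Fit y ~ X w + b by SGD (hypothetical helper). batch_size=1 is per-sample SGD."""
    rng = np.random.default_rng(seed)
    m = len(X)
    w = np.zeros((X.shape[1], 1))
    b = np.zeros((1, 1))
    for epoch in range(n_epochs):
        # Visit the samples in a fresh random order each epoch
        indices = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            xb, yb = X[batch], y[batch]
            error = xb.dot(w) + b - yb                      # shape (batch, 1)
            dw = 2 * xb.T.dot(error) / len(batch)           # MSE gradient w.r.t. w
            db = 2 * np.mean(error, axis=0, keepdims=True)  # MSE gradient w.r.t. b
            # The update happens here, once per batch, not once per epoch
            w -= learning_rate * dw
            b -= learning_rate * db
    return w, b

# Same kind of synthetic data as above: y = 4 + 3x + noise
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

w, b = sgd_linear_regression(X, y, batch_size=1)
print("w:", w.ravel(), "b:", b.ravel())
```

Setting `batch_size` to a value such as 16 gives mini-batch SGD, and setting it to `len(X)` recovers full-batch gradient descent.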

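Andrew Ng assignment: introduction to gradient optimization

Andrew Ng, "Deep Learning", L2W2 assignment - Heywhale.com

It covers: gradient descent (GD); batch processing; stochastic gradient descent (SGD); mini-batch GD; momentum; Adam.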

II. Momentum Update

Exercise: Now, implement the parameters update with momentum. The momentum update rule is, for $l = 1, ..., L$:

$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}\tag{3}$

$\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}\tag{4}$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are $W^{[1]}$ and $b^{[1]}$ (that’s a “one” on the superscript). So you will need to shift l to l+1 when coding.
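
To see what rule (3) does before working with the per-layer dictionaries, it can be applied to a single parameter. The sketch below is an illustration under assumptions: the helper `momentum_step`, the toy objective $J(w) = w^2$, and the values of `beta` and `learning_rate` are all chosen here for the example and are not part of the assignment.

```python
import numpy as np

def momentum_step(w, dw, v, beta=0.9, learning_rate=0.1):
    """One momentum update for a single array (hypothetical helper)."""
    v = beta * v + (1 - beta) * dw   # exponentially weighted average of gradients
    w = w - learning_rate * v        # step along the averaged gradient
    return w, v

# Toy objective J(w) = w^2, whose gradient is 2w
w = np.array([5.0])
v = np.zeros(1)

w, v = momentum_step(w, 2 * w, v)
first_w, first_v = w.copy(), v.copy()
print("after 1 step:", first_w, first_v)

for step in range(199):
    w, v = momentum_step(w, 2 * w, v)
print("after 200 steps:", w)
```

On the first step the velocity is only `(1 - beta)` times the gradient, because it starts from zero; it builds up over the following steps.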

Initializing the velocity

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
        ### END CODE HERE ###
        
    return v

Updating the parameters with momentum

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v["dW" + str(l + 1)] = beta*v["dW" + str(l + 1)]+(1-beta)*grads['dW' + str(l+1)]
        v["db" + str(l + 1)] = beta*v["db" + str(l + 1)]+(1-beta)*grads['db' + str(l+1)]
        # update parameters
        parameters["W" + str(l + 1)] = parameters['W' + str(l+1)] - learning_rate*v["dW" + str(l + 1)] 
        parameters["b" + str(l + 1)] = parameters['b' + str(l+1)] - learning_rate*v["db" + str(l + 1)] 
        ### END CODE HERE ###
        
    return parameters, v

Mini-batches

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (m, Hi, Wi, Ci)
    Y -- true "label" vector, of shape (m, n_y)
    mini_batch_size -- size of the mini-batches, integer
    seed -- this is only for the purpose of grading, so that your "random" minibatches are the same as ours.

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    m = X.shape[0]                  # number of training examples
    mini_batches = []
    np.random.seed(seed)

    # Step 1: Shuffle (X, Y) with the same permutation so pairs stay aligned
    permutation = list(np.random.permutation(m))
    shuffled_X = X[permutation,:,:,:]
    shuffled_Y = Y[permutation,:]

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:,:,:]
        mini_batch_Y = shuffled_Y[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[num_complete_minibatches * mini_batch_size : m,:,:,:]
        mini_batch_Y = shuffled_Y[num_complete_minibatches * mini_batch_size : m,:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
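
The shuffle-then-partition logic, including the smaller final batch, can be checked on a small example. The sketch below is a compact variant written for illustration, not the assignment's function: `make_mini_batches` is a hypothetical name, it assumes 2-D arrays with one example per row, and it uses `np.random.default_rng` instead of `np.random.seed`.

```python
import numpy as np

def make_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) together and split them into mini-batches (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    permutation = rng.permutation(m)
    # Apply the same permutation to X and Y so each row keeps its label
    X_shuffled, Y_shuffled = X[permutation], Y[permutation]
    # Slicing past the end of an array is safe, so the last batch is simply shorter
    return [
        (X_shuffled[start:start + mini_batch_size],
         Y_shuffled[start:start + mini_batch_size])
        for start in range(0, m, mini_batch_size)
    ]

# 148 examples: row i of X is [3i, 3i+1, 3i+2] and its label is i
X = np.arange(148 * 3).reshape(148, 3)
Y = np.arange(148).reshape(148, 1)

batches = make_mini_batches(X, Y, mini_batch_size=64)
print([batch_X.shape[0] for batch_X, _ in batches])  # [64, 64, 20]
```

With 148 examples and a batch size of 64 there are two complete mini-batches and one final mini-batch of 20, which corresponds to the end case handled by the `m % mini_batch_size != 0` branch.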

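III. Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines the advantages of RMSProp and Momentum.

How Adam works
1. It computes an exponentially weighted average of past gradients and stores it in the variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
2. It computes an exponentially weighted average of the squares of past gradients and stores it in the variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
3. It updates the parameters in a direction that combines the information from steps 1 and 2.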

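$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$

Here $t$ is the number of Adam steps taken so far (starting at 1), $\beta_1$ and $\beta_2$ control the two exponentially weighted averages, $\alpha$ is the learning rate, and $\varepsilon$ is a small constant that prevents division by zero.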
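
A worked first step may help show what the bias correction does. Assume $v$ and $s$ start at zero, take $t = 1$, and write $g$ for one entry of the gradient $\partial \mathcal{J} / \partial W^{[l]}$ (the shorthand $g$ is introduced here only for illustration; the second moment is corrected with $\beta_2$, as in the update_parameters_with_adam code):

$\begin{aligned} v &= (1 - \beta_1)\, g, & v^{corrected} &= \frac{(1 - \beta_1)\, g}{1 - \beta_1} = g \\ s &= (1 - \beta_2)\, g^2, & s^{corrected} &= \frac{(1 - \beta_2)\, g^2}{1 - \beta_2} = g^2 \\ W &\leftarrow W - \alpha \frac{g}{|g| + \varepsilon} \approx W - \alpha\, \mathrm{sign}(g) \end{aligned}$

So the correction undoes the shrinkage caused by the zero initialization, and the first step has size close to $\alpha$ whatever the scale of the gradient.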

Initialization

def initialize_adam(parameters):

    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}

    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        ### END CODE HERE ###

    return v, s

Checking the result:

parameters = initialize_adam_test_case()

v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))

# GRADED FUNCTION: update_parameters_with_adam

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l + 1)] = beta1*v["dW" + str(l + 1)] +(1-beta1)*grads['dW' + str(l+1)]
        v["db" + str(l + 1)] = beta1*v["db" + str(l + 1)] +(1-beta1)*grads['db' + str(l+1)]
        ### END CODE HERE ###

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)]/(1-(beta1)**t)
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)]/(1-(beta1)**t)
        ### END CODE HERE ###

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        ### START CODE HERE ### (approx. 2 lines)
        s["dW" + str(l + 1)] =beta2*s["dW" + str(l + 1)] + (1-beta2)*(grads['dW' + str(l+1)]**2)
        s["db" + str(l + 1)] = beta2*s["db" + str(l + 1)] + (1-beta2)*(grads['db' + str(l+1)]**2)
        ### END CODE HERE ###

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected["dW" + str(l + 1)] =s["dW" + str(l + 1)]/(1-(beta2)**t)
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)]/(1-(beta2)**t)
        ### END CODE HERE ###

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        ### START CODE HERE ### (approx. 2 lines)
        # epsilon is added outside the square root, matching the update formula
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / (np.sqrt(s_corrected["dW" + str(l + 1)]) + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / (np.sqrt(s_corrected["db" + str(l + 1)]) + epsilon)
        ### END CODE HERE ###
        
    return parameters, v, s
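
The same five steps can be tried on a toy problem without the per-layer dictionaries. The sketch below is an illustration under assumptions, not the assignment's function: `adam_step` is a hypothetical single-array helper, and the quadratic objective, the `target` vector, and the learning rate of 0.05 are chosen only for this example.

```python
import numpy as np

def adam_step(w, grad, v, s, t, learning_rate=0.05,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update on a single array (hypothetical helper)."""
    v = beta1 * v + (1 - beta1) * grad        # average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2   # average of squared gradients
    v_corrected = v / (1 - beta1 ** t)        # bias correction uses beta1
    s_corrected = s / (1 - beta2 ** t)        # bias correction uses beta2
    # epsilon is added outside the square root
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

# Toy objective J(w) = sum((w - target)^2), whose gradient is 2 (w - target)
target = np.array([1.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
s = np.zeros(2)

history = []
for t in range(1, 1001):   # t starts at 1 so the corrections are well defined
    grad = 2 * (w - target)
    w, v, s = adam_step(w, grad, v, s, t)
    history.append(w.copy())

print("after 1 step:", history[0])
print("after 1000 steps:", history[-1])
```

Both coordinates move by about the learning rate on the first step even though their gradients differ by a factor of two, which is the effect of dividing by the square root of the second-moment estimate.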
