Gradient Optimization (SGD, Adam, Momentum, Mini-batch)

I. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used to train many kinds of machine learning models, especially neural networks. Instead of using the entire dataset for every update, SGD computes the gradient from a single randomly chosen sample (or a small batch of samples) at each step, which greatly speeds up training.

Below are the basic steps of the SGD algorithm, followed by a simple Python example for a linear regression problem.

Steps:

  1. Initialize the parameters: choose a small learning rate α and initialize the model parameters (e.g., the weight w and bias b for linear regression).
  2. Choose a loss function: for linear regression, the mean squared error (MSE) is a common choice.
  3. Randomly select samples: pick one sample or one batch of samples at random from the dataset.
  4. Compute the gradient: use the selected samples to compute the gradient of the loss with respect to the model parameters.
  5. Update the parameters: adjust the parameters using the gradient and the learning rate.

Repeat steps 3-5 until a stopping condition is met (e.g., the maximum number of iterations is reached, or the loss drops below a threshold).

Advantage: the noise introduced by random sampling can kick the parameters out of the neighborhood of a poor saddle point and toward a better one, for example from the saddle point to the left of the red dot to the better saddle point on its right in the original figure (omitted here), although this escape behavior is not guaranteed on every update.

Python Example

Here we solve a simple linear regression problem and use SGD to find the optimal weight w and bias b.

import numpy as np  
  
# Generate some synthetic data
np.random.seed(0)  
X = 2 * np.random.rand(100, 1)  
y = 4 + 3 * X + np.random.randn(100, 1)  
  
# Initialize the parameters
w = np.random.randn(1, 1)  
b = np.zeros((1, 1))  
learning_rate = 0.01  
n_epochs = 1000  
m = len(X)  
  
# Loss history (recorded every 100 epochs)
losses = []  
  
# Train the model
for epoch in range(n_epochs):  
    # Shuffle the data
    indices = np.random.permutation(m)  
    X_shuffled = X[indices]  
    y_shuffled = y[indices]  
      
    # Gradient accumulators for this epoch
    dw = 0  
    db = 0  
      
    # Loop over every sample
    for i in range(m):  
        xi = X_shuffled[i:i+1]  
        yi = y_shuffled[i:i+1]  
        predictions = xi.dot(w) + b  
        error = predictions - yi  
        dw += 2 * xi.T.dot(error)  
        db += 2 * error  
      
    # Update the parameters (once per epoch, using the averaged gradient)
    w -= learning_rate * dw / m  
    b -= learning_rate * db / m  
      
    # Record the loss every 100 epochs
    if epoch % 100 == 0:  
        total_error = np.sum((y - (X.dot(w) + b)) ** 2)  
        losses.append(total_error)  
  
# Print the results
print("Trained weight w:", w)
print("Trained bias b:", b)
print("Last recorded loss:", losses[-1])
  
# Visualize the loss curve (requires matplotlib)
import matplotlib.pyplot as plt  
plt.plot(losses)  
plt.xlabel('Epoch (x100)')  
plt.ylabel('Loss')  
plt.title('SGD Loss Over Epochs')  
plt.show()

Results
After running the code above, you will see the trained weight w and bias b, together with the last recorded loss. Because of the random initialization and shuffling, the results may differ slightly from run to run.

Note: although the gradient is computed sample by sample, this example accumulates those per-sample gradients over the whole (shuffled) dataset and applies a single averaged update per epoch, which makes it equivalent to full-batch gradient descent. For true SGD you would update the parameters inside the inner loop, using only one sample (or one small mini-batch) per update, as in the sketch below.
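
For reference, here is a minimal sketch of what the inner loop looks like for true per-sample SGD, reusing the X, y, w, b, learning_rate and n_epochs defined in the example above:

# True SGD: update the parameters immediately after every single sample.
for epoch in range(n_epochs):
    for i in np.random.permutation(m):
        xi = X[i:i+1]                                 # one sample, shape (1, 1)
        yi = y[i:i+1]
        error = xi.dot(w) + b - yi                    # prediction error for this sample
        w -= learning_rate * 2 * xi.T.dot(error)      # per-sample gradient step for w
        b -= learning_rate * 2 * error                # per-sample gradient step for b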

Andrew Ng's Assignment: Introduction to Gradient Optimization

Andrew Ng's Deep Learning L2W2 assignment - Heywhale.com

It covers: (batch) gradient descent (GD); stochastic gradient descent (SGD); mini-batch GD; momentum; and Adam.

II. Momentum Update

Exercise: Now, implement the parameter update with momentum. The momentum update rule is, for $l = 1, ..., L$:

$$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}\tag{3}$$

$$\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}\tag{4}$$

where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop while the first parameters are $W^{[1]}$ and $b^{[1]}$ (that's a "one" on the superscript), so you will need to shift l to l+1 when coding.

  1. Initialize the velocity
def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
        ### END CODE HERE ###
        
    return v
  2. Update the parameters with momentum
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v["dW" + str(l + 1)] = beta*v["dW" + str(l + 1)]+(1-beta)*grads['dW' + str(l+1)]
        v["db" + str(l + 1)] = beta*v["db" + str(l + 1)]+(1-beta)*grads['db' + str(l+1)]
        # update parameters
        parameters["W" + str(l + 1)] = parameters['W' + str(l+1)] - learning_rate*v["dW" + str(l + 1)] 
        parameters["b" + str(l + 1)] = parameters['b' + str(l+1)] - learning_rate*v["db" + str(l + 1)] 
        ### END CODE HERE ###
        
    return parameters, v
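
As a small usage sketch, the two functions can be exercised on a hypothetical two-layer parameter dictionary with made-up gradients (the shapes below are illustrative assumptions, not the assignment's test case):

import numpy as np

# Hypothetical two-layer network; shapes chosen purely for illustration.
np.random.seed(1)
parameters = {"W1": np.random.randn(3, 2), "b1": np.zeros((3, 1)),
              "W2": np.random.randn(1, 3), "b2": np.zeros((1, 1))}
grads = {"dW1": np.random.randn(3, 2), "db1": np.random.randn(3, 1),
         "dW2": np.random.randn(1, 3), "db2": np.random.randn(1, 1)}

v = initialize_velocity(parameters)                       # all velocities start at zero
parameters, v = update_parameters_with_momentum(parameters, grads, v,
                                                beta=0.9, learning_rate=0.01)
print("W1 after one momentum step:\n", parameters["W1"])
print("velocity dW1:\n", v["dW1"])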

Mini-batches (mini_batch)

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (m, Hi, Wi, Ci)
    Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (m, n_y)
    mini_batch_size -- size of the mini-batches, integer
    seed -- this is only for the purpose of grading, so that your "random" minibatches are the same as ours

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    m = X.shape[0]                   # number of training examples
    mini_batches = []
    np.random.seed(seed)

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[permutation, :, :, :]
    shuffled_Y = Y[permutation, :]

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini-batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[k * mini_batch_size : k * mini_batch_size + mini_batch_size, :, :, :]
        mini_batch_Y = shuffled_Y[k * mini_batch_size : k * mini_batch_size + mini_batch_size, :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[num_complete_minibatches * mini_batch_size : m, :, :, :]
        mini_batch_Y = shuffled_Y[num_complete_minibatches * mini_batch_size : m, :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
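
As a quick usage sketch (the array shapes below are illustrative assumptions matching the (m, Hi, Wi, Ci) indexing used in the function, not data from the assignment), the mini-batches can be generated and iterated like this:

# Illustrative data: 148 fake 8x8 RGB "images" and binary labels.
X_example = np.random.randn(148, 8, 8, 3)
Y_example = np.random.randint(0, 2, size=(148, 1))

mini_batches = random_mini_batches(X_example, Y_example, mini_batch_size=64, seed=0)
print("number of mini-batches:", len(mini_batches))   # 3: two of size 64 and one of size 20
for mini_batch_X, mini_batch_Y in mini_batches:
    print(mini_batch_X.shape, mini_batch_Y.shape)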

III. Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.

How Adam works:
1. It computes an exponentially weighted average of the past gradients and stores it in the variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
2. It computes an exponentially weighted average of the squares of the past gradients and stores it in the variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
3. It combines the information from "1" and "2" to update the parameters in a chosen direction.

$$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J}}{\partial W^{[l]}} \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\frac{\partial \mathcal{J}}{\partial W^{[l]}}\right)^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}$$
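
Before moving to the layer-indexed implementation below, the update can be illustrated on a single scalar parameter. This is a minimal sketch of my own (not part of the assignment), running Adam on the toy objective J(w) = (w - 5)^2:

import numpy as np

# Minimal Adam on J(w) = (w - 5)^2, whose gradient is dJ/dw = 2 * (w - 5).
w = 0.0
v, s = 0.0, 0.0                          # first and second moment estimates
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):                  # t starts at 1 so the bias correction is defined
    dw = 2 * (w - 5)
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_corrected = v / (1 - beta1 ** t)   # bias-corrected first moment
    s_corrected = s / (1 - beta2 ** t)   # bias-corrected second moment
    w -= alpha * v_corrected / (np.sqrt(s_corrected) + eps)

print("w after 200 Adam steps:", w)      # should end up close to the minimizer w = 5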

  1. Initialization
def initialize_adam(parameters) :
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
    ### START CODE HERE ### (approx. 4 lines)
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l+1)].shape)
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l+1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l+1)].shape)
        ### END CODE HERE ###
    
    return v, s

Check the results:

parameters = initialize_adam_test_case()

v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
# GRADED FUNCTION: update_parameters_with_adam

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l + 1)] = beta1*v["dW" + str(l + 1)] +(1-beta1)*grads['dW' + str(l+1)]
        v["db" + str(l + 1)] = beta1*v["db" + str(l + 1)] +(1-beta1)*grads['db' + str(l+1)]
        ### END CODE HERE ###

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)]/(1-(beta1)**t)
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)]/(1-(beta1)**t)
        ### END CODE HERE ###

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        ### START CODE HERE ### (approx. 2 lines)
        s["dW" + str(l + 1)] =beta2*s["dW" + str(l + 1)] + (1-beta2)*(grads['dW' + str(l+1)]**2)
        s["db" + str(l + 1)] = beta2*s["db" + str(l + 1)] + (1-beta2)*(grads['db' + str(l+1)]**2)
        ### END CODE HERE ###

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        ### START CODE HERE ### (approx. 2 lines)
        s_corrected["dW" + str(l + 1)] =s["dW" + str(l + 1)]/(1-(beta2)**t)
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)]/(1-(beta2)**t)
        ### END CODE HERE ###

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        ### START CODE HERE ### (approx. 2 lines)
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)]-learning_rate*(v_corrected["dW" + str(l + 1)]/np.sqrt( s_corrected["dW" + str(l + 1)]+epsilon))
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)]-learning_rate*(v_corrected["db" + str(l + 1)]/np.sqrt( s_corrected["db" + str(l + 1)]+epsilon))
        ### END CODE HERE ###
        
    return parameters, v, s
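
To see initialize_adam and update_parameters_with_adam working end to end, here is a self-contained toy sketch (my own example, not from the assignment): it fits a small linear model with a single parameter pair W1, b1 using full-batch MSE gradients.

# Toy end-to-end example: recover W1 ~ [1, -2, 0.5, 3] and b1 ~ 0.7 with Adam.
np.random.seed(3)
X_demo = np.random.randn(200, 4)                          # 200 samples, 4 features
true_W, true_b = np.array([[1.0, -2.0, 0.5, 3.0]]), 0.7
Y_demo = X_demo.dot(true_W.T) + true_b                    # targets, shape (200, 1)

parameters = {"W1": np.zeros((1, 4)), "b1": np.zeros((1, 1))}
v, s = initialize_adam(parameters)

for t in range(1, 2001):                                  # t starts at 1 for bias correction
    preds = X_demo.dot(parameters["W1"].T) + parameters["b1"]
    error = preds - Y_demo
    grads = {"dW1": 2 * error.T.dot(X_demo) / len(X_demo),   # d(MSE)/dW1, shape (1, 4)
             "db1": 2 * np.mean(error, keepdims=True)}       # d(MSE)/db1, shape (1, 1)
    parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                   learning_rate=0.01)

print("learned W1:", parameters["W1"])    # should approach [1.0, -2.0, 0.5, 3.0]
print("learned b1:", parameters["b1"])    # should approach 0.7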
