一、随机梯度下降(SGD)
随机梯度下降(Stochastic Gradient Descent, SGD)是一种优化算法,常用于训练各种类型的机器学习模型,尤其是神经网络。在SGD中,不是每次更新都使用全部数据,而是每次更新只随机选取一个样本或一小批(batch)样本来计算梯度,这样可以大大加快训练速度。
以下是随机梯度下降算法的基本步骤,以及一个简单的Python示例,用于线性回归问题。
步骤:
- 初始化参数:选择一个小的学习率 α,并初始化模型的参数(例如,线性回归的权重 w 和偏置 b)。
- 选择损失函数:对于线性回归,常用的损失函数是均方误差(MSE)。
- 随机选择样本:从数据集中随机选择一个样本或一批样本。
- 计算梯度:使用选定的样本计算损失函数关于模型参数的梯度。
- 更新参数:使用梯度和学习率来更新模型的参数。
重复步骤3-5,直到满足某个停止条件(如达到最大迭代次数,或损失函数值小于某个阈值)。
优点:可以一个不是最优的鞍点附近直接跳到更优的鞍点处,但是可以保证不会从一个更优的鞍点跳到更劣的鞍点附近。
例如可以下图中红点左边的鞍点跳到右边更优的鞍点。
Python 示例
这里我们实现一个简单的线性回归问题,使用SGD来找到最优的权重 w 和偏置 b。
import numpy as np
# 生成一些数据
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# 初始化参数
w = np.random.randn(1, 1)
b = np.zeros((1, 1))
learning_rate = 0.01
n_epochs = 1000
m = len(X)
# 损失历史
losses = []
# 训练模型
for epoch in range(n_epochs):
# 随机打乱数据
indices = np.random.permutation(m)
X_shuffled = X[indices]
y_shuffled = y[indices]
# 梯度累加
dw = 0
db = 0
# 遍历每个样本
for i in range(m):
xi = X_shuffled[i:i+1]
yi = y_shuffled[i:i+1]
predictions = xi.dot(w) + b
error = predictions - yi
dw += 2 * xi.T.dot(error)
db += 2 * error
# 更新参数
w -= learning_rate * dw / m
b -= learning_rate * db / m
# 记录损失
if epoch % 100 == 0:
total_error = np.sum((y - (X.dot(w) + b)) ** 2)
losses.append(total_error)
# 输出结果
print("训练后的权重 w:", w)
print("训练后的偏置 b:", b)
print("最终损失:", losses[-1])
# 可视化损失变化(如果有matplotlib)
import matplotlib.pyplot as plt
plt.plot(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('SGD Loss Over Epochs')
plt.show()
结果
运行上述代码后,你将看到模型训练后的权重 w 和偏置 b,以及最终损失值。由于初始化和随机性的原因,每次运行的结果可能略有不同。
注意:这个示例使用了全批量梯度下降(Full Batch Gradient Descent)的更新方式(即每次迭代都遍历整个数据集),但计算梯度时只使用了单个样本,这实际上模拟了SGD的随机性。为了真正的SGD,你需要在每次迭代中只使用一个样本或一小批样本,而不是整个数据集。你可以通过修改内部循环来实现这一点,例如,每次只处理一个样本或固定大小的batch。
吴恩达作业—梯度优化介绍
吴恩达《深度学习》L2W2作业 - Heywhale.com
包括:梯度下降(GD)
批量处理;随机梯度下降(SGD);小批量GD
momentum;Adam。
二、动量的更新
Exercise: Now, implement the parameters update with momentum. The momentum update rule is,(实现带冲量的参数更新。冲量更新规则是,) for l = 1 , . . . , L l = 1, ..., L l=1,...,L:
{ v d W [ l ] = β v d W [ l ] + ( 1 − β ) d W [ l ] W [ l ] = W [ l ] − α v d W [ l ] (3) \begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}\tag{3} {vdW[l]=βvdW[l]+(1−β)dW[l]W[l]=W[l]−αvdW[l](3)
{ v d b [ l ] = β v d b [ l ] + ( 1 − β ) d b [ l ] b [ l ] = b [ l ] − α v d b [ l ] (4) \begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}\tag{4} {vdb[l]=βvdb[l]+(1−β)db[l]b[l]=b[l]−αvdb[l](4)
where L is the number of layers,
β
\beta
β is the momentum and
α
\alpha
α is the learning rate. All parameters should be stored in the parameters
dictionary. Note that the iterator l
starts at 0 in the for
loop while the first parameters are
W
[
1
]
W^{[1]}
W[1] and
b
[
1
]
b^{[1]}
b[1] (that’s a “one” on the superscript). So you will need to shift l
to l+1
when coding.
- 初始化动量
def initialize_velocity(parameters):
"""
Initializes the velocity as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
Returns:
v -- python dictionary containing the current velocity.
v['dW' + str(l)] = velocity of dWl
v['db' + str(l)] = velocity of dbl
"""
L = len(parameters) // 2 # number of layers in the neural networks
v = {}
# Initialize velocity
for l in range(L):
### START CODE HERE ### (approx. 2 lines)
v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
### END CODE HERE ###
return v
- 更新参数with动量
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
"""
Update parameters using Momentum
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- python dictionary containing the current velocity:
v['dW' + str(l)] = ...
v['db' + str(l)] = ...
beta -- the momentum hyperparameter, scalar
learning_rate -- the learning rate, scalar
Returns:
parameters -- python dictionary containing your updated parameters
v -- python dictionary containing your updated velocities
"""
L = len(parameters) // 2 # number of layers in the neural networks
# Momentum update for each parameter
for l in range(L):
### START CODE HERE ### (approx. 4 lines)
# compute velocities
v["dW" + str(l + 1)] = beta*v["dW" + str(l + 1)]+(1-beta)*grads['dW' + str(l+1)]
v["db" + str(l + 1)] = beta*v["db" + str(l + 1)]+(1-beta)*grads['db' + str(l+1)]
# update parameters
parameters["W" + str(l + 1)] = parameters['W' + str(l+1)] - learning_rate*v["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters['b' + str(l+1)] - learning_rate*v["db" + str(l + 1)]
### END CODE HERE ###
return parameters, v
小批量mini_batch
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
"""
Creates a list of random minibatches from (X, Y) Arguments:
X -- input data, of shape (input size, number of examples) (m, Hi, Wi, Ci) Y -- true "label" vector (containing 0 if cat, 1 if non-cat), of shape (1, number of examples) (m, n_y) mini_batch_size - size of the mini-batches, integer seed -- this is only for the purpose of grading, so that you're "random minibatches are the same as ours. Returns:
mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y) """ m = X.shape[0] # number of training examples
mini_batches = []
np.random.seed(seed)
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[permutation,:,:,:]
shuffled_Y = Y[permutation,:]
# Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
for k in range(0, num_complete_minibatches):
mini_batch_X = shuffled_X[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:,:,:]
mini_batch_Y = shuffled_Y[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# Handling the end case (last mini-batch < mini_batch_size)
if m % mini_batch_size != 0:
mini_batch_X = shuffled_X[num_complete_minibatches * mini_batch_size : m,:,:,:]
mini_batch_Y = shuffled_Y[num_complete_minibatches * mini_batch_size : m,:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
return mini_batches
三、Adam
Adam是训练神经网络最有效的优化算法之一。它结合了RMSProp和Momentum的优点。
Adam原理
1.计算过去梯度的指数加权平均值,并将其存储在变量(使用偏差校正之前)和(使用偏差校正)中。
2.计算过去梯度的平方的指数加权平均值,并将其存储在变量(偏差校正之前)和(偏差校正中)中。
3.组合“1”和“2”的信息,在一个方向上更新参数
{ v d W [ l ] = β 1 v d W [ l ] + ( 1 − β 1 ) ∂ J ∂ W [ l ] v d W [ l ] c o r r e c t e d = v d W [ l ] 1 − ( β 1 ) t s d W [ l ] = β 2 s d W [ l ] + ( 1 − β 2 ) ( ∂ J ∂ W [ l ] ) 2 s d W [ l ] c o r r e c t e d = s d W [ l ] 1 − ( β 1 ) t W [ l ] = W [ l ] − α v d W [ l ] c o r r e c t e d s d W [ l ] c o r r e c t e d + ε \begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_1)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases} ⎩ ⎨ ⎧vdW[l]=β1vdW[l]+(1−β1)∂W[l]∂JvdW[l]corrected=1−(β1)tvdW[l]sdW[l]=β2sdW[l]+(1−β2)(∂W[l]∂J)2sdW[l]corrected=1−(β1)tsdW[l]W[l]=W[l]−αsdW[l]corrected+εvdW[l]corrected
- 初始化
def initialize_adam(parameters) :
L = len(parameters) // 2 # number of layers in the neural networks
v = {}
s = {}
# Initialize v, s. Input: "parameters". Outputs: "v, s".
for l in range(L):
### START CODE HERE ### (approx. 4 lines)
v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l+1)].shape)
v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l+1)].shape)
s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l+1)].shape)
s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l+1)].shape)
### END CODE HERE ###
return v, s
查看结果:
parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
# GRADED FUNCTION: update_parameters_with_adam
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
L = len(parameters) // 2 # number of layers in the neural networks
v_corrected = {} # Initializing first moment estimate, python dictionary
s_corrected = {} # Initializing second moment estimate, python dictionary
# Perform Adam update on all parameters
for l in range(L):
# Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
### START CODE HERE ### (approx. 2 lines)
v["dW" + str(l + 1)] = beta1*v["dW" + str(l + 1)] +(1-beta1)*grads['dW' + str(l+1)]
v["db" + str(l + 1)] = beta1*v["db" + str(l + 1)] +(1-beta1)*grads['db' + str(l+1)]
### END CODE HERE ###
# Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
### START CODE HERE ### (approx. 2 lines)
v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)]/(1-(beta1)**t)
v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)]/(1-(beta1)**t)
### END CODE HERE ###
# Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
### START CODE HERE ### (approx. 2 lines)
s["dW" + str(l + 1)] =beta2*s["dW" + str(l + 1)] + (1-beta2)*(grads['dW' + str(l+1)]**2)
s["db" + str(l + 1)] = beta2*s["db" + str(l + 1)] + (1-beta2)*(grads['db' + str(l+1)]**2)
### END CODE HERE ###
# Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
### START CODE HERE ### (approx. 2 lines)
s_corrected["dW" + str(l + 1)] =s["dW" + str(l + 1)]/(1-(beta2)**t)
s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)]/(1-(beta2)**t)
### END CODE HERE ###
# Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
### START CODE HERE ### (approx. 2 lines)
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)]-learning_rate*(v_corrected["dW" + str(l + 1)]/np.sqrt( s_corrected["dW" + str(l + 1)]+epsilon))
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)]-learning_rate*(v_corrected["db" + str(l + 1)]/np.sqrt( s_corrected["db" + str(l + 1)]+epsilon))
### END CODE HERE ###
return parameters, v, s
结果:
parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))