Mini-batch
Mini-batch gradient descent
parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads['db' + str(l+1)]
Because this is the cost on a single mini-batch, we tag the cost J with the superscript t in curly braces, writing it as J^{t}.
The procedure is essentially identical to batch gradient descent, except that you now operate on X^{t} and Y^{t} instead of X and Y. Next, run backpropagation to compute the gradients of J^{t}, again simply using X^{t} and Y^{t} in place of X and Y, and then update the weights, as in the sketch below.
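A minimal sketch of one epoch of training over the mini-batches; mini_batches is assumed to come from the random_mini_batches function shown next, and forward_propagation, compute_cost, backward_propagation, and update_parameters are hypothetical stand-ins for the usual model helpers:

# One epoch of mini-batch gradient descent (sketch, helper names assumed)
for minibatch_X, minibatch_Y in mini_batches:               # (X^{t}, Y^{t})
    AL, caches = forward_propagation(minibatch_X, parameters)
    cost = compute_cost(AL, minibatch_Y)                    # J^{t}
    grads = backward_propagation(AL, minibatch_Y, caches)   # gradients of J^{t}
    parameters = update_parameters(parameters, grads, learning_rate)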
"""
Arguments:
X -- input data, of shape (input size, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
mini_batch_size -- size of the mini-batches, integer
"""
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m)) # 随机排序
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1,m))
# Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
for k in range(0, num_complete_minibatches):
### START CODE HERE ### (approx. 2 lines)
mini_batch_X = shuffled_X[:, k*mini_batch_size : (k+1)*mini_batch_size]
mini_batch_Y = shuffled_Y[:, k*mini_batch_size : (k+1)*mini_batch_size]
### END CODE HERE ###
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# Handling the end case (last mini-batch < mini_batch_size)
if m % mini_batch_size != 0:
### START CODE HERE ### (approx. 2 lines)
mini_batch_X = shuffled_X[:, num_complete_minibatches*mini_batch_size : m]
mini_batch_Y = shuffled_Y[:, num_complete_minibatches*mini_batch_size : m]
### END CODE HERE ###
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
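A quick sanity check of random_mini_batches on toy data (the shapes here are illustrative):

# 10 examples with mini_batch_size=4 -> two full batches plus one of size 2
X = np.random.randn(1, 10)
Y = (np.random.randn(1, 10) > 0).astype(int)
mini_batches = random_mini_batches(X, Y, mini_batch_size=4, seed=0)
print([mb_X.shape for mb_X, mb_Y in mini_batches])  # [(1, 4), (1, 4), (1, 2)]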
Understanding mini-batch gradient descent
Choosing the mini-batch size (batch_size)
Gradient descent with Momentum
To further speed up gradient descent, you can compute an exponentially weighted average of the gradients and use that average to update your weights; this is gradient descent with momentum. Momentum, RMSprop, and Adam also help the optimizer avoid getting trapped in local optima and pass quickly through saddle points.
For example, suppose you are optimizing a cost function whose contours look like the figure, with the red dot marking the position of the minimum.
Say you start gradient descent from the blue dot. One iteration, whether of batch or mini-batch gradient descent, might land you on the other side of the ellipse; computing the next step might swing you back across, and so on. You can see that plain gradient descent takes many steps, oscillating slowly toward the minimum.
On iteration t, compute dW and db on the current mini-batch, then form v_dW = β·v_dW + (1−β)·dW to obtain the moving average of dW; compute v_db = β·v_db + (1−β)·db in the same way. The parameters are then updated with W := W − α·v_dW and b := b − α·v_db.
def initialize_velocity(parameters):
    L = len(parameters) // 2   # number of layers in the neural network
    v = {}
    # Initialize the velocity to zeros, with the same shapes as the parameters
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros(parameters['W' + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters['b' + str(l+1)].shape)
    return v

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    L = len(parameters) // 2
    # Momentum update for each parameter
    for l in range(L):
        # compute velocities (exponentially weighted averages of the gradients)
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads['db' + str(l+1)]
        # update parameters using the velocities
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]
    return parameters, v
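A common default is beta = 0.9, which roughly averages over the last 10 gradients. A sketch of how these two functions fit into the training loop (the surrounding forward/backward code is elided):

v = initialize_velocity(parameters)
for minibatch_X, minibatch_Y in mini_batches:
    # ... forward pass, cost, and backward pass producing `grads` ...
    parameters, v = update_parameters_with_momentum(parameters, grads, v,
                                                    beta=0.9, learning_rate=0.01)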
RMSprop
RMSprop (root mean square prop) keeps an exponentially weighted average of the squared gradients, s_dW = β₂·s_dW + (1−β₂)·dW² and s_db = β₂·s_db + (1−β₂)·db², and updates W := W − α·dW/(√s_dW + ε), b := b − α·db/(√s_db + ε). A small ε (for example 10⁻⁸) is added to the denominator so that, no matter what happens, you never divide by a very, very small number.
RMSprop is quite similar to Momentum in that it damps the oscillations of gradient descent, including mini-batch gradient descent, and it allows you to use a larger learning rate α, which speeds up learning.
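The notes give no code for RMSprop, so here is a minimal sketch in the same style as the momentum update; the function name and default values are assumptions, not taken from the assignment:

def update_parameters_with_rmsprop(parameters, grads, s, beta2 = 0.999,
                                   learning_rate = 0.01, epsilon = 1e-8):
    # s holds exponentially weighted averages of the *squared* gradients,
    # initialized to zeros with the same shapes as the parameters
    L = len(parameters) // 2
    for l in range(L):
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * grads['dW' + str(l+1)]**2
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * grads['db' + str(l+1)]**2
        # epsilon keeps the denominator away from zero
        parameters["W" + str(l+1)] -= learning_rate * grads['dW' + str(l+1)] / (np.sqrt(s["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] -= learning_rate * grads['db' + str(l+1)] / (np.sqrt(s["db" + str(l+1)]) + epsilon)
    return parameters, s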
Adam optimization algorithm
RMSprop and Adam are among the few optimization algorithms that have stood the test of time; they have proven to work well across many different deep learning architectures.
The Adam optimization algorithm is essentially Momentum and RMSprop combined.
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- iteration counter, used for bias correction
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2   # number of layers in the neural network
    v_corrected = {}           # bias-corrected first moment estimate
    s_corrected = {}           # bias-corrected second moment estimate

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients (Momentum part)
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads['db' + str(l+1)]

        # Compute bias-corrected first moment estimate
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1**t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1**t)

        # Moving average of the squared gradients (RMSprop part)
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * grads['dW' + str(l+1)]**2
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * grads['db' + str(l+1)]**2

        # Compute bias-corrected second raw moment estimate
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2**t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2**t)

        # Update parameters; note that epsilon is added *outside* the square root
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s
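In the assignment, v and s start as zero dictionaries with the same shapes as the gradients (an initialize_adam helper, analogous to initialize_velocity). The counter t must start from 1, otherwise the bias-correction denominators 1 − β^t are zero. A sketch of the call inside the training loop:

t = 0
for minibatch_X, minibatch_Y in mini_batches:
    # ... forward pass, cost, and backward pass producing `grads` ...
    t = t + 1   # Adam step counter, used for bias correction
    parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                   learning_rate = 0.01)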
Comparing the optimizers
Suppose the dataset to be classified looks like the figure below:
Results for each algorithm:
Mini-batch gradient descent:
Mini-batch gradient descent with momentum:
Mini-batch gradient descent with Adam:
Learning rate decay
Suppose you run mini-batch gradient descent with a fairly small mini-batch size, say 64 or 128 examples. The iterations will be noisy (blue line): they head toward the minimum but never converge exactly, so the algorithm ends up wandering in the neighborhood of the minimum, because α is held fixed while different mini-batches contribute noise.
If instead you slowly decrease the learning rate α, then early on, while α is still large, learning stays relatively fast; as α shrinks, your steps become slower and smaller, and the trajectory (green line) ends up oscillating within a small region close to the minimum, rather than swinging widely around it late in training.
A standard schedule is α = α₀ / (1 + decay_rate · epoch_num). Other formulas include exponential decay, α = 0.95^{epoch_num} · α₀; α = (k/√epoch_num) · α₀ or α = (k/√t) · α₀ (where t is the mini-batch number); and a discrete staircase, where α is cut in half at fixed intervals.
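A minimal sketch of these schedules (the function names are illustrative; alpha0 is the initial learning rate):

import numpy as np

def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)   # alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num):
    return 0.95 ** epoch_num * alpha0              # 0.95^epoch_num * alpha0

def lr_sqrt_decay(alpha0, k, t):
    return k / np.sqrt(t) * alpha0                 # (k / sqrt(t)) * alpha0, t = mini-batch number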
Learning rate decay genuinely helps and can sometimes speed up training, but it is not the first thing to try when tuning hyperparameters.