Please credit the source when reposting. Thank you.
This post is organized from Andrew Ng's Coursera course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization. As a record of deep network optimization, it summarizes all three weeks of the course as key points, illustrated with examples.
The focus of this part is understanding a handful of optimization algorithms. Deep learning frameworks wrap them up nicely, so each usually takes a single line of code, but understanding how they work lets you apply them better.
The numbering continues from the previous post.
Q7: mini-batch gradient descent
There are three flavors of gradient descent:
1. (Batch) gradient descent: the original training method. Each update processes the entire data set and then takes one gradient step. Its advantage is that the cost function always moves in the decreasing direction; but if the data set is large, each step is slow.
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)
2. Stochastic gradient descent (SGD): performs one gradient-descent update per training example. The drawback is that it loses the speedup of vectorization, and although each step moves toward the minimum, the noise of individual examples makes the path oscillate instead of settling exactly at the global minimum.
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation on a single example
        a, caches = forward_propagation(X[:, j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)
3. Mini-batch gradient descent: a trade-off between batch GD and SGD. Choose a batch size with 1 < batch_size < m, so training is reasonably fast and the behavior of the cost function falls somewhere between the two.
Remember to handle the leftover, incomplete batch when partitioning the data; a sketch of a full mini-batch training loop follows the function and the batch-size note below.
import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)            # make the "random" minibatches reproducible
    m = X.shape[1]                  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini-batches of size mini_batch_size in the partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, mini_batch_size * k : mini_batch_size * (k + 1)]
        mini_batch_Y = shuffled_Y[:, mini_batch_size * k : mini_batch_size * (k + 1)]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size :]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches
Empirically, mini-batch sizes are usually chosen as powers of two (2^6, 2^7, ..., 2^10), and should also be matched to the available CPU/GPU memory.
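To make the training loop concrete, here is a minimal sketch of mini-batch gradient descent built on random_mini_batches. It is written in the same pseudocode style as the snippets above: X, Y, layers_dims, num_epochs and the course-style helpers (initialize_parameters, forward_propagation, compute_cost, backward_propagation, update_parameters) are assumed to be defined elsewhere.

parameters = initialize_parameters(layers_dims)
seed = 0
for i in range(0, num_epochs):
    seed += 1                                              # reshuffle differently each epoch
    minibatches = random_mini_batches(X, Y, mini_batch_size=64, seed=seed)
    for minibatch_X, minibatch_Y in minibatches:
        # Forward propagation on the current mini-batch
        a, caches = forward_propagation(minibatch_X, parameters)
        # Compute cost
        cost = compute_cost(a, minibatch_Y)
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)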
Q8. Optimization algorithms for gradient descent
Background: the exponentially weighted average
Formula: $v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$, where $\theta_t$ is the observation at step $t$ and $v_t$ is the running average.
The meaning of the exponentially weighted average is easiest to see from the daily-temperature example. Let $\theta_t$ be the temperature on day $t$ and compute $v_t$ from the formula with $v_0 = 0$. Unrolling the recursion gives
$v_t = (1-\beta)\,\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\,\theta_{t-2} + \cdots$
So $v_t$ is a weighted average of the daily temperatures, and the weights decay exponentially going back in time: the more recent the day, the larger its weight. Hence the name exponentially weighted average.
Next, consider the effect of the coefficient $\beta$:
$\beta = 0.9$ (red line): roughly an average over the last 10 days
$\beta = 0.98$ (green line): roughly an average over the last 50 days
$\beta = 0.5$ (yellow line): roughly an average over the last 2 days
Reason: take $\beta = 0.9$ as an example. $0.9^{10} \approx 0.35 \approx 1/e$, and taking $1/e$ as the cutoff (weights smaller than $1/e$ of the most recent weight are treated as negligible), the average effectively covers about the last 10 days. More generally, for any $\beta$, writing $\epsilon = 1-\beta$ we have $(1-\epsilon)^{1/\epsilon} \approx 1/e$, so letting $N = \frac{1}{1-\beta}$, the exponentially weighted average is approximately an average over the last $N$ days.
In theory one can also apply the bias correction $\frac{v_t}{1-\beta^t}$ to compensate for the low estimates at the start, but in practice it is often omitted.
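As a quick illustration (not from the course notebook), the following NumPy sketch computes the exponentially weighted average of a synthetic temperature series with and without bias correction; the data is made up purely for demonstration.

import numpy as np

np.random.seed(0)
theta = 20 + 5 * np.random.randn(100)             # synthetic daily temperatures (made-up data)

beta = 0.9
v = 0.0
ewa, ewa_corrected = [], []
for t, theta_t in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * theta_t           # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    ewa.append(v)
    ewa_corrected.append(v / (1 - beta ** t))     # bias-corrected estimate

print(ewa[:3])            # biased toward 0 during the first few days
print(ewa_corrected[:3])  # much closer to the actual temperatures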
1. Momentum
Formulas (for each layer, on every iteration):
$v_{dW} = \beta\, v_{dW} + (1-\beta)\,dW$
$v_{db} = \beta\, v_{db} + (1-\beta)\,db$
$W = W - \alpha\, v_{dW}$, $\quad b = b - \alpha\, v_{db}$
Analysis:
With the plain update rule, a large learning rate can overshoot and leave the region of the cost function, so only a small learning rate can be used, which makes training slow. Momentum exploits averaging: the up-and-down oscillations along the vertical axis roughly cancel to zero after averaging, while the component along the horizontal axis stays large and keeps making progress, so the overall update keeps moving forward along the horizontal axis with much less vertical oscillation.
Code:
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural network

    # Momentum update for each parameter
    for l in range(L):
        # compute velocities
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads['db' + str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * v["db" + str(l+1)]

    return parameters, v
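The velocity dictionary v has to be zero-initialized before the first call. The helper below is a minimal sketch of that step; the name initialize_velocity follows the convention of the course assignment, and zero initialization is the standard choice, but the exact code here is an illustration rather than the official solution.

import numpy as np

def initialize_velocity(parameters):
    """Initialize the velocity v as zero arrays with the same shapes as the parameters."""
    L = len(parameters) // 2
    v = {}
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
    return v

# Typical usage inside the training loop:
# v = initialize_velocity(parameters)
# parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01)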
2. RMSprop (root mean square prop)
Formulas (for each layer, on every iteration):
$S_{dW} = \beta\, S_{dW} + (1-\beta)\,(dW)^2$
$S_{db} = \beta\, S_{db} + (1-\beta)\,(db)^2$
$W = W - \alpha\,\dfrac{dW}{\sqrt{S_{dW}} + \epsilon}$, $\quad b = b - \alpha\,\dfrac{db}{\sqrt{S_{db}} + \epsilon}$
Analysis:
The only difference from Momentum is that it keeps an exponentially weighted average of the squared gradients, which is then used to rescale the update; this damps the oscillation in every dimension and makes the network converge faster. For example, along the vertical axis the oscillation is large, so S is large; dividing by sqrt(S) makes 1/sqrt(S), and hence the update in that direction, small. Conversely, where S is small (little oscillation), 1/sqrt(S) is large and that direction gets a larger update.
Code:
def update_parameters_with_rmsprop(parameters, grads, s, beta, learning_rate):
    """
    Update parameters using RMSprop

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    s -- python dictionary containing the moving average of the squared gradients:
                    s['dW' + str(l)] = ...
                    s['db' + str(l)] = ...
    beta -- the exponential decay hyperparameter for the squared-gradient average, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    s -- python dictionary containing your updated squared-gradient averages
    """
    L = len(parameters) // 2  # number of layers in the neural network

    # RMSprop update for each parameter
    for l in range(L):
        # moving average of the squared gradients
        s["dW" + str(l+1)] = beta * s["dW" + str(l+1)] + (1 - beta) * np.power(grads['dW' + str(l+1)], 2)
        s["db" + str(l+1)] = beta * s["db" + str(l+1)] + (1 - beta) * np.power(grads['db' + str(l+1)], 2)
        # update parameters, scaling each gradient by 1/sqrt(s)
        parameters["W" + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)] / (np.sqrt(s["dW" + str(l+1)]) + 1e-8)
        parameters["b" + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads["db" + str(l+1)] / (np.sqrt(s["db" + str(l+1)]) + 1e-8)

    return parameters, s
3. Adam
Formulas (for each layer, on every iteration):
$v_{dW} = \beta_1\, v_{dW} + (1-\beta_1)\,dW$, $\quad v_{dW}^{corrected} = \dfrac{v_{dW}}{1-\beta_1^{\,t}}$
$s_{dW} = \beta_2\, s_{dW} + (1-\beta_2)\,(dW)^2$, $\quad s_{dW}^{corrected} = \dfrac{s_{dW}}{1-\beta_2^{\,t}}$
$W = W - \alpha\,\dfrac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$ (and likewise for $b$), where $t$ is the iteration count.
Analysis:
Adam combines the advantages of 1 (Momentum) and 2 (RMSprop).
Code:
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- iteration count, used for bias correction
    learning_rate -- the learning rate, scalar
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2          # number of layers in the neural network
    v_corrected = {}                  # Initializing first moment estimate, python dictionary
    s_corrected = {}                  # Initializing second moment estimate, python dictionary

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - np.power(beta1, t))

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * np.power(grads["dW" + str(l+1)], 2)
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * np.power(grads["db" + str(l+1)], 2)

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - np.power(beta2, t))

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s
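Before calling update_parameters_with_adam, v and s must be zero-initialized, and t must be incremented on every update so the bias-correction denominators (1 - beta^t) are never zero. The snippet below is a minimal sketch of that bookkeeping; initialize_adam follows the assignment's naming, but the exact code here is an illustration rather than the official solution.

import numpy as np

def initialize_adam(parameters):
    """Initialize v and s as zero arrays matching the parameter shapes."""
    L = len(parameters) // 2
    v, s = {}, {}
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
    return v, s

# Typical usage:
# v, s = initialize_adam(parameters)
# t = 0
# for each mini-batch:
#     t = t + 1      # increment before the update
#     parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t)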
Q9. Learning rate decay
If a single fixed learning rate is used throughout training, then near the minimum the inherent noise of the samples keeps the parameters oscillating, so it is hard to converge precisely. Learning rate decay addresses this: use a larger learning rate at the start so progress is fast, and a smaller one as training approaches the minimum so it is easier to settle on the optimum.
Common schedules ($\alpha_0$ is the initial learning rate, epoch_num the epoch index); a short Python sketch follows the list:
(1) $\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\;\alpha_0$
(2) Exponential decay: $\alpha = 0.95^{\text{epoch\_num}}\;\alpha_0$
(3) $\alpha = \dfrac{k}{\sqrt{\text{epoch\_num}}}\;\alpha_0$
(4) Discrete staircase decay: use a different, piecewise-constant $\alpha$ for each stage of training.
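To make the schedules concrete, here is a small sketch of (1)-(3) as plain Python functions. The constants (decay_rate, k, the 0.95 base) are illustrative values, not prescriptions from the course.

def lr_inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    # schedule (1): alpha = alpha0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    # schedule (2): alpha = base^epoch_num * alpha0
    return (base ** epoch_num) * alpha0

def lr_sqrt_decay(alpha0, epoch_num, k=1.0):
    # schedule (3): alpha = k / sqrt(epoch_num) * alpha0, for epoch_num >= 1
    return k / (epoch_num ** 0.5) * alpha0

for epoch in range(1, 6):
    print(epoch, lr_inverse_decay(0.2, epoch), lr_exponential_decay(0.2, epoch), lr_sqrt_decay(0.2, epoch))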