1 mini-batch Gradient Descent
1.1 Steps and concepts
- Shuffle: randomly shuffle the examples (shuffle X and Y together, so that each column of X still corresponds to its label in Y):
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1, m))
- Partition: split (shuffled_X, shuffled_Y) into mini-batches:
# Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
num_complete_minibatches = math.floor(m / mini_batch_size)  # number of mini-batches of size mini_batch_size in your partitioning
for k in range(0, num_complete_minibatches):
    ### START CODE HERE ### (approx. 2 lines)
    mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
    mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
    ### END CODE HERE ###
    mini_batch = (mini_batch_X, mini_batch_Y)
    mini_batches.append(mini_batch)

# Handling the end case (last mini-batch < mini_batch_size)
if m % mini_batch_size != 0:
    ### START CODE HERE ### (approx. 2 lines)
    mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:m]
    mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:m]
    ### END CODE HERE ###
    mini_batch = (mini_batch_X, mini_batch_Y)
    mini_batches.append(mini_batch)
- Shuffling and partitioning are the two steps required to build mini-batches.
- Powers of 2 are typically chosen as the mini-batch size, e.g., 16, 32, 64, 128. A usage sketch of the full routine is shown below.
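As a usage sketch (assuming the shuffle-and-partition code above is wrapped in a helper, here called random_mini_batches purely for illustration), a toy run confirms the resulting batch shapes:

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # Sketch: wraps the shuffle + partition steps shown above
    np.random.seed(seed)
    m = X.shape[1]
    mini_batches = []
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))
    return mini_batches

# Toy data: 5 features, 10 examples; mini_batch_size=4 gives batches of 4, 4 and 2 columns
X = np.random.randn(5, 10)
Y = (np.random.randn(1, 10) > 0).astype(int)
for mini_batch_X, mini_batch_Y in random_mini_batches(X, Y, mini_batch_size=4):
    print(mini_batch_X.shape, mini_batch_Y.shape)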
2 Momentum
That is, a set of operations built around the velocity values v: initialize them, then use them when updating the parameters.
2.1 Initialization: initialize_velocity
# GRADED FUNCTION: initialize_velocity

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
        parameters['W' + str(l)] = Wl
        parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
        v['dW' + str(l)] = velocity of dWl
        v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v = {}

    # Initialize velocity
    for l in range(L):
        ### START CODE HERE ### (approx. 2 lines)
        v["dW" + str(l + 1)] = np.zeros(parameters['W' + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters['b' + str(l + 1)].shape)
        ### END CODE HERE ###

    return v
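As a quick sanity check (using a made-up two-layer parameters dictionary, purely for illustration), every velocity starts at zero with the same shape as its parameter:

import numpy as np

# Hypothetical two-layer parameters, only to illustrate the returned shapes
parameters = {"W1": np.random.randn(3, 2), "b1": np.zeros((3, 1)),
              "W2": np.random.randn(1, 3), "b2": np.zeros((1, 1))}
v = initialize_velocity(parameters)
print(v["dW1"].shape, v["db1"].shape)  # (3, 2) (3, 1)
print(v["dW2"].shape, v["db2"].shape)  # (1, 3) (1, 1)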
2.2 Updating parameters with Momentum
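The update rule implemented below is, for $l = 1, ..., L$ (with $\alpha$ the learning rate and $\beta$ the momentum hyperparameter):

$$v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]}, \qquad W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}}$$

$$v_{db^{[l]}} = \beta\, v_{db^{[l]}} + (1 - \beta)\, db^{[l]}, \qquad b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}}$$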
# GRADED FUNCTION: update_parameters_with_momentum

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
        parameters['W' + str(l)] = Wl
        parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
        grads['dW' + str(l)] = dWl
        grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
        v['dW' + str(l)] = ...
        v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):
        ### START CODE HERE ### (approx. 4 lines)
        # compute velocities
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]
        # update parameters
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
        ### END CODE HERE ###

    return parameters, v
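Putting the pieces together, a minimal training-loop sketch might look as follows; forward_propagation, compute_cost and backward_propagation are assumed helpers (hypothetical stand-ins for the rest of the model), and random_mini_batches is the helper sketched in section 1:

# Sketch of mini-batch training with Momentum (helper functions are assumed, not defined here)
v = initialize_velocity(parameters)
for epoch in range(num_epochs):
    for minibatch_X, minibatch_Y in random_mini_batches(X, Y, mini_batch_size):
        a3, caches = forward_propagation(minibatch_X, parameters)        # forward pass (assumed helper)
        cost = compute_cost(a3, minibatch_Y)                             # loss on this mini-batch (assumed helper)
        grads = backward_propagation(minibatch_X, minibatch_Y, caches)   # gradients (assumed helper)
        parameters, v = update_parameters_with_momentum(parameters, grads, v,
                                                        beta=0.9, learning_rate=0.01)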
3 Adam
Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
How does Adam work?
- It computes an exponentially weighted average of past gradients and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (after bias correction).
- It computes an exponentially weighted average of the squares of past gradients and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (after bias correction).
- It updates the parameters in a direction based on a combination of the information from "1" and "2".
The update rule is, for $l = 1, ..., L$:

$$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial J}{\partial W^{[l]}}, \qquad
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t}$$