This post walks through a concrete example to illustrate the differences between initialization methods: zero initialization, random initialization, and He initialization.
1. Model Construction
The model solves a binary classification problem; the two classes are shown as blue and red points in the figure below:
train_X, train_Y, test_X, test_Y = load_dataset()
The three-layer network is built as follows: the initialization argument selects which initialization method to use (each method is defined later); the model then runs forward and backward propagation with gradient descent to find the parameters that minimize the loss, and returns them.
import numpy as np
import matplotlib.pyplot as plt
# The helpers used below (forward_propagation, compute_loss,
# backward_propagation, update_parameters) are provided by the
# assignment's utility module.

def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """
    grads = {}
    costs = []                      # to keep track of the loss
    m = X.shape[1]                  # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)
        # Loss
        cost = compute_loss(a3, Y)
        # Backward propagation.
        grads = backward_propagation(X, Y, cache)
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        # Print and record the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per thousands)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
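As a minimal usage sketch (not part of the graded code), the model can be trained and evaluated with each initialization option using the data loaded above; predict here is assumed to be a helper from the assignment's utilities that prints and returns accuracy:

# A minimal usage sketch. It assumes train_X/train_Y/test_X/test_Y come from
# the load_dataset() call above, and that the assignment's utilities provide
# a predict(X, Y, parameters) helper that prints accuracy (an assumption).
for init in ("zeros", "random", "he"):
    print("Training with initialization =", init)
    parameters = model(train_X, train_Y, initialization=init)
    predictions_train = predict(train_X, train_Y, parameters)  # train accuracy
    predictions_test = predict(test_X, test_Y, parameters)     # test accuracy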
2. Zero Initialization
The parameters that need to be initialized are the weight matrices and the bias vectors.
# GRADED FUNCTION: initialize_parameters_zeros
def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """
    parameters = {}
    L = len(layers_dims)            # number of layers in the network

    for l in range(1, L):
        # Every weight matrix and bias vector starts at exactly zero.
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters
However, if all parameters are initialized to zero, every layer's pre-activation is $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} = 0$ and the network fails to break symmetry. For this three-layer model, both ReLU layers output $a = \mathrm{ReLU}(0) = 0$, and the final sigmoid layer outputs $a^{[3]} = \sigma(0) = 0.5$.
Since the prediction is always $\hat{y} = 0.5$, the cross-entropy loss is $-\ln(1/2)$ both for $y = 0$ and for $y = 1$; the cost therefore stays the same no matter how many iterations are run.
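This can be checked numerically with a standalone sketch (plain NumPy, independent of the model code):

import numpy as np

z = 0.0                             # every pre-activation is 0 under zero init
a_relu = np.maximum(0.0, z)         # ReLU(0) = 0
a_sigmoid = 1 / (1 + np.exp(-z))    # sigmoid(0) = 0.5

y_pred = a_sigmoid                  # the network always predicts 0.5
loss_y1 = -np.log(y_pred)           # loss when the true label is 1
loss_y0 = -np.log(1 - y_pred)       # loss when the true label is 0
print(loss_y1, loss_y0)             # both equal -ln(1/2), about 0.6931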
Plotting the resulting classifier over the data:
3. Random Initialization
# GRADED FUNCTION: initialize_parameters_random
def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """
    np.random.seed(3)               # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)            # integer representing the number of layers

    for l in range(1, L):
        # Large random weights: standard normal values scaled up by 10.
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters
The training behavior is shown below. Because the initial weights are large and random, the sigmoid output is close to 0 or 1 for many examples, and a confident wrong prediction incurs a very high loss, so the initial cost is high. Large weights also saturate the sigmoid, where the gradient is nearly zero, which slows down learning.
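The saturation effect is easy to verify with a standalone sketch: for large pre-activations the sigmoid output sits near 0 or 1, and its derivative $\sigma(z)(1-\sigma(z))$ is nearly zero.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Larger weights produce larger |z|; the sigmoid saturates and its
# derivative sigma(z) * (1 - sigma(z)) collapses toward zero.
for z in (0.5, 5.0, 50.0):
    s = sigmoid(z)
    print("z = {:5.1f}  sigmoid = {:.6f}  gradient = {:.2e}".format(z, s, s * (1 - s)))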
The resulting classifier:
4. He Initialization
The idea is similar to random initialization, but to keep the initial gradients from vanishing we want the weights not to be too large. Instead of multiplying the random values by 10, each layer's weights are scaled by $\sqrt{2/n^{[l-1]}}$, where $n^{[l-1]}$ is the size of the previous layer.
# GRADED FUNCTION: initialize_parameters_he
def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                  b1 -- bias vector of shape (layers_dims[1], 1)
                  ...
                  WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                  bL -- bias vector of shape (layers_dims[L], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1        # integer representing the number of layers

    for l in range(1, L + 1):
        # He initialization: scale by sqrt(2 / size of the previous layer).
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(2 / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))

    return parameters
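The factor $\sqrt{2/n^{[l-1]}}$ is chosen so that, with ReLU activations, the scale of the activations is roughly preserved from layer to layer. A standalone sketch (assuming standard-normal inputs for illustration) compares He scaling with the earlier *10 scaling:

import numpy as np

np.random.seed(0)
n_prev, n, m = 500, 500, 1000
a_prev = np.random.randn(n_prev, m)     # assumed standard-normal activations

W_he  = np.random.randn(n, n_prev) * np.sqrt(2 / n_prev)   # He scaling
W_big = np.random.randn(n, n_prev) * 10                    # "random * 10" scaling

rms = lambda a: np.sqrt(np.mean(a ** 2))                   # root-mean-square scale
a_he  = np.maximum(0, W_he @ a_prev)     # ReLU activations with He weights
a_big = np.maximum(0, W_big @ a_prev)    # ReLU activations with large weights

print(rms(a_prev), rms(a_he), rms(a_big))
# RMS stays near 1 under He scaling but is inflated by orders of
# magnitude under the *10 scaling.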
The training curve and the resulting classifier are shown below:
5. Summary
Accuracy increases across the three methods in turn. Zero initialization fails to break symmetry, so accuracy stays at 50%; random initialization with overly large weights yields tiny gradients and slow learning; He initialization performs best of the three.