Weight Initialization in Deep Neural Networks

When building deep neural networks, we noted that the way the weights are initialized is critical. In this post we look at why initialization matters and how to choose appropriate initial values for the weights.


1. Why Weight Initialization Matters


A well-chosen set of initial weights has the following advantages:

  • it speeds up the convergence of gradient descent
  • it increases the odds that gradient descent reaches a low training error

2. Writing the Code

To see these effects in practice, the following subsections implement three initialization schemes and compare them on the same model and dataset.

2.1 Data Preparation

   
   
    import numpy as np
    import matplotlib.pyplot as plt
    import sklearn
    import sklearn.datasets
    from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
    from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

    %matplotlib inline
    plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    # load image dataset: blue/red dots in circles
    train_X, train_Y, test_X, test_Y = load_dataset()

Running this code plots the dataset: blue and red dots arranged in circles.
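
As a quick sanity check, you can also print the array shapes. This assumes load_dataset returns data in the (features, examples) layout used throughout this notebook, i.e. 2 input features per column:

    # hypothetical sanity check of the data layout
    print("train_X shape:", train_X.shape)  # expected: (2, number of training examples)
    print("train_Y shape:", train_Y.shape)  # expected: (1, number of training examples)
    print("test_X shape:", test_X.shape)
    print("test_Y shape:", test_Y.shape)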


2.2 Writing the Initialization Functions

Initialize all weights to zero:


  
  
    # GRADED FUNCTION: initialize_parameters_zeros
    def initialize_parameters_zeros(layers_dims):
        """
        Arguments:
        layers_dims -- python array (list) containing the size of each layer.

        Returns:
        parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                      W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                      b1 -- bias vector of shape (layers_dims[1], 1)
                      WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                      bL -- bias vector of shape (layers_dims[L], 1)
        """
        parameters = {}
        L = len(layers_dims)  # number of layers in the network

        for l in range(1, L):
            ### START CODE HERE ### (≈ 2 lines of code)
            parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
            ### END CODE HERE ###

        return parameters
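
A minimal usage sketch (the layer sizes [3, 2, 1] are only an illustrative example, not the ones used by model later):

    parameters = initialize_parameters_zeros([3, 2, 1])
    print(parameters["W1"])  # (2, 3) matrix of zeros
    print(parameters["b1"])  # (2, 1) vector of zeros
    print(parameters["W2"])  # (1, 2) matrix of zeros
    print(parameters["b2"])  # (1, 1) vector of zeros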

Initialize the weights to relatively large random values:


  
  
    # GRADED FUNCTION: initialize_parameters_random
    def initialize_parameters_random(layers_dims):
        """
        Arguments:
        layers_dims -- python array (list) containing the size of each layer.

        Returns:
        parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                      W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                      b1 -- bias vector of shape (layers_dims[1], 1)
                      WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                      bL -- bias vector of shape (layers_dims[L], 1)
        """
        np.random.seed(3)  # This seed makes sure your "random" numbers will be the same as ours
        parameters = {}
        L = len(layers_dims)  # integer representing the number of layers

        for l in range(1, L):
            ### START CODE HERE ### (≈ 2 lines of code)
            parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
            ### END CODE HERE ###

        return parameters
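
Called the same way (again with illustrative layer sizes), this version produces large random weights and zero biases; the exact numbers depend on the seed:

    parameters = initialize_parameters_random([3, 2, 1])
    print(parameters["W1"])  # standard-normal entries scaled by 10, so magnitudes around 10
    print(parameters["b1"])  # biases stay at zero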

Initialize the weights to smaller random values ("He" initialization):


  
  
    # GRADED FUNCTION: initialize_parameters_he
    def initialize_parameters_he(layers_dims):
        """
        Arguments:
        layers_dims -- python array (list) containing the size of each layer.

        Returns:
        parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                      W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                      b1 -- bias vector of shape (layers_dims[1], 1)
                      WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                      bL -- bias vector of shape (layers_dims[L], 1)
        """
        np.random.seed(3)
        parameters = {}
        L = len(layers_dims) - 1  # integer representing the number of layers

        for l in range(1, L + 1):
            ### START CODE HERE ### (≈ 2 lines of code)
            parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
            parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
            ### END CODE HERE ###

        return parameters
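
He initialization draws each weight from a standard normal distribution and multiplies it by sqrt(2 / n_prev), where n_prev is the size of the previous layer, so the variance of each weight is roughly 2 / n_prev. A rough numerical sketch (layer sizes chosen only for illustration):

    parameters = initialize_parameters_he([2, 4, 1])
    W1, W2 = parameters["W1"], parameters["W2"]
    # For W1 the fan-in is 2, so the scale factor is sqrt(2/2) = 1.0;
    # for W2 the fan-in is 4, so the scale factor is sqrt(2/4) ≈ 0.71.
    print(W1.std(), W2.std())  # sample stds; with so few entries they only loosely track those factors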


2.3 Building the Neural Network Model


  
  
    def model(X, Y, learning_rate=0.01, num_iterations=15000, print_cost=True, initialization="he"):
        """
        Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

        Arguments:
        X -- input data, of shape (2, number of examples)
        Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
        learning_rate -- learning rate for gradient descent
        num_iterations -- number of iterations to run gradient descent
        print_cost -- if True, print the cost every 1000 iterations
        initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

        Returns:
        parameters -- parameters learnt by the model
        """
        grads = {}
        costs = []  # to keep track of the loss
        m = X.shape[1]  # number of examples
        layers_dims = [X.shape[0], 10, 5, 1]

        # Initialize parameters dictionary.
        if initialization == "zeros":
            parameters = initialize_parameters_zeros(layers_dims)
        elif initialization == "random":
            parameters = initialize_parameters_random(layers_dims)
        elif initialization == "he":
            parameters = initialize_parameters_he(layers_dims)

        # Loop (gradient descent)
        for i in range(0, num_iterations):
            # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
            a3, cache = forward_propagation(X, parameters)
            # Loss
            cost = compute_loss(a3, Y)
            # Backward propagation.
            grads = backward_propagation(X, Y, cache)
            # Update parameters.
            parameters = update_parameters(parameters, grads, learning_rate)
            # Print the loss every 1000 iterations
            if print_cost and i % 1000 == 0:
                print("Cost after iteration {}: {}".format(i, cost))
                costs.append(cost)

        # plot the loss
        plt.plot(costs)
        plt.ylabel('cost')
        plt.xlabel('iterations (per thousands)')
        plt.title("Learning rate = " + str(learning_rate))
        plt.show()

        return parameters


3. Comparing the Three Schemes


3.1 Scheme 1: Zero Initialization


  
  
    parameters = model(train_X, train_Y, initialization="zeros")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Output:


If all weights are initialized to zero, the cost never decreases, and both training and test performance are poor. The reason is that a network whose weights are all zero is symmetric: every neuron in a given layer computes the same output and receives the same gradient, so they all end up learning identical weights. The network then effectively behaves like a linear classifier, and performs no better than a single linear (logistic regression) classifier.
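
To make the symmetry argument concrete, here is a small standalone sketch (it does not use init_utils) of a 2-3-1 network whose parameters all start at zero: every hidden unit produces the same activation and receives the same gradient, so after any number of updates the rows of W1 are still identical to each other.

    import numpy as np

    np.random.seed(1)
    X = np.random.randn(2, 5)                 # 5 toy examples, 2 features each
    Y = (np.random.rand(1, 5) > 0.5) * 1.0    # toy binary labels

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # 2 -> 3 -> 1 network with every parameter initialized to zero
    W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
    W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))

    for step in range(3):
        # forward pass (sigmoid hidden layer, so gradients are not clipped to zero)
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # backward pass for the cross-entropy loss
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / X.shape[1]
        db2 = dZ2.mean(axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
        dW1 = dZ1 @ X.T / X.shape[1]
        db1 = dZ1.mean(axis=1, keepdims=True)
        # gradient step
        W1 -= 0.5 * dW1; b1 -= 0.5 * db1
        W2 -= 0.5 * dW2; b2 -= 0.5 * db2

    print(W1)  # all three rows are identical: the hidden units never differentiate
    print(W2)  # all three entries are identical as well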


3.2 Scheme 2: Large Random Initialization


  
  
    parameters = model(train_X, train_Y, initialization="random")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Output:


At the start of training the cost is extremely large. Because the randomly generated weights are large, the sigmoid activation pushes y_hat very close to 0 or 1 for some examples; when such a confident prediction is wrong, the cross-entropy loss involves log(0) and the cost blows up. Beyond that, initializing the weights with large values slows down optimization. Weights that are too large or too small lead to exploding or vanishing gradients, respectively.
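
A tiny numerical illustration of why the initial cost blows up (standalone, not using init_utils): with weights on the order of 10, pre-activations easily reach magnitudes in the tens, the sigmoid saturates to essentially 0 or 1, and a confident wrong prediction makes the cross-entropy term -log(y_hat) (or -log(1 - y_hat)) enormous.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # a pre-activation of -30 is easy to produce when weights are scaled by 10
    y_hat = sigmoid(-30.0)            # ~9.4e-14, numerically almost 0
    y = 1.0                           # but the true label is 1
    eps = 1e-30                       # small guard against log(0) if the sigmoid underflows to exactly 0
    loss = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    print(y_hat, loss)                # loss is about 30 for this single example

    # compare with a small pre-activation, as produced by He-scaled weights
    y_hat_small = sigmoid(-0.3)       # ~0.43
    loss_small = -np.log(y_hat_small)
    print(y_hat_small, loss_small)    # loss is about 0.85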


3.3 Scheme 3: He Initialization


  
  
    parameters = model(train_X, train_Y, initialization="he")
    print("On the train set:")
    predictions_train = predict(train_X, train_Y, parameters)
    print("On the test set:")
    predictions_test = predict(test_X, test_Y, parameters)

Output:


From the results, it is easy to see that "He" initialization trains remarkably well!


4. Summary


  • Different initializations lead to different training results.
  • Random initialization is used to break symmetry, so that each neuron can learn something different.
  • Do not set the initial values too large.
  • "He" initialization works best for networks with "ReLU" activation units.

