2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization — Notes and Quiz Solutions

2 Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization — Andrew Ng, Coursera course

WEEK 1

1.If you have 10,000,000 examples, how would you split the train/dev/test set?
98% train, 1% dev, 1% test

2.The dev and test set should:
Come from the same distribution

3.If your Neural Network model seems to have high bias, what of the following would be promising things to try? (Check all that apply.)
Increase the number of units in each hidden layer
Make the Neural Network deeper

4.You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)
Increase the regularization parameter lambda
Get more training data

5.What is weight decay?
A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.

6.What happens when you increase the regularization hyperparameter lambda?
Weights are pushed toward becoming smaller
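
To see why (answers 5 and 6), write out the L2-regularized gradient descent update; this is the standard derivation behind "weight decay", restated here for reference:

$$W^{[l]} := W^{[l]} - \alpha\left(\mathrm{d}W^{[l]} + \frac{\lambda}{m}W^{[l]}\right) = \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,\mathrm{d}W^{[l]}$$

The factor $\left(1 - \frac{\alpha\lambda}{m}\right) < 1$ multiplies the weights on every iteration, so a larger $\lambda$ shrinks (decays) them more strongly.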

7.With the inverted dropout technique, at test time:
You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training

8.Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)
Reducing the regularization effect
Causing the neural network to end up with a lower training set error

9.Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)
L2 regularization
Dropout
Data augmentation

10.Why do we normalize the inputs $x$?
It makes the cost function faster to optimize

3.1 - Forward propagation with dropout

Exercise: Implement the forward propagation with dropout. You are using a 3-layer neural network and will add dropout to the first and second hidden layers. We will not apply dropout to the input layer or output layer.

Instructions:
You would like to shut down some neurons in the first and second layers. To do that, you are going to carry out 4 Steps:

  1. In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$ using np.random.rand() to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} \; d^{[1](2)} \; \dots \; d^{[1](m)}]$ of the same dimension as $A^{[1]}$.
  2. Set each entry of $D^{[1]}$ to be 1 with probability (keep_prob), and 0 otherwise.

Hint: Let’s say that keep_prob = 0.8, which means that we want to keep about 80% of the neurons and drop out about 20% of them. We want to generate a vector that has 1’s and 0’s, where about 80% of them are 1 and about 20% are 0.
This Python statement:
X = (X < keep_prob).astype(int)

is conceptually the same as this if-else statement (for the simple case of a one-dimensional array):

for i, v in enumerate(X):
    if v < keep_prob:
        X[i] = 1
    else:  # v >= keep_prob
        X[i] = 0

Note that X = (X < keep_prob).astype(int) works with multi-dimensional arrays, and the resulting output preserves the dimensions of the input array.

Also note that without using .astype(int), the result is an array of booleans True and False, which Python automatically converts to 1 and 0 if we multiply it with numbers. (However, it’s better practice to convert data into the data type that we intend, so try using .astype(int).)

  3. Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. (You are shutting down some neurons.) You can think of $D^{[1]}$ as a mask, so that when it is multiplied with another matrix, it shuts down some of the values.
  4. Divide $A^{[1]}$ by keep_prob. By doing this you are assuring that the result of the cost will still have the same expected value as without dropout. (This technique is also called inverted dropout.) A full sketch of all four steps follows below.
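
Putting the four steps together, here is a minimal sketch of the forward pass with inverted dropout for the 3-layer network. It assumes the parameters dictionary and relu/sigmoid helpers in the style of the course notebooks; your assignment's exact signature and cache contents may differ slightly.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    # 3-layer network: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]

    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Steps 1-2: random mask, entries are 1 with probability keep_prob, 0 otherwise
    D1 = (np.random.rand(*A1.shape) < keep_prob).astype(int)
    # Step 3: shut down the corresponding neurons
    A1 = A1 * D1
    # Step 4: scale so A1 keeps the same expected value (inverted dropout)
    A1 = A1 / keep_prob

    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    D2 = (np.random.rand(*A2.shape) < keep_prob).astype(int)
    A2 = A2 * D2
    A2 = A2 / keep_prob

    # No dropout on the output layer
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    return A3, cache
```

At test time (quiz question 7 above), you run the same forward pass without the $D^{[1]}$, $D^{[2]}$ masks and without dividing by keep_prob.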

WEEK 2

Optimization algorithms

1.Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
$a^{[3]\{8\}(7)}$

2.Which of these statements about mini-batch gradient descent do you agree with?
One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.

3.Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
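
To make the contrast concrete, here is a generic sketch (my own, not code from the course) of one epoch of mini-batch gradient descent: the parameters are updated after every mini_batch_size examples, instead of after the whole training set. grads_fn and update_fn stand in for whatever gradient and update routines you use.

```python
import numpy as np

def minibatch_gd_epoch(X, Y, params, grads_fn, update_fn, mini_batch_size=64):
    # One pass (epoch) over the training set with mini-batch gradient descent.
    # grads_fn(X_batch, Y_batch, params) returns gradients; update_fn applies them.
    m = X.shape[1]                            # examples are columns, as in the course
    permutation = np.random.permutation(m)    # shuffle before partitioning
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    for k in range(0, m, mini_batch_size):
        X_batch = X_shuf[:, k:k + mini_batch_size]
        Y_batch = Y_shuf[:, k:k + mini_batch_size]
        grads = grads_fn(X_batch, Y_batch, params)
        params = update_fn(params, grads)     # parameters move after each mini-batch
    return params
```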

4.Suppose your learning algorithm’s cost $J$, plotted as a function of the number of iterations, looks like this (the quiz plot, which shows an oscillating cost curve, is not reproduced here):
If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.

5.Suppose the temperature in Casablanca over the first three days of January is the same. You track it with an exponentially weighted average; what are $v_2$ (without bias correction) and $v_2^{corrected}$ (with bias correction)?
$v_2 = 7.5,\; v_2^{corrected} = 10$
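
These values follow from the quiz setup as inferred from the answer: $\theta_1 = \theta_2 = 10\,°\mathrm{C}$, $\beta = 0.5$, $v_0 = 0$, and $v_t = \beta v_{t-1} + (1-\beta)\theta_t$:

$$v_1 = 0.5 \cdot 0 + 0.5 \cdot 10 = 5$$
$$v_2 = 0.5 \cdot 5 + 0.5 \cdot 10 = 7.5$$
$$v_2^{corrected} = \frac{v_2}{1-\beta^2} = \frac{7.5}{0.75} = 10$$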

6.Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
$\alpha = e^t \cdot \alpha_0$ (this grows $\alpha$ with $t$ rather than decaying it)

7.You use an exponentially weighted average, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$, to track the temperature on the London temperature dataset (the red line in the quiz figure, not reproduced here). The following are true:
Decreasing β will create more oscillation within the red line
Increasing β will shift the red line slightly to the right.

8.These plots were generated with gradient descent; with gradient descent with momentum (β = 0.5); and with gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?
(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)

9.Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J$. Which of the following techniques could help find parameter values that attain a small value for $J$? (Check all that apply.)
Try using Adam
Try tuning the learning rate α
Try mini-batch gradient descent
Try better random initialization for the weights

10.Which of the following statements about Adam is False?
Adam should be used with batch gradient computations, not with mini-batches.

WEEK 3

Hyperparameter tuning, Batch Normalization, Programming Frameworks

1.If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?
False

2.Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?
False

3.During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:
The amount of computational power you can access

4.If you think $\beta$ (the hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?
r = np.random.rand()
beta = 1 - 10**(-r - 1)
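
A quick sanity check (my own snippet, not part of the quiz) of why this samples the right range: r is uniform in [0, 1), so 1 − beta = 10**(−r − 1) is log-uniform over (0.01, 0.1], which puts beta between 0.9 and 0.99 and spreads the samples uniformly on a log scale of 1 − beta rather than uniformly in beta itself.

```python
import numpy as np

# Sample beta the recommended way and check the range it covers.
np.random.seed(0)
r = np.random.rand(100_000)        # uniform in [0, 1)
beta = 1 - 10 ** (-r - 1)          # so 1 - beta is log-uniform in (0.01, 0.1]

print(beta.min(), beta.max())      # ~0.900 and ~0.990
# About half the samples satisfy beta <= 1 - 10**(-1.5) ~ 0.968,
# because the sampling is uniform in log10(1 - beta), not in beta.
print(np.mean(beta <= 1 - 10 ** (-1.5)))   # ~0.5
```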

5.Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?
False

6.In batch normalization as presented in the videos, if you apply it on the $l$th layer of your neural network, what are you normalizing?
$z^{[l]}$

7.In the normalization formula $z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, why do we use epsilon?
To avoid division by zero

8.Which of the following statements about γ and β in Batch Norm are true?
They can be learned using Adam, Gradient descent with momentum, or RMSprop, not just with gradient descent.
They set the mean and variance of the linear variable $z^{[l]}$ of a given layer.

9.After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:
Perform the needed normalizations, use $\mu$ and $\sigma^2$ estimated using an exponentially weighted average across mini-batches seen during training.
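
As a small illustration of answers 7-9, here is a sketch of the Batch Norm transform at test time using running (exponentially weighted) estimates of $\mu$ and $\sigma^2$ collected during training. The function and variable names are my own, not from a course assignment:

```python
import numpy as np

def batchnorm_at_test_time(z, running_mean, running_var, gamma, beta, epsilon=1e-8):
    # Normalize the linear output z of a layer using the running estimates of
    # mean and variance (exponentially weighted averages over training mini-batches).
    z_norm = (z - running_mean) / np.sqrt(running_var + epsilon)
    # gamma and beta are learned parameters that set the mean and variance of the
    # normalized variable; they are trained with the same optimizer as the other
    # network parameters (gradient descent, momentum, RMSprop, or Adam).
    return gamma * z_norm + beta
```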

10.Which of these statements about deep learning programming frameworks are true? (Check all that apply)
A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.
Even if a project is currently open source, good governance of the project helps ensure that it remains open in the long term, rather than becoming closed or modified to benefit only one company.
