DL-1: Tips for Training Deep Neural Networks

Different problems call for different approaches.

For example, dropout is a technique for improving results on the testing data, not the training data.

Choosing a proper loss

  • Square Error

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

  • Cross Entropy

$$-\sum_{i=1}^{n}\hat{y}_i \ln y_i$$
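
As a rough sketch of how this choice looks in Keras (the model, layer sizes, and optimizer below are assumptions for illustration, not from the lecture), the loss is picked at compile time; with a softmax output, cross entropy usually trains much better than square error:

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# Illustrative model only: layer sizes, input dimension and optimizer are assumptions.
model = Sequential()
model.add(Dense(500, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

# Square error:
# model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
# Cross entropy (usually the better choice with a softmax output):
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
```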

Mini-batch

We do not really minimize total loss!

batch_size: the number of training examples processed in each mini-batch;
nb_epoch: the number of times the entire training set is passed through.
The total number of training examples does not change.

Mini-batch training is faster than full-batch gradient descent, although this is not always true with parallel computing.

Mini-batch has better performance!

Shuffle the training examples before each epoch; this is the default in Keras.
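
A minimal Keras sketch of mini-batch training, reusing the model compiled above (the data arrays and the values of batch_size and nb_epoch are placeholders; in Keras 2 the argument is called epochs instead of nb_epoch):

```python
import numpy as np

# Placeholder data just to make the call runnable: 10,000 samples with 784
# features and 10 one-hot classes (shapes are assumptions for illustration).
x_train = np.random.rand(10000, 784)
y_train = np.eye(10)[np.random.randint(0, 10, 10000)]

# Mini-batch training: 100 examples per parameter update, 20 passes over the data.
# Keras shuffles the training examples at the start of every epoch by default.
model.fit(x_train, y_train, batch_size=100, nb_epoch=20, shuffle=True)
```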

New activation function

Q: Vanishing Gradient Problem

  • Layers closer to the input have smaller gradients:
    • they learn very slowly,
    • and stay almost random.

  • Layers closer to the output have larger gradients:
    • they learn very fast,
    • and converge quickly, but based on the nearly random features from the lower layers.

2006: RBM → 2015: ReLU

ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Can be seen as an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem

With ReLU, inactive neurons output exactly zero and can be dropped, leaving a thinner linear network in which gradients do not become smaller as they propagate backwards.

(Figures: ReLU, ReLU1, ReLU2, ReLU3)
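
A tiny numpy sketch of the ReLU idea (purely illustrative): neurons whose input is negative output exactly zero and can be removed from the computation, which is what leaves a thinner, locally linear network.

```python
import numpy as np

def relu(z):
    # ReLU: pass positive inputs through unchanged, clip negative inputs to 0.
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))                 # [0.  0.  0.  1.5 3. ]
# The gradient is 1 where z > 0 and 0 elsewhere, so active paths pass
# gradients through unchanged instead of shrinking them like sigmoid does.
print((z > 0).astype(float))   # [0. 0. 0. 1. 1.]
```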

Adaptive Learning Rate

Set the learning rate η carefully.

  • If the learning rate is too large, the total loss may not decrease after each update.
  • If the learning rate is too small, training will be too slow.

Solution:

  • Popular & simple idea: reduce the learning rate by some factor every few epochs.
    • At the beginning, use a larger learning rate.
    • After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (see the sketch after this list).
  • The learning rate cannot be one-size-fits-all.
    • Give different parameters different learning rates.
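
A hedged sketch of the 1/t decay schedule using Keras's LearningRateScheduler callback (the initial rate of 0.1 is an assumed value):

```python
import math
from keras.callbacks import LearningRateScheduler

def one_over_t_decay(epoch):
    eta0 = 0.1                          # assumed initial learning rate
    return eta0 / math.sqrt(epoch + 1)  # eta^t = eta / sqrt(t + 1)

lr_schedule = LearningRateScheduler(one_over_t_decay)
# model.fit(x_train, y_train, batch_size=100, nb_epoch=20, callbacks=[lr_schedule])
```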

Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate.

$$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$$

$\eta$: a constant.
$g^i$: the value of $\partial L / \partial w$ obtained at the $i$-th update.

The denominator is the summation of the squares of all previous derivatives.

Observations:
1. The learning rate becomes smaller and smaller for every parameter as training proceeds.
2. Parameters with smaller derivatives get larger learning rates, and vice versa.
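
A minimal numpy sketch of the Adagrad update for a single parameter (the learning rate and the gradient values are made up for illustration; in practice g comes from backpropagation):

```python
import numpy as np

eta = 0.1        # the constant global learning rate (assumed value)
w = 0.0          # a single parameter, for illustration
sum_g2 = 0.0     # running sum of the squared derivatives
eps = 1e-8       # tiny constant to avoid division by zero

for t, g in enumerate([0.8, 0.6, 0.5, 0.1]):   # pretend values of dL/dw at each update
    sum_g2 += g ** 2
    eta_w = eta / (np.sqrt(sum_g2) + eps)      # parameter-dependent learning rate
    w = w - eta_w * g
    print(t, round(eta_w, 4), round(w, 4))
```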

  • Adagrad [John Duchi, JMLR’11]
  • RMSprop
    https://www.youtube.com/watch?v=O3sxAc4hxZU
  • Adadelta [Matthew D. Zeiler, arXiv’12]
  • “No more pesky learning rates” [Tom Schaul, arXiv’12]
  • AdaSecant [Caglar Gulcehre, arXiv’14]
  • Adam [Diederik P. Kingma, ICLR’15]
  • Nadam
    http://cs229.stanford.edu/proj2015/054_report.pdf
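
Most of these adaptive methods ship with Keras; a hedged sketch of swapping them in on the model compiled earlier (choosing Adam here is just an example):

```python
from keras.optimizers import SGD, Adagrad, RMSprop, Adadelta, Adam, Nadam

# Any of these can be passed to compile; Adam is a common default choice.
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
# model.compile(loss='categorical_crossentropy', optimizer=Adagrad(), metrics=['accuracy'])
```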

Momentum

(Figures: Momentum, Momentum1)
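
The idea behind the momentum figures: each update keeps part of the previous movement, which helps the optimizer roll past plateaus and shallow local minima. A minimal plain-Python sketch of gradient descent with momentum (the toy loss, learning rate, and momentum coefficient are assumptions):

```python
def grad(w):
    # Gradient of a toy loss L(w) = w**2, standing in for dL/dw from backprop.
    return 2.0 * w

eta, mu = 0.1, 0.9   # learning rate and momentum coefficient (assumed values)
w, v = 5.0, 0.0      # parameter and its accumulated "velocity"

for step in range(10):
    v = mu * v - eta * grad(w)   # movement = momentum of last step + effect of current gradient
    w = w + v
    print(step, round(w, 4))
```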

Overfitting

  • Learning target is defined by the training data.

  • Training data and testing data can be different.

  • The parameters achieving the learning target do not necessarily give good results on the testing data.

  • Panacea for Overfitting

    • Have more training data
    • Create more training data (e.g. by augmentation; see the sketch below)
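
One common way to create more training data for images is augmentation; a hedged Keras sketch (the transformation ranges are arbitrary examples, and x_images / y_train are assumed placeholders):

```python
from keras.preprocessing.image import ImageDataGenerator

# Generate shifted / rotated / flipped copies of the training images on the fly.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# x_images is assumed to have shape (samples, height, width, channels).
# model.fit_generator(datagen.flow(x_images, y_train, batch_size=100),
#                     samples_per_epoch=len(x_images), nb_epoch=20)
```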

Early Stopping

Keras-Early Stopping
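
A hedged sketch of early stopping with Keras's EarlyStopping callback (the patience value and validation split are assumptions):

```python
from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 3 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=3)

# model.fit(x_train, y_train, batch_size=100, nb_epoch=50,
#           validation_split=0.1, callbacks=[early_stop])
```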

Regularization

Weight decay is one kind of regularization.

Keras-regularizers
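
A hedged sketch of L2 weight decay on a single layer via keras.regularizers (the penalty strength 0.01 and the layer size are assumptions):

```python
from keras.layers import Dense
from keras.regularizers import l2

# L2 penalty (weight decay) on this layer's weights; 0.01 is an illustrative strength.
# Note: in Keras 1 the argument is W_regularizer instead of kernel_regularizer.
layer = Dense(500, activation='relu', kernel_regularizer=l2(0.01))
# model.add(layer)
```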

Dropout

Training
  • Each time before updating the parameters:
    1. Each neuron has a p% chance of being dropped out.
      The structure of the network is changed.
    2. Use the resulting thinner network for training.
      For each mini-batch, we resample the dropped-out neurons.
Testing
**No dropout**
  • If the dropout rate at training is p%, multiply all the weights by (1-p)%.
  • Assume the dropout rate is 50%:
    if a weight is w = 1 after training, set w = 0.5 for testing (see the Keras sketch below).
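
A hedged Keras sketch of dropout with a 50% rate, as in the example above (layer sizes are assumptions); Keras takes care of the training-versus-testing rescaling internally, so no manual weight scaling is needed:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(500, input_dim=784, activation='relu'))
model.add(Dropout(0.5))                  # each neuron has a 50% chance of being dropped
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Dropout is only active during training; at test time Keras uses the full
# network and handles the training/testing rescaling internally.
```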
Dropout - Intuitive Reason
  • When people work as a team, if everyone expects their partners to do the work, nothing gets done in the end.
  • However, if you know your partner may drop out, you will do the work better yourself.
  • When testing, no one actually drops out, so the results end up good.
Dropout is a kind of ensemble

(Figures: dropout1 – dropout5)

Network Structure

CNN is a very good example!
