DL-1: Tips for Training Deep Neural Networks

Different problems call for different approaches.

For example, dropout is a technique for improving results on the testing data, not the training data.

Choosing a proper loss

  • Square Error

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

  • Cross Entropy

$$-\sum_{i=1}^{n}\hat{y}_i \ln y_i$$
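
As a rough sketch of how this choice looks in Keras (the model, layer sizes, and optimizer below are assumptions for illustration, not from the lecture), the loss is picked at compile time; with a softmax output, cross entropy usually trains much better than square error:

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# Illustrative model only: layer sizes, input dimension and optimizer are assumptions.
model = Sequential()
model.add(Dense(500, input_dim=784))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))

# Square error:
# model.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
# Cross entropy (usually the better choice with a softmax output):
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
```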

Mini-batch

We do not really minimize total loss!

batch_size: the number of training examples processed in each mini-batch;
nb_epoch: the number of times the entire training set is passed through.
The total number of training examples does not change.

Mini-batch training is faster than full-batch gradient descent, although this is not always true with parallel computing.

Mini-batch has better performance!

Shuffle the training examples before each epoch; this is the default in Keras.
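
A minimal Keras sketch of mini-batch training, reusing the model compiled above (the data arrays and the values of batch_size and nb_epoch are placeholders; in Keras 2 the argument is called epochs instead of nb_epoch):

```python
import numpy as np

# Placeholder data just to make the call runnable: 10,000 samples with 784
# features and 10 one-hot classes (shapes are assumptions for illustration).
x_train = np.random.rand(10000, 784)
y_train = np.eye(10)[np.random.randint(0, 10, 10000)]

# Mini-batch training: 100 examples per parameter update, 20 passes over the data.
# Keras shuffles the training examples at the start of every epoch by default.
model.fit(x_train, y_train, batch_size=100, nb_epoch=20, shuffle=True)
```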

New activation function

Q: Vanishing Gradient Problem

  • Layers closer to the input have smaller gradients:
    • they learn very slowly,
    • and stay almost random.

  • Layers closer to the output have larger gradients:
    • they learn very fast,
    • and converge quickly, but based on the nearly random features from the lower layers.

2006: RBM → 2015: ReLU

ReLU: Rectified Linear Unit
1. Fast to compute
2. Biological reason
3. Can be seen as an infinite number of sigmoids with different biases
4. Handles the vanishing gradient problem

With ReLU, inactive neurons output exactly zero and can be dropped, leaving a thinner linear network in which gradients do not become smaller as they propagate backwards.

(Figures: ReLU, ReLU1, ReLU2, ReLU3)
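
A tiny numpy sketch of the ReLU idea (purely illustrative): neurons whose input is negative output exactly zero and can be removed from the computation, which is what leaves a thinner, locally linear network.

```python
import numpy as np

def relu(z):
    # ReLU: pass positive inputs through unchanged, clip negative inputs to 0.
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))                 # [0.  0.  0.  1.5 3. ]
# The gradient is 1 where z > 0 and 0 elsewhere, so active paths pass
# gradients through unchanged instead of shrinking them like sigmoid does.
print((z > 0).astype(float))   # [0. 0. 0. 1. 1.]
```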

Adaptive Learning Rate

Set the learning rate η carefully.

  • If the learning rate is too large, the total loss may not decrease after each update.
  • If the learning rate is too small, training will be too slow.

Solution:

  • Popular & simple idea: reduce the learning rate by some factor every few epochs.
    • At the beginning, use a larger learning rate.
    • After several epochs, reduce the learning rate, e.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$ (see the sketch after this list).
  • The learning rate cannot be one-size-fits-all.
    • Give different parameters different learning rates.
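
A hedged sketch of the 1/t decay schedule using Keras's LearningRateScheduler callback (the initial rate of 0.1 is an assumed value):

```python
import math
from keras.callbacks import LearningRateScheduler

def one_over_t_decay(epoch):
    eta0 = 0.1                          # assumed initial learning rate
    return eta0 / math.sqrt(epoch + 1)  # eta^t = eta / sqrt(t + 1)

lr_schedule = LearningRateScheduler(one_over_t_decay)
# model.fit(x_train, y_train, batch_size=100, nb_epoch=20, callbacks=[lr_schedule])
```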

Adagrad: $w \leftarrow w - \eta_w \frac{\partial L}{\partial w}$
$\eta_w$: parameter-dependent learning rate.

$$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$$

$\eta$: a constant.
$g^i$: the value of $\partial L / \partial w$ obtained at the $i$-th update.

The denominator is the summation of the squares of all previous derivatives.

Observations:
1. The learning rate becomes smaller and smaller for every parameter as training proceeds.
2. Parameters with smaller derivatives get larger learning rates, and vice versa.
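
A minimal numpy sketch of the Adagrad update for a single parameter (the learning rate and the gradient values are made up for illustration; in practice g comes from backpropagation):

```python
import numpy as np

eta = 0.1        # the constant global learning rate (assumed value)
w = 0.0          # a single parameter, for illustration
sum_g2 = 0.0     # running sum of the squared derivatives
eps = 1e-8       # tiny constant to avoid division by zero

for t, g in enumerate([0.8, 0.6, 0.5, 0.1]):   # pretend values of dL/dw at each update
    sum_g2 += g ** 2
    eta_w = eta / (np.sqrt(sum_g2) + eps)      # parameter-dependent learning rate
    w = w - eta_w * g
    print(t, round(eta_w, 4), round(w, 4))
```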

  • Adagrad [John Duchi, JMLR’11]
  • RMSprop
    https://www.youtube.com/watch?v=O3sxAc4hxZU
  • Adadelta [Matthew D. Zeiler, arXiv’12]
  • “No more pesky learning rates” [Tom Schaul, arXiv’12]
  • AdaSecant [Caglar Gulcehre, arXiv’14]
  • Adam [Diederik P. Kingma, ICLR’15]
  • Nadam
    http://cs229.stanford.edu/proj2015/054_report.pdf
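
Most of these adaptive methods ship with Keras; a hedged sketch of swapping them in on the model compiled earlier (choosing Adam here is just an example):

```python
from keras.optimizers import SGD, Adagrad, RMSprop, Adadelta, Adam, Nadam

# Any of these can be passed to compile; Adam is a common default choice.
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
# model.compile(loss='categorical_crossentropy', optimizer=Adagrad(), metrics=['accuracy'])
```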

Momentum

(Figures: Momentum, Momentum1)
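
The idea behind the momentum figures: each update keeps part of the previous movement, which helps the optimizer roll past plateaus and shallow local minima. A minimal plain-Python sketch of gradient descent with momentum (the toy loss, learning rate, and momentum coefficient are assumptions):

```python
def grad(w):
    # Gradient of a toy loss L(w) = w**2, standing in for dL/dw from backprop.
    return 2.0 * w

eta, mu = 0.1, 0.9   # learning rate and momentum coefficient (assumed values)
w, v = 5.0, 0.0      # parameter and its accumulated "velocity"

for step in range(10):
    v = mu * v - eta * grad(w)   # movement = momentum of last step + effect of current gradient
    w = w + v
    print(step, round(w, 4))
```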

Overfitting

  • Learning target is defined by the training data.

  • Training data and testing data can be different.

  • The parameters achieving the learning target do not necessarily give good results on the testing data.

  • Panacea for Overfitting

    • Have more training data
    • Create more training data (e.g. by augmentation; see the sketch below)
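
One common way to create more training data for images is augmentation; a hedged Keras sketch (the transformation ranges are arbitrary examples, and x_images / y_train are assumed placeholders):

```python
from keras.preprocessing.image import ImageDataGenerator

# Generate shifted / rotated / flipped copies of the training images on the fly.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# x_images is assumed to have shape (samples, height, width, channels).
# model.fit_generator(datagen.flow(x_images, y_train, batch_size=100),
#                     samples_per_epoch=len(x_images), nb_epoch=20)
```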

Early Stopping

Keras-Early Stopping
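
A hedged sketch of early stopping with Keras's EarlyStopping callback (the patience value and validation split are assumptions):

```python
from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 3 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=3)

# model.fit(x_train, y_train, batch_size=100, nb_epoch=50,
#           validation_split=0.1, callbacks=[early_stop])
```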

Regularization

Weight decay is one kind of regularization.

Keras-regularizers
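
A hedged sketch of L2 weight decay on a single layer via keras.regularizers (the penalty strength 0.01 and the layer size are assumptions):

```python
from keras.layers import Dense
from keras.regularizers import l2

# L2 penalty (weight decay) on this layer's weights; 0.01 is an illustrative strength.
# Note: in Keras 1 the argument is W_regularizer instead of kernel_regularizer.
layer = Dense(500, activation='relu', kernel_regularizer=l2(0.01))
# model.add(layer)
```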

Dropout

Training
  • Each time before updating the parameters:
    1. Each neuron has a p% chance of being dropped out.
      The structure of the network is changed.
    2. Use the resulting thinner network for training.
      For each mini-batch, we resample the dropped-out neurons.
Testing
**No dropout**
  • If the dropout rate at training is p%, multiply all the weights by (1-p)%.
  • Assume the dropout rate is 50%:
    if a weight is w = 1 after training, set w = 0.5 for testing (see the Keras sketch below).
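
A hedged Keras sketch of dropout with a 50% rate, as in the example above (layer sizes are assumptions); Keras takes care of the training-versus-testing rescaling internally, so no manual weight scaling is needed:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(500, input_dim=784, activation='relu'))
model.add(Dropout(0.5))                  # each neuron has a 50% chance of being dropped
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# Dropout is only active during training; at test time Keras uses the full
# network and handles the training/testing rescaling internally.
```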
Dropout - Intuitive Reason
  • When people work as a team, if everyone expects their partners to do the work, nothing gets done in the end.
  • However, if you know your partner may drop out, you will do the work better yourself.
  • When testing, no one actually drops out, so the results end up good.
Dropout is a kind of ensemble

(Figures: dropout1 – dropout5)

Network Structure

CNN is a very good example!
