Machine Learning(7)Neural network —— optimization techniques I

Chenjing Ding
2018/02/27


| notation | meaning |
| --- | --- |
| $g(x)$ | activation function |
| $x_n$ | the n-th input vector (simplified as $x$ when $n$ is not specified) |
| $x_{ni}$ | the i-th entry of $x_n$ (simplified as $x_i$ when $n$ is not specified) |
| $N$ | the number of input vectors |
| $K$ | the number of classes |
| $t_n$ | a K-dimensional vector whose k-th entry is 1 only when the n-th input vector belongs to the k-th class, $t_n = (0,0,\dots,1,\dots,0)$ |
| $y_j(x)$ | the output of the j-th output neuron |
| $y(x)$ | the output vector for input $x$; $y(x) = (y_1(x), \dots, y_K(x))$ |
| $W_{ji}^{\tau+1}$ | the $(\tau+1)$-th update of weight $W_{ji}$ |
| $W_{ji}^{\tau}$ | the $\tau$-th update of weight $W_{ji}$ |
| $\partial E(W)/\partial W_{ij}^{(m)}$ | the gradient with respect to the m-th layer weights |
| $l_i$ | the number of neurons in the i-th layer (simplified as $l$ when $i$ is not specified) |
| $W_{ji}^{(mn)}$ | the weight between layer m and layer n |

1. Regularization

To avoid overfitting:

E(W) = \sum_{n=1}^{N} L(t_n, y(x_n)) + \lambda \Omega(W)

$L(t_n, y(x_n))$ is a loss function;
$\Omega(W)$ is a regularizer: the L2 regularizer is $||W||^2 = \sum_i \sum_j w_{ji}^2$; the L1 regularizer is $|W| = \sum_i \sum_j |w_{ji}|$;
$\lambda$ is the regularization parameter.
The regularization term keeps every weight $w_{ji}$ from becoming too large, so the model cannot become overly complex by relying on many useless features.

1. What is L1, L2 regularization:
https://www.youtube.com/watch?v=TmzzQoO8mr4 (Chinese)

2. Regularization and cost function:
https://www.youtube.com/watch?v=KvtGD37Rm5I&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=40
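To make the regularized objective concrete, here is a minimal numpy sketch; the linear model, squared-error loss, and all function names are my own illustration, not code from this post:

```python
import numpy as np

# Sketch of E(W) = sum_n L(t_n, y(x_n)) + lambda * Omega(W) with L2 or L1 penalty.

def l2_penalty(W):
    # Omega(W) = sum_ij w_ji^2
    return np.sum(W ** 2)

def l1_penalty(W):
    # Omega(W) = sum_ij |w_ji|
    return np.sum(np.abs(W))

def regularized_loss(W, X, T, lam=1e-3, penalty=l2_penalty):
    Y = X @ W.T                       # predictions of a linear model, shape (N, K)
    data_loss = np.sum((Y - T) ** 2)  # sum_n L(t_n, y(x_n)) with squared error
    return data_loss + lam * penalty(W)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # N=5 inputs, 3 features
T = rng.normal(size=(5, 2))   # K=2 targets
W = rng.normal(size=(2, 3))   # weights, shape (K, features)
print(regularized_loss(W, X, T, lam=0.01, penalty=l2_penalty))
print(regularized_loss(W, X, T, lam=0.01, penalty=l1_penalty))
```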

2.Normalizing the Inputs

Convergence is faster if:

  • the mean of all input data is 0
    Since $\partial E(W)/\partial w_{ji} = y_i \, g' \, \partial E(W)/\partial y_j$, the weights feeding a neuron can only change all in the same direction if the inputs are all positive or all negative, which leads to slow, zig-zagging convergence.
  • the variance of all input data is the same
  • all input data are decorrelated if possible (using PCA to decorrelate them)
    If the inputs are correlated, the direction of steepest descent is not optimal and may even be nearly perpendicular to the direction towards the minimum. (A short preprocessing sketch follows after this list.)
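Here is a minimal numpy sketch of these three preprocessing steps, assuming the data is stored row-wise in a matrix X (the function name and toy data are mine):

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Zero-mean, unit-variance, PCA-decorrelated version of X (rows = samples)."""
    X = X - X.mean(axis=0)                  # 1. make the mean of every feature 0
    X = X / (X.std(axis=0) + eps)           # 2. give every feature the same variance
    cov = np.cov(X, rowvar=False)           # 3. decorrelate with PCA:
    eigvals, eigvecs = np.linalg.eigh(cov)  #    rotate onto the covariance eigenvectors
    return X @ eigvecs

# toy usage: correlated 2-D data becomes decorrelated
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Xn = normalize_inputs(raw)
print(np.round(np.cov(Xn, rowvar=False), 3))  # approximately diagonal
```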

3.Commonly Used Nonlinearities

The activation function is usually nonlinear; the most commonly used choices are listed below (a small numpy sketch of all of them follows at the end of this section).

  • logistic sigmoid

    \sigma(a) = \frac{1}{1+\exp(-a)}; \quad \sigma'(a) = \sigma(a)\,(1 - \sigma(a))

  • tanh

    \tanh(a) = 2\sigma(2a) - 1; \quad \tanh'(a) = 1 - \tanh^2(a)

    Advantage compared with the logistic sigmoid:

    $\tanh(a)$ is already centred at zero, so it often converges faster than the standard logistic sigmoid.



figure 1: nonlinear activation functions (left: logistic sigmoid; right: tanh)

  • softmax

    g_i(a) = \frac{\exp(a_i)}{\sum_j \exp(a_j)}; \quad \mathrm{softmax}(a + c\,\mathbf{1}) = \mathrm{softmax}(a) for any scalar $c$ (adding the same constant to every entry does not change the output)

  • ReLU

    g(a) = \max\{0, a\}; \quad g'(a) = \begin{cases} 1, & a > 0 \\ 0, & \text{else} \end{cases}

    Advantages:

    1. The gradient is passed on with a constant factor ($\partial E(W)/\partial w_{ji} = y_i \, g' \, \partial E(W)/\partial y_j$ with $g' = 1$ for $a > 0$), which makes it easier to propagate gradients through deep networks. (Imagine $g' < 1$ everywhere: then $\partial E(W)/\partial w_{ji}$ becomes smaller and smaller as the network gets deeper, and the gradient eventually vanishes.)

    2. The ReLU output does not need to be stored separately,
      which reduces the required memory by half compared with tanh.
      Because of these two features, ReLU has become the de-facto standard for deep networks.

    Disadvantages:

    1. Stuck at zero: if the ReLU output is zero for an input, the corresponding gradient is also zero and cannot be propagated any further back.

    2. Offset bias: the output is always non-negative, so the activations passed to the next layer are not centred at zero.

  • Leaky ReLU

    g(a) = \max\{\beta a, a\} (with a small slope $\beta$, e.g. 0.01)

    Advantages:

    1. avoid “stuck at zero”

    2. weaker offset bias.

  • ELU

    g(a) = \begin{cases} a, & a > 0 \\ e^a - 1, & a \le 0 \end{cases}

    No offset bias, but the activation needs to be stored.



    figure 2: ReLU (left), Leaky ReLU (middle), ELU (right)

  • usage of nonlinear functions

    1. Output nodes
      2-class classification: sigmoid
      multi-class classification: softmax
      regression tasks: tanh

    2. Internal nodes
      tanh is better than sigmoid for internal nodes since it is already centred at 0;
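The nonlinearities above are straightforward to implement; the following is a minimal numpy sketch (my own illustration, not code from this post) of the activations discussed in this section, using the shift invariance of softmax for numerical stability:

```python
import numpy as np

# Minimal numpy versions of the activations discussed above (illustrative only).

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))        # sigma'(a) = sigma(a) * (1 - sigma(a))

def tanh(a):
    return 2.0 * sigmoid(2.0 * a) - 1.0    # tanh'(a) = 1 - tanh(a)**2

def softmax(a):
    # subtracting max(a) exploits softmax(a + c*1) = softmax(a) to avoid overflow
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

def relu(a):
    return np.maximum(0.0, a)              # derivative: 1 where a > 0, else 0

def leaky_relu(a, beta=0.01):
    return np.maximum(beta * a, a)

def elu(a):
    return np.where(a > 0, a, np.exp(a) - 1.0)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a), leaky_relu(a), elu(a), sep="\n")
print(softmax(a).sum())                    # softmax outputs sum to 1
```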

4.Weight Initialization

If we normalize the input data, we also want to preserve that variance through the network, so that the output of each layer, which becomes the input of the next layer, has the same variance; this keeps convergence fast.
Thus the goal of weight initialization is to keep the variance of a layer's input and output the same.

y_j(x) = \sum_{i=1}^{l} w_{ji} x_i, \qquad \mathrm{Var}(w_{ji} x_i) = E(x_i)^2 \mathrm{Var}(w_{ji}) + E(w_{ji})^2 \mathrm{Var}(x_i) + \mathrm{Var}(w_{ji}) \mathrm{Var}(x_i)

If the input data and the weights have zero mean and are independent and identically distributed, this simplifies to

\mathrm{Var}(w_{ji} x_i) = \mathrm{Var}(w_{ji}) \mathrm{Var}(x_i) \;\Rightarrow\; \mathrm{Var}(y_j(x)) = \sum_{i=1}^{l} \mathrm{Var}(w_{ji}) \mathrm{Var}(x_i) = l \, \mathrm{Var}(w_{ji}) \, \mathrm{Var}(x_i)
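A quick Monte-Carlo check of this relation, as a sketch with made-up sizes (the variable names and numbers are mine):

```python
import numpy as np

# Empirical check of Var(y_j) = l * Var(w_ji) * Var(x_i) for zero-mean, i.i.d. w and x.
rng = np.random.default_rng(0)
l, trials = 256, 100_000
var_w, var_x = 0.01, 1.0

W = rng.normal(0.0, np.sqrt(var_w), size=(trials, l))
X = rng.normal(0.0, np.sqrt(var_x), size=(trials, l))
y = np.sum(W * X, axis=1)          # y_j = sum_i w_ji * x_i, one output per trial

print(y.var())                     # measured variance of the output
print(l * var_w * var_x)           # predicted: l * Var(w) * Var(x) = 2.56
```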

4.1 Glorot Initialization

If $\mathrm{Var}(w_{ji}) = \frac{1}{l_{in}}$, then $\mathrm{Var}(y_j(x)) = \mathrm{Var}(x_i)$; here $l_{in}$ is the number of input neurons connected to the j-th output neuron. Doing the same for the backpropagated gradient (with $l = l_{out}$) gives $\mathrm{Var}(w_{ji}) = \frac{1}{l_{out}}$.
The Glorot initialization compromises between these two conditions:

\mathrm{Var}(w_{ji}) = \frac{2}{l_{in} + l_{out}}

4.2 He Initialization

The Glorot initialization was derived for tanh (centred at 0). He et al. repeated the derivation for ReLU and proposed to use instead:

\mathrm{Var}(W) = \frac{2}{l_{in}}
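Both schemes are one-liners in numpy; here is a sketch (the Gaussian sampling and the function names are my own choice, a uniform version is also common):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_init(l_in, l_out):
    # Var(w) = 2 / (l_in + l_out), sampled from a zero-mean Gaussian
    std = np.sqrt(2.0 / (l_in + l_out))
    return rng.normal(0.0, std, size=(l_out, l_in))

def he_init(l_in, l_out):
    # Var(w) = 2 / l_in, intended for ReLU layers
    std = np.sqrt(2.0 / l_in)
    return rng.normal(0.0, std, size=(l_out, l_in))

W1 = glorot_init(784, 256)           # e.g. a tanh layer
W2 = he_init(256, 128)               # e.g. a ReLU layer
print(W1.var(), 2.0 / (784 + 256))   # measured vs. target variance
print(W2.var(), 2.0 / 256)
```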

5.Stochastic and Batch learning

In gradient descent, the last step is to adjust the weights in the direction of the negative gradient. The variants below differ in how much data is used to compute that gradient.

5.1 Batch learning

Process the full training data at once to compute the gradient:

E(W) = \sum_{n=1}^{N} E_n(W), \qquad w_{ji}^{\tau+1} = w_{ji}^{\tau} - \eta \frac{\partial E(W)}{\partial w_{ji}^{\tau}}

5.2 Stochastic learning

Choose a single training sample $x_n$ to obtain $E_n(W)$:

w_{ji}^{\tau+1} = w_{ji}^{\tau} - \eta \frac{\partial E_n(W)}{\partial w_{ji}^{\tau}}

5.3 Stochastic vs. Batch Learning
5.3.1 Batch learning advantages
  • Many acceleration techniques (e.g., conjugate gradients) only operate in batch learning.
  • Theoretical analysis of the weight dynamics and convergence rates are simpler.
5.3.2 Stochastic learning advantages
  • Usually much faster than batch learning.
  • Often results in better solutions.
  • Can be used for tracking changes.
5.4 Minibatch

Minibatch learning combines the two methods above: it processes only a small batch of training examples at a time.

5.4.1 Advantages
  • more stable than stochastic learning but faster than batch learning
  • takes advantage of redundancies in the training data (the same training sample can appear in different minibatches)
  • the inputs of a minibatch are stacked into a matrix, and matrix operations are more efficient than repeated vector operations
5.4.2 Caveat

The error function needs to be normalized by the minibatch size so that the same learning rate behaves comparably for different minibatch sizes. Suppose M is the minibatch size:

E(W) = \frac{1}{M} \sum_{n=1}^{M} E_n(W) + \frac{\lambda}{M} \Omega(W)
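A minimal sketch of a minibatch SGD loop with this normalization (the linear model, squared-error loss, and all names are my own illustration; M = 1 recovers stochastic learning and M = N recovers batch learning):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data and linear model y(x) = W x with squared-error loss (illustrative only)
N, D, K = 1000, 10, 3
X = rng.normal(size=(N, D))
T = rng.normal(size=(N, K))
W = rng.normal(0.0, np.sqrt(2.0 / (D + K)), size=(K, D))   # Glorot-style init

eta, lam, M = 0.1, 1e-3, 32    # learning rate, regularization, minibatch size

for epoch in range(10):
    perm = rng.permutation(N)                  # reshuffle the data each epoch
    for start in range(0, N, M):
        idx = perm[start:start + M]
        Xb, Tb = X[idx], T[idx]
        Yb = Xb @ W.T                          # predictions for the minibatch
        # E(W) = (1/M) sum_n ||y(x_n) - t_n||^2 + (lambda/M) ||W||^2
        grad = (2.0 / len(idx)) * (Yb - Tb).T @ Xb + (2.0 * lam / len(idx)) * W
        W -= eta * grad                        # w^(tau+1) = w^tau - eta * dE/dw
print(np.mean((X @ W.T - T) ** 2))             # final training error
```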
