Machine Learning (7) Neural Network —— Optimization Techniques I
Chenjing Ding
2018/02/27
| notation | meaning |
|---|---|
| $g(x)$ | activation function |
| $x^n$ | the n-th input vector (simplified as $x$ when $n$ is not specified) |
| $x_i^n$ | the i-th entry of $x^n$ (simplified as $x_i$ when $n$ is not specified) |
| $N$ | the number of input vectors |
| $K$ | the number of classes |
| $t^n$ | a K-dimensional vector whose k-th entry is 1 only when the n-th input vector belongs to the k-th class, e.g. $t^n = (0,0,\dots,1,\dots,0)$ |
| $y_j(x)$ | the output of the j-th output neuron |
| $y(x)$ | the output vector for input vector $x$; $y(x) = (y_1(x), \dots, y_K(x))$ |
| $W_{ji}^{\tau+1}$ | the $(\tau+1)$-th update of weight $W_{ji}$ |
| $W_{ji}^{\tau}$ | the $\tau$-th update of weight $W_{ji}$ |
| $\frac{\partial E(W)}{\partial W_{ij}^{(m)}}$ | the gradient of the m-th layer weights |
| $l_i$ | the number of neurons in the i-th layer (simplified as $l$ when $i$ is not specified) |
| $W_{ji}^{(mn)}$ | the weight between layer $m$ and layer $n$ |
1. Regularization
To avoid overfitting, a regularization term is added to the error function:

$$\tilde{E}(W) = E(W) + \lambda\,\Omega(W)$$

where $\Omega(W)$ is the regularizer: the L2 regularizer is $||W||^2 = \sum_i \sum_j w_{ji}^2$; the L1 regularizer is $|W| = \sum_i \sum_j |w_{ji}|$; and $\lambda$ is the regularization parameter.

This keeps every weight $w_{ji}$ from growing too big, so the model cannot become too complex by relying on many useless features.
References:
1. What is L1, L2 regularization: https://www.youtube.com/watch?v=TmzzQoO8mr4 (Chinese)
2. Regularization and Cost Function: https://www.youtube.com/watch?v=KvtGD37Rm5I&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=40
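As a concrete sketch (the function names and the toy weight matrix are my own, not from the lecture), the two penalties and the regularized error can be computed as:

```python
import numpy as np

def l2_penalty(W):
    # Omega(W) = ||W||^2 = sum_i sum_j w_ji^2
    return np.sum(W ** 2)

def l1_penalty(W):
    # Omega(W) = |W| = sum_i sum_j |w_ji|
    return np.sum(np.abs(W))

def regularized_error(E, W, lam, penalty=l2_penalty):
    # E~(W) = E(W) + lambda * Omega(W)
    return E + lam * penalty(W)

W = np.array([[0.5, -1.0],
              [2.0,  0.0]])
print(regularized_error(1.0, W, lam=0.1))                      # L2: 1.0 + 0.1 * 5.25 = 1.525
print(regularized_error(1.0, W, lam=0.1, penalty=l1_penalty))  # L1: 1.0 + 0.1 * 3.5  = 1.35
```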
2. Normalizing the Inputs

Convergence is faster if:
- the mean of all input data is 0. Since $\frac{\partial E(W)}{\partial w_{ji}} = y_i\, g'\, \frac{\partial E(W)}{\partial y_j}$, if the inputs are all positive or all negative, the weights can only change together, which leads to slow convergence.
- the variance of all input data is the same.
- the input dimensions are uncorrelated if possible (PCA can be used to decorrelate them). If the inputs are correlated, the direction of steepest descent is not optimal and may even be perpendicular to the direction towards the minimum.
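A minimal sketch of these steps in NumPy (the function names are my own): the PCA step rotates the centred data into the eigenbasis of its covariance matrix, which makes the dimensions uncorrelated.

```python
import numpy as np

def normalize(X):
    """Zero mean and unit variance per input dimension; X has shape (N, d)."""
    X = X - X.mean(axis=0)             # mean of every dimension -> 0
    return X / (X.std(axis=0) + 1e-8)  # variance of every dimension -> 1

def decorrelate(X):
    """PCA rotation: project centred data onto the eigenvectors of its covariance."""
    X = X - X.mean(axis=0)
    cov = X.T @ X / len(X)
    _, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric
    return X @ eigvecs

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])  # correlated data
Z = decorrelate(normalize(X))
print(np.round(Z.T @ Z / len(Z), 2))  # covariance is now approximately diagonal
```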
3. Commonly Used Nonlinearities

The activation function is often nonlinear; here are some commonly used ones.
logistic sigmoid

$$\sigma(a) = \frac{1}{1+\exp(-a)}; \qquad \sigma'(a) = \sigma(a)(1-\sigma(a))$$

tanh

$$\tanh(a) = 2\sigma(2a) - 1; \qquad \tanh'(a) = 1 - \tanh^2(a)$$

Advantages: compared with the logistic sigmoid, $\tanh(a)$ is already centred at zero, so it often converges faster than the standard logistic sigmoid.

figure 1: nonlinear activation functions (left: logistic sigmoid; right: tanh)
softmax

$$g_i(a) = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$$

Softmax is shift-invariant: $\mathrm{softmax}(a + c\mathbf{1}) = \mathrm{softmax}(a)$ for any scalar $c$, i.e. adding the same constant to every entry of $a$ does not change the output.

ReLU

$$g(a) = \max\{0, a\}; \qquad g'(a) = \begin{cases} 1, & a > 0 \\ 0, & \text{else} \end{cases}$$
Advantages: the gradient is passed on with a constant factor ($\frac{\partial E(W)}{\partial w_{ji}} = y_i\, g'\, \frac{\partial E(W)}{\partial y_j}$), which makes it easier to propagate gradients through deep networks. (Imagine $g' < 1$ instead: then $\frac{\partial E(W)}{\partial w_{ji}}$ would become smaller and smaller as the network gets deeper, and the gradient would finally be close to zero.)
There is no need to store the ReLU output separately, reducing the required memory by half compared with tanh!
Because of these two features, ReLU has become the de-facto standard for deep networks.
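To see why the constant factor matters, here is a toy illustration (my own, not from the lecture): the backpropagated gradient is multiplied by $g'$ once per layer, so any derivative below 1 shrinks it exponentially with depth.

```python
def backprop_factor(g_prime, depth):
    # gradient magnitude after passing through `depth` layers,
    # each multiplying by the activation derivative g'
    return g_prime ** depth

print(backprop_factor(0.25, 10))  # sigmoid's maximum derivative: ~9.5e-07, vanishing
print(backprop_factor(1.0, 10))   # ReLU in its active region: 1.0, preserved
```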
Disadvantages:
stuck at zero: if the output of ReLU is zero for an input, the corresponding gradient cannot be passed down to the next layer, since it is zero.
offset bias: the output is always non-negative, so it is not centred at zero.
Leaky ReLU

$$g(a) = \max\{\beta a, a\}$$

where $\beta$ is a small positive constant.

Advantages: avoids "stuck at zero"; weaker offset bias.
ELU

$$g(a) = \begin{cases} a, & a \ge 0 \\ e^a - 1, & a < 0 \end{cases}$$

No offset bias, but the activation needs to be stored for the backward pass.
figure 2: left: ReLU; middle: Leaky ReLU; right: ELU

usage of nonlinear functions

Output nodes
2-class classification: sigmoid
multi-class classification: softmax
regression tasks: tanh

Internal nodes
tanh is better than sigmoid for internal nodes since it is already centred at 0.
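The nonlinearities above can be sketched directly in NumPy (a minimal version, with my own function names); the softmax uses its shift-invariance to subtract the maximum entry before exponentiating, which avoids overflow.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return 2.0 * sigmoid(2.0 * a) - 1.0  # tanh(a) = 2*sigma(2a) - 1

def softmax(a):
    e = np.exp(a - np.max(a))            # shift-invariance: subtracting max is safe
    return e / np.sum(e)

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, beta=0.01):
    return np.maximum(beta * a, a)       # beta*a > a only when a < 0

def elu(a):
    return np.where(a >= 0, a, np.exp(a) - 1.0)

a = np.array([-2.0, 0.0, 3.0])
print(relu(a))                           # [0. 0. 3.]
print(np.allclose(tanh(a), np.tanh(a)))  # True: matches the identity above
```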
4. Weight Initialization

If we normalize all the input data, we also want to preserve its variance through each layer, so that the output of a layer (which is the input of the next layer) has the same variance; this makes convergence faster. Thus the goal is to make the variance of a layer's inputs and outputs equal, assuming that the inputs and weights have zero mean and are independent and identically distributed.
4.1 Glorot Initialization
If $\mathrm{Var}(w_{ji}) = \frac{1}{l_{in}}$, then $\mathrm{Var}(y_j(x)) = \mathrm{Var}(x_i)$; $l_{in}$ is the number of input neurons linked to the j-th output neuron. If we do the same for the backpropagated gradient ($l = l_{out}$), then $\mathrm{Var}(w_{ji}) = \frac{1}{l_{out}}$.
The Glorot initialization compromises between the two:

$$\mathrm{Var}(w_{ji}) = \frac{2}{l_{in} + l_{out}}$$
4.2 He Initialization

The Glorot initialization was derived for tanh (centred at 0). He et al. repeated the derivation for ReLU, which zeroes about half of the activations, and proposed to use instead:

$$\mathrm{Var}(w_{ji}) = \frac{2}{l_{in}}$$
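Both schemes can be sketched as follows (drawing from a normal distribution with the stated variance; a uniform version is equally common, and the function names are my own):

```python
import numpy as np

def glorot_init(l_in, l_out, rng):
    # Var(w) = 2 / (l_in + l_out): compromise between forward and backward pass
    std = np.sqrt(2.0 / (l_in + l_out))
    return rng.normal(0.0, std, size=(l_out, l_in))

def he_init(l_in, l_out, rng):
    # Var(w) = 2 / l_in: derived for ReLU, which zeroes about half the activations
    std = np.sqrt(2.0 / l_in)
    return rng.normal(0.0, std, size=(l_out, l_in))

rng = np.random.default_rng(0)
W = he_init(l_in=400, l_out=300, rng=rng)
print(W.var())  # close to 2/400 = 0.005
```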
5. Stochastic and Batch Learning

In gradient descent, the last step is to adjust the weights in the direction of the negative gradient. The update equation is:

$$W_{ji}^{\tau+1} = W_{ji}^{\tau} - \eta\, \frac{\partial E(W)}{\partial W_{ji}}$$

where $\eta$ is the learning rate.
5.1 Batch learning
Process the full dataset at once to compute the gradient.
5.2 Stochastic learning
Choose a single training sample $x^n$ to obtain $E_n(W)$, and update the weights using its gradient.
5.3 Stochastic vs. Batch Learning
5.3.1 Batch learning advantages
- Many acceleration techniques (e.g., conjugate gradients) only operate in batch learning.
- Theoretical analysis of the weight dynamics and convergence rates are simpler.
5.3.2 Stochastic advantages
- Usually much faster than batch learning.
- Often results in better solutions.
- Can be used for tracking changes.
5.4 Minibatch
Minibatch learning combines the two methods above: process only a small batch of training examples at a time.
5.4.1 Advantages
- more stable than stochastic learning but faster than batch learning
- takes advantage of redundancies in the training data (the same training sample can appear in different minibatches)
- the input becomes a matrix (a stack of input vectors), and matrix operations are more efficient than repeated vector operations
5.4.2 Caveat
The error function needs to be normalized by the minibatch size, because we want to keep the effective learning rate the same across different minibatch sizes. Suppose $M$ is the minibatch size:

$$E(W) = \frac{1}{M} \sum_{n=1}^{M} E_n(W)$$
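A minimal sketch of one minibatch update with this normalization (the names are my own; `grad_fn` stands in for whatever computes the per-sample gradient $\partial E_n / \partial W$):

```python
def minibatch_step(W, batch, grad_fn, eta):
    # average the per-sample gradients so the effective step size
    # does not depend on the minibatch size M
    M = len(batch)
    grad = sum(grad_fn(W, x) for x in batch) / M
    return W - eta * grad

# toy example: E_n(W) = 0.5 * (W - x_n)^2, so dE_n/dW = W - x_n;
# one step with eta = 1 moves W exactly to the minibatch mean
grad_fn = lambda W, x: W - x
print(minibatch_step(0.0, [1.0, 2.0, 3.0], grad_fn, eta=1.0))  # 2.0
```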