Machine Learning(7)Neural network —— Perceptrons



Chenjing Ding
2018/02/21


| notation | meaning |
| --- | --- |
| $g(x)$ | activation function |
| $x_n$ | the n-th input vector (written $x$ when $n$ is not specified) |
| $x_{ni}$ | the i-th entry of $x_n$ (written $x_i$ when $n$ is not specified) |
| $N$ | the number of input vectors |
| $K$ | the number of classes |
| $t_n$ | a K-dimensional vector whose k-th entry is 1 only when the n-th input vector belongs to the k-th class, $t_n = (0,0,\dots,1,\dots,0)$ |
| $y_j(x)$ | the output of the j-th output neuron |
| $y(x)$ | the output vector of input vector $x$; $y(x) = (y_1(x), \dots, y_K(x))$ |
| $W_{ji}^{\tau+1}$ | the $(\tau+1)$-th update of weight $W_{ji}$ |
| $W_{ji}^{\tau}$ | the $\tau$-th update of weight $W_{ji}$ |
| $\frac{\partial E(W)}{\partial W_{ij}^{(m)}}$ | the gradient of the m-th layer weight |
| $l_i$ | the number of neurons in the i-th layer |
| $W_{ji}^{(mn)}$ | the weight between layer m and layer n |

1. Two-layer perceptron

1.1 Construction

The two layers refer to the output layer and the input layer; the basic structure of a two-layer perceptron is as follows:



figure1 the structure of a two-layer perceptron

Input layer:
d neurons, where d is the dimension of an input vector x. Non-linear basis functions $\phi(x)$ can be applied to the input layer.

Weights:
$W_{ji}$, where j is the index of a neuron in the output layer and i is the index of a neuron in the input layer.

Output layer:
There are K classes, so there are K output functions $y_j(x)$. An activation function g(x) can be applied to the output layer.

$$\exists\, y_i(x) > y_j(x),\ \forall j \neq i \;\Rightarrow\; \text{input data } x \in C_i$$
$$\text{linear output: } y_j(x) = \sum_{i=0}^{d} W_{ji}\,x_i \ \text{ or } \ \sum_{i=0}^{d} W_{ji}\,\phi(x_i)$$
$$\text{nonlinear output: } y_j(x) = g\Big(\sum_{i=0}^{d} W_{ji}\,x_i\Big) \ \text{ or } \ g\Big(\sum_{i=0}^{d} W_{ji}\,\phi(x_i)\Big)$$
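As an illustration, here is a minimal NumPy sketch of the output rule above. The function names (`two_layer_forward`, `classify`), the bias-column convention, and the optional `g`/`phi` arguments are assumptions made for this sketch, not part of the original post.

```python
import numpy as np

def two_layer_forward(W, x, g=None, phi=None):
    """Outputs y_j(x) of a two-layer perceptron.

    W   : (K, d+1) weight matrix; column 0 holds the bias weights W_j0
    x   : (d,) input vector
    g   : optional activation function (nonlinear output)
    phi : optional basis function applied entry-wise to the input
    """
    x = np.asarray(x, dtype=float)
    if phi is not None:
        x = phi(x)
    x = np.concatenate(([1.0], x))   # prepend x_0 = 1 for the bias term
    a = W @ x                        # a_j = sum_i W_ji * x_i
    return a if g is None else g(a)  # linear or nonlinear output

def classify(W, x, g=None, phi=None):
    """Assign x to the class C_i whose output y_i(x) is the largest."""
    return int(np.argmax(two_layer_forward(W, x, g, phi)))
```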

1.2 Learning: how to get $W_{ji}$

Gradient descent with sequential updating can be used to minimize the error function E(W) and adjust the weights.

step1: set up an error function E(W);
if we use L2 loss,

$$E_n(W) = \frac{1}{2}\big(y(x_n) - t_n\big)^2 = \frac{1}{2}\sum_{l=1}^{K}\big(y_l(x_n) - t_{nl}\big)^2 = \frac{1}{2}\sum_{l=1}^{K}\Big(\sum_{m=0}^{d} W_{lm}\,\phi(x_m) - t_{nl}\Big)^2$$

step2: calculate $\frac{\partial E_n(W)}{\partial W_{ji}}$;
$$\frac{\partial E_n(W)}{\partial W_{ji}} = \sum_{l=1}^{K}\Big[\big(y_l(x_n) - t_{nl}\big)\,\frac{\partial y_l(x_n)}{\partial W_{ji}}\Big] = \big(y_j(x_n) - t_{nj}\big)\,\phi(x_i)$$

step3: sequential updating, where $\eta$ is the learning rate;
$$W_{ji}^{\tau+1} = W_{ji}^{\tau} - \eta\,\frac{\partial E_n(W)}{\partial W_{ji}} = W_{ji}^{\tau} - \eta\,\big(y_j(x_n) - t_{nj}\big)\,\phi(x_i) \qquad \text{(delta rule / LMS rule)}$$

Thus, perceptron learning corresponds to gradient descent on a quadratic error function.
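A minimal sketch of steps 1-3 in NumPy, assuming an identity basis function $\phi(x)=x$ by default and a linear output; the function name `train_delta_rule` and the bias-column layout are my own choices for illustration.

```python
import numpy as np

def train_delta_rule(X, T, eta=0.1, epochs=100, phi=lambda x: x):
    """Sequential (online) gradient descent with the delta / LMS rule.

    X : (N, d) array of input vectors x_n
    T : (N, K) array of one-hot target vectors t_n
    """
    N, d = X.shape
    K = T.shape[1]
    W = np.zeros((K, d + 1))                        # extra column for the bias weight W_j0
    for _ in range(epochs):
        for x, t in zip(X, T):                      # sequential updating: one sample at a time
            feat = np.concatenate(([1.0], phi(x)))  # phi(x) with x_0 = 1
            y = W @ feat                            # outputs y_j(x_n) under the L2 loss of steps 1-2
            W -= eta * np.outer(y - t, feat)        # step 3: W_ji -= eta * (y_j - t_nj) * phi(x_i)
    return W
```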

  1. error function
  2. sequential updating and the delta rule
  3. gradient descent
1.3 Properties of the two-layer perceptron
  1. It can only represent linear functions, since

    $$y_j(x) = \sum_{i=0}^{d} W_{ji}\,x_i \ \text{ or } \ \sum_{i=0}^{d} W_{ji}\,\phi(x_i)$$

    The discriminant boundary is always linear in the input space x, or in the space $\phi(x)$ when the input layer uses basis functions $\phi(x)$; concretely, the boundary can be a line or a plane, but never a curve. However, a multi-layer perceptron with hidden units can represent any continuous function (see section 2).

  2. $\phi(x)$ and $g(x)$ are chosen in advance; they are fixed functions.

  3. There is always a bias term in the linear discriminant function (in y = ax + b, b is the bias term and does not depend on the input x). Hence the input layer always has d+1 input neurons, with $x_0$ fixed to 1, so that $y = a x_1 + b x_0$ with $x_1 = x$ and $d = 1$, as sketched below.
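A tiny sketch of this bias trick; the concrete numbers below are made up purely for illustration.

```python
import numpy as np

x = np.array([2.5])                 # original one-dimensional input, d = 1
x_aug = np.concatenate(([1.0], x))  # augmented input: x_0 = 1 carries the bias
w = np.array([0.7, -1.3])           # w = (b, a): bias weight first, then the slope
y = w @ x_aug                       # y = a*x_1 + b*x_0 = a*x + b
```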

2. Multi-layer perceptron

There are one or more hidden layers between the input layer and the output layer.
For example, a perceptron with one hidden layer is shown below:



figure2 the structure of a multi-layer perceptron

output:

$$y_k(x) = g^{(2)}\Big[\sum_{i=0}^{h} W_{ki}^{(2)}\, g^{(1)}\Big[\sum_{j=0}^{d} W_{ij}^{(1)}\, x_j\Big]\Big]$$
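A minimal sketch of this forward pass, assuming one hidden layer with h units; the names `mlp_forward`, `W1`, `W2` and the default tanh/identity activations are assumptions for illustration.

```python
import numpy as np

def mlp_forward(W1, W2, x, g1=np.tanh, g2=lambda a: a):
    """Forward pass of a perceptron with one hidden layer.

    W1 : (h, d+1) weights between the input layer and the hidden layer
    W2 : (K, h+1) weights between the hidden layer and the output layer
    g1, g2 : activation functions of the hidden and output layers
    """
    x = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # x_0 = 1 (input bias)
    h_act = g1(W1 @ x)                                       # hidden activations g^(1)[...]
    h_act = np.concatenate(([1.0], h_act))                   # hidden bias unit
    return g2(W2 @ h_act)                                    # outputs y_k(x) = g^(2)[...]
```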

In 1.2 we saw how to learn the weights of a two-layer perceptron. In the same way, for multiple layers we also need an error function and gradient descent to update all the weights, but computing the gradient is more complex. There are two main steps:

step1: compute the gradient (see 2.1 Backpropagation);
step2: adjust the weights in the direction of the gradient, the same as step 3 in 1.2; we will later focus on some optimization techniques to improve the performance (see Machine Learning(7)Neural network–optimization techniques).

2.1 Backpropagation
2.1.1 How to use backpropagation



figure3 the structure of a multi-layer perceptron

Let the indices of the layers be m, n and q from top to bottom, and let the numbers of neurons in these layers be $l_m$, $l_n$ and $l_q$. Between two adjacent layers, the upper layer always acts as the output layer with neuron index j, and the lower layer always acts as the input layer with neuron index i.

Our goal is to obtain the gradient $\frac{\partial E(W)}{\partial W_{ji}^{(mn)}}$:

$$y_j^{(m)} = g\big(z_j^{(n)}\big), \qquad z_j^{(n)} = \sum_{i=1}^{l_n} W_{ji}^{(mn)}\, y_i^{(n)}$$
$$\frac{\partial E(W)}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial z_j^{(n)}}\,\frac{\partial z_j^{(n)}}{\partial W_{ji}^{(mn)}}$$
Thus three gradients need to be computed to get the result:
$$\frac{\partial E(W)}{\partial z_j^{(n)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial z_j^{(n)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\, g'$$
$$\frac{\partial z_j^{(n)}}{\partial W_{ji}^{(mn)}} = y_i^{(n)} \;\Rightarrow\; \frac{\partial E(W)}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\, g'\, y_i^{(n)}$$
$$\frac{\partial E(W)}{\partial y_i^{(n)}} = \sum_{j=1}^{l_m} \frac{\partial E(W)}{\partial z_j^{(n)}}\,\frac{\partial z_j^{(n)}}{\partial y_i^{(n)}} = \sum_{j=1}^{l_m} W_{ji}^{(mn)}\,\frac{\partial E(W)}{\partial z_j^{(n)}}$$
Thus, the three gradients above need to be calculated between every two adjacent layers. Once we have $\frac{\partial E(W)}{\partial y_j^{(m)}}$ from the layer above, we can obtain $\frac{\partial E(W)}{\partial W_{ji}^{(mn)}}$ and compute $\frac{\partial E(W)}{\partial y_i^{(n)}}$ to prepare the calculation of $\frac{\partial E(W)}{\partial z_j^{(q)}}$ for the next layer down. This is called reverse-mode differentiation.
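A sketch of this per-layer computation in NumPy; the function name `backprop_layer` and its argument layout are assumptions, but the three lines inside mirror the three gradients derived above.

```python
import numpy as np

def backprop_layer(dE_dy_m, y_n, z_n, W_mn, g_prime):
    """One backpropagation step between adjacent layers n (below) and m (above).

    dE_dy_m : (l_m,) gradient dE/dy^(m) coming from the layer above
    y_n     : (l_n,) outputs of layer n
    z_n     : (l_m,) pre-activations z_j^(n) = sum_i W_ji^(mn) * y_i^(n)
    W_mn    : (l_m, l_n) weight matrix between layers m and n
    g_prime : derivative g' of the activation function
    Returns (dE/dW^(mn), dE/dy^(n)): the weight gradient and the message for the next layer down.
    """
    dE_dz = dE_dy_m * g_prime(z_n)   # dE/dz_j^(n) = dE/dy_j^(m) * g'
    dE_dW = np.outer(dE_dz, y_n)     # dE/dW_ji^(mn) = dE/dz_j^(n) * y_i^(n)
    dE_dy_n = W_mn.T @ dE_dz         # dE/dy_i^(n) = sum_j W_ji^(mn) * dE/dz_j^(n)
    return dE_dW, dE_dy_n
```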

2.1.2 Why use backpropagation with reverse-mode differentiation

For any two adjacent layers m and n, there are two ways to calculate $\frac{\partial E(W)}{\partial W_{ij}^{(mn)}}$. To simplify, suppose we want to calculate $\frac{\partial Z}{\partial X}$: one way is to apply the operator $\frac{\partial}{\partial X}$ to every node, which is called forward-mode differentiation; the other way is to apply the operator $\frac{\partial Z}{\partial(\cdot)}$ to every node, which is called reverse-mode differentiation.

figure4 computation graph

1: forward-mode differentiation

figure5 forward-mode differentiation computation graph

$$\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\frac{\partial d}{\partial b} = 5$$
Visiting all the red lines yields only this single gradient, which is not efficient.

Forward-mode differentiation applies the operator $\frac{\partial}{\partial b}$ to every node; in our case the operator is $\frac{\partial}{\partial y_j^{(m)}}$ if the goal is to obtain $\frac{\partial E(W)}{\partial W_{j_m i}^{(m,\,m-1)}}$ (the index of the bottommost layer is 0):

$$\frac{\partial E(W)}{\partial W_{j_1 i}^{(10)}} = \frac{\partial E(W)}{\partial y_{j_1}^{(1)}}\,\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_2}^{(2)}}\,\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_3}\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_3}^{(3)}}\,\frac{\partial y_{j_3}^{(3)}}{\partial y_{j_2}^{(2)}}\,\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_{q-1}}\cdots\sum_{j_3}\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_{q-1}}^{(q-1)}}\,\frac{\partial y_{j_{q-1}}^{(q-1)}}{\partial y_{j_{q-2}}^{(q-2)}}\cdots\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}}$$

Thus we have to visit every layer just to obtain $\frac{\partial E(W)}{\partial W_{j_1 i}^{(10)}}$; when it comes to $\frac{\partial E(W)}{\partial W_{j_1, i+1}^{(10)}}$, we have to visit every layer again!
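To make the cost concrete, here is a sketch of forward-mode differentiation on a toy graph. Since the figure is not reproduced here, the graph is assumed to be $c = a + b$, $d = b + 1$, $e = c \cdot d$ with $a = 2$, $b = 1$, which reproduces $\frac{\partial e}{\partial b} = 5$; note that each input variable needs its own full pass.

```python
def forward_mode(a, b, da, db):
    """Forward-mode differentiation: propagate values and tangents d(.)/dX upward,
    for one chosen input variable X (selected by seeding da, db)."""
    c, dc = a + b, da + db           # dc/dX = da/dX + db/dX
    d, dd = b + 1, db                # dd/dX = db/dX
    e, de = c * d, dc * d + c * dd   # product rule for e = c * d
    return e, de

# one complete pass per input variable:
_, de_db = forward_mode(2.0, 1.0, da=0.0, db=1.0)  # seed X = b  ->  de/db = 5.0
_, de_da = forward_mode(2.0, 1.0, da=1.0, db=0.0)  # seed X = a  ->  de/da = 2.0
```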

2: reverse-mode differentiation



figure6 Reverse-mode differentiation computation graph

From the graph above, a single pass gives $\frac{\partial e}{\partial(\cdot)}$ for all nodes, which is more efficient than forward-mode differentiation.
Reverse-mode differentiation applies $\frac{\partial e}{\partial(\cdot)}$ to every node; in our case it is $\frac{\partial E(W)}{\partial(\cdot)}$. That is to say, $\frac{\partial E(W)}{\partial y_{j_{q-1}}^{(q-1)}}, \frac{\partial E(W)}{\partial y_{j_{q-2}}^{(q-2)}}, \dots, \frac{\partial E(W)}{\partial y_{j_1}^{(1)}}$ are calculated in order.
Then $\frac{\partial E(W)}{\partial W_{j_{q-1} i_{q-1}}^{(q-1,\,q-2)}}, \frac{\partial E(W)}{\partial W_{j_{q-2} i_{q-2}}^{(q-2,\,q-3)}}, \dots, \frac{\partial E(W)}{\partial W_{j_1 i_1}^{(1,0)}}$ are obtained as well. As mentioned above, $i_m$ is the index of a neuron in the m-th layer when that layer acts as the input layer; $i_m$ ranges from 0 to $l_m$, and $j_m$ is defined similarly.

From all of the above, reverse-mode differentiation can compute all the derivatives in a single pass; that is why backpropagation uses reverse-mode differentiation.
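For comparison, a sketch of reverse-mode differentiation on the same assumed toy graph ($c = a + b$, $d = b + 1$, $e = c \cdot d$): a single backward pass yields $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$ at once.

```python
def reverse_mode(a, b):
    """Reverse-mode differentiation: one forward pass for the values,
    one backward pass for ALL derivatives de/d(.)."""
    # forward pass
    c = a + b
    d = b + 1
    e = c * d
    # backward pass: start from de/de = 1 and push gradients down the graph
    de_dc = 1.0 * d                     # e = c * d
    de_dd = 1.0 * c
    de_da = de_dc * 1.0                 # c = a + b
    de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d
    return e, de_da, de_db

print(reverse_mode(2.0, 1.0))           # (6.0, 2.0, 5.0): all derivatives in one pass
```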

The next topic will introduce some optimization techniques and how to implement these ideas in Python.
