Machine Learning 04 - Neural Networks

I am working through Stanford's machine learning course taught by Andrew Ng and taking notes as I go, so that I can review and consolidate later.
My knowledge is limited, so please bear with any errors or omissions and feel free to point them out.

Week 04

4.1 Model Representation

4.1.1 Origin of model

Neural networks are modelled on the neurons in the brain.

Neuron in the brain

4.1.2 Logistic unit

A basic model of a neural network is as follows:

Logistic unit

Remark :

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix}$$

$\theta$ is also called the "weights" in neural networks.
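
As a concrete illustration, here is a minimal Octave sketch of a single logistic unit; only the formula $h_\theta(x) = g(\theta^T x)$ comes from the course, the numeric inputs and weights are made-up values:

g = @(z) 1 ./ (1 + exp(-z));     % sigmoid activation function
x = [1; 2.0; -1.5; 0.3];         % [x0; x1; x2; x3], x0 = 1 is the bias unit (made-up values)
theta = [0.1; 0.5; -0.3; 0.8];   % weights [theta0; theta1; theta2; theta3] (made-up values)
h = g(theta' * x);               % output of the logistic unit, a value in (0, 1)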

4.1.3 Neural network

(1) Schematic diagram

neural network

Notation

$s_j$ - the number of units in layer $j$, not counting the bias unit

$a_i^{(j)}$ - the "activation" of unit $i$ in layer $j$

$\Theta^{(j)}$ - the matrix of weights controlling the function mapping from layer $j$ to layer $j+1$, with dimension $s_{j+1} \times (s_j + 1)$

$L$ - the total number of layers in the network
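
For example, in the network drawn above there are $s_1 = 3$ input units, $s_2 = 3$ hidden units, and one output unit, so $\Theta^{(1)}$ has dimension $3 \times 4$ and $\Theta^{(2)}$ has dimension $1 \times 4$.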

(2) Mathematical representation

Layer 2

$$\begin{aligned}
a_1^{(2)} &= g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
\end{aligned}$$

Layer 3

$$h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$$

(3) Vectorization

Layer 1

$$a^{(1)} = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ a_2^{(1)} \\ a_3^{(1)} \end{bmatrix} = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = x$$

Layer 2

$$a^{(2)} = \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix}
= \begin{bmatrix} g(z_1^{(2)}) \\ g(z_2^{(2)}) \\ g(z_3^{(2)}) \end{bmatrix}
= g\left( \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix} \right)
= g(z^{(2)}) = g(\Theta^{(1)} a^{(1)})
= \begin{bmatrix}
g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
\end{bmatrix}$$

Layer 3

$$h_\Theta(x) = a^{(3)} = g(z^{(3)}) = g(\Theta^{(2)} a^{(2)}) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$$

Remark :

$h_\Theta(x) \in [0, 1]$, but compared with logistic regression it is not a logistic function of the raw input $x$; the sigmoid is applied to the learned features $a^{(2)}$ instead.

The key to the vectorization is $a^{(j)} = g(z^{(j)}) = g(\Theta^{(j-1)} a^{(j-1)})$, which can be applied layer by layer like a "loop".
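
As an illustration, a minimal Octave sketch of this vectorized forward propagation for the three-layer network above; Theta1, Theta2, and the input column vector x are assumed to already exist, and g is the sigmoid:

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation

a1 = [1; x];                   % a(1) = x with the bias unit x0 = 1 added
z2 = Theta1 * a1;              % z(2) = Theta(1) * a(1)
a2 = [1; g(z2)];               % a(2) = g(z(2)) with the bias unit added
z3 = Theta2 * a2;              % z(3) = Theta(2) * a(2)
h  = g(z3);                    % h_Theta(x) = a(3)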

4.1.4 Multiclass classification

To classify data into multiple types, let the hypothesis function return a vector of values.

Similarly, we use the one-vs-all method to solve the multiclass classification problem.

The multiple output units :

one-vs-all
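
A minimal Octave sketch of how the labels and predictions can be handled; num_labels, the integer label y, and the output activations a3 are my own assumed variable names, not symbols from the course slides:

y_vec = zeros(num_labels, 1);  % one-hot vector, e.g. y = 3 with num_labels = 4 gives [0; 0; 1; 0]
y_vec(y) = 1;

[max_val, pred] = max(a3);     % predicted class = output unit with the largest activation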

4.2 Backpropagation

4.2.1 Cost function

Review the cost function of logistic regression :

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} \left( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

In a neural network we have $K$ outputs, that is

$$h_\Theta(x) \in \mathbb{R}^K, \qquad (h_\Theta(x))_i = i^{\text{th}} \text{ output}$$

then the cost function of the neural network is the sum over all $K$ logistic cost functions:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)} \log\left((h_\Theta(x^{(i)}))_k\right) + (1 - y_k^{(i)}) \log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_{l+1}}\sum_{j=1}^{s_l}\left(\Theta_{ij}^{(l)}\right)^2$$

Remark: in the regularization term, the bias weights (the $j = 0$ column of each $\Theta^{(l)}$) are not regularized; the row index $i$ runs over the units of layer $l+1$ and the column index $j$ over the non-bias units of layer $l$.
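
A minimal Octave sketch of this cost for a three-layer network; it assumes H (an m x K matrix whose rows are $h_\Theta(x^{(i)})$) and Y (an m x K matrix of one-hot labels) have already been computed by forward propagation. Note how the first column (bias weights) of each Theta matrix is excluded from the regularization term, as in the remark above:

J = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
reg = (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));
J = J + reg;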

4.2.2 Gradient of cost function and algorithm

In order to use gradient descent or another optimization algorithm, we need to compute $J(\Theta)$ and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$.

Let

$$\delta^{(L)} = a^{(L)} - y; \qquad \delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} .* g'(z^{(l)}), \quad 2 \le l \le L-1$$

(for a detailed derivation, see the reference material BP算法的推导过程, i.e. the derivation of the backpropagation algorithm)

then

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)} \qquad (\lambda = 0)$$
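
For the sigmoid activation used here, the derivative $g'$ that appears in the definition of $\delta^{(l)}$ can be computed element-wise from the sigmoid itself:

$$g'(z^{(l)}) = g(z^{(l)}) .* (1 - g(z^{(l)}))$$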

  • Backpropagation algorithm for a neural network - Algorithm 3

Training set $\{(x^{(1)}, y^{(1)}), \cdots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ for all $l, i, j$
For $i = 1$ to $m$:
    Set $a^{(1)} = x^{(i)}$
    Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \cdots, L$
    Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
    Compute $\delta^{(L-1)}, \delta^{(L-2)}, \cdots, \delta^{(2)}$
    $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
$D_{ij}^{(l)} := \frac{1}{m}\left(\Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}\right)$, if $j \neq 0$
$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}$, if $j = 0$

Thus we get $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.
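
A minimal Octave sketch of one pass of this algorithm for the three-layer network; X (m x n input matrix), Y (m x K one-hot labels), Theta1, Theta2 and lambda are assumed to exist, and the variable names are my own, not the course's reference implementation:

g = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for i = 1:m
    % forward propagation for example i
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;  a2 = [1; g(z2)];
    z3 = Theta2 * a2;  a3 = g(z3);

    % backpropagate the errors
    d3 = a3 - Y(i, :)';
    d2 = Theta2' * d3;
    d2 = d2(2:end) .* g(z2) .* (1 - g(z2));   % drop the bias term, apply g'(z2)

    % accumulate the gradients
    Delta2 = Delta2 + d3 * a2';
    Delta1 = Delta1 + d2 * a1';
end

D1 = Delta1 / m;  D1(:, 2:end) += (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda/m) * Theta2(:, 2:end);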

4.3 Implementation in Practice

4.3.1 Unrolling parameters

With neural networks we are working with sets of weight matrices; in order to use an advanced optimization function, we need to unroll them into one long vector.

Code : unroll

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [D1(:); D2(:); D3(:)]

Code : roll

Theta1 = reshape(thetaVector(1:110), 10, 11)
Theta2 = reshape(thetaVector(111:220), 10, 11)
Theta3 = reshape(thetaVector(221:231), 1, 11)
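
These unrolled vectors are what get passed to an advanced optimizer. A hedged sketch of the usual pattern, assuming a user-written costFunction that reshapes the vector back into Theta1/Theta2/Theta3 (using the same indices as above) and returns the cost and the unrolled gradient:

options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(@costFunction, initialTheta, options);

% inside costFunction(thetaVector):
%   Theta1 = reshape(thetaVector(1:110),   10, 11);
%   Theta2 = reshape(thetaVector(111:220), 10, 11);
%   Theta3 = reshape(thetaVector(221:231), 1, 11);
%   ... compute J and the gradients, return the gradients unrolled into one vector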

4.3.2 Gradient checking

In order to ensure that our backpropagation works as intended, we need to check the gradient numerically.

We can approximate the derivative of our cost function with:

$$\frac{\partial J(\Theta)}{\partial \Theta_k^{(j)}} \approx \frac{J(\Theta_1^{(j)}, \cdots, \Theta_k^{(j)} + \epsilon, \cdots, \Theta_n^{(j)}) - J(\Theta_1^{(j)}, \cdots, \Theta_k^{(j)} - \epsilon, \cdots, \Theta_n^{(j)})}{2\epsilon}$$

$\epsilon$ is usually set to $10^{-4}$ to guarantee accuracy.

Code

epsilon = 1e-4;
for i = 1:n
    thetaPlus = theta;
    thetaPlus(i) += epsilon;    % perturb only the i-th parameter upward
    thetaMinus = theta;
    thetaMinus(i) -= epsilon;   % perturb only the i-th parameter downward
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end
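
Once gradApprox has been computed, it can be compared with the unrolled gradient from backpropagation (deltaVector from the unrolling section above); a very small relative difference suggests backpropagation is correct. A small sketch:

relDiff = norm(gradApprox(:) - deltaVector(:)) / norm(gradApprox(:) + deltaVector(:));
% relDiff should be very small (e.g. around 1e-9) if backpropagation is implemented correctly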

4.3.3 Random initialization

Initializing all the theta weights to zero fails to break symmetry (every hidden unit ends up computing the same function), so instead we initialize theta randomly.

Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$.

Code

Theta1 = rand(10,11)*(2*INIT_EPSILON)-INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON)-INIT_EPSILON;
...
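
The value of INIT_EPSILON above is left open; one common heuristic (used in the course's programming exercise, as far as I remember) scales it with the sizes of the layers adjacent to the weight matrix:

% L_in and L_out are the numbers of units in the layers that Theta connects
INIT_EPSILON = sqrt(6) / sqrt(L_in + L_out);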

4.4 Summary

Pick a Network Architecture

  • number of input units = dimension of the features $x^{(i)}$
  • number of output units = number of classes
  • number of hidden units per layer = usually more is better (but must be balanced against the cost of computation)

Training a Neural Network

  • Randomly initialize the weights
  • Implement forward propagation
  • Implement the cost function
  • Implement backpropagation
  • Gradient checking (remember to disable it afterwards)
  • Use gradient descent or a built-in optimization function to minimize the cost function