Cost Function and Backpropagation
Cost Function
Neural Network (Classification)
Given a training set of $m$ examples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
$L$ = total number of layers in the network
$s_l$ = number of units (not counting the bias unit) in layer $l$
$\underline{\text{Binary classification}}$
$y = 0$ or $1$
1 output unit
$\underline{\text{Multi-class classification}}$ ($K$ classes)
$y \in \mathbb{R}^{K}$
$K$ output units
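For multi-class classification, each label $y^{(i)}$ is represented as a one-hot vector in $\mathbb{R}^K$. A minimal sketch (the helper name `one_hot` is illustrative, not from the notes):

```python
# One-hot encoding of integer class labels for K output units.
import numpy as np

def one_hot(labels, K):
    """Map integer class labels (0..K-1) to one-hot rows in R^K."""
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y
```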
Cost function
Logistic regression:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Neural network:
$h_\Theta(x)\in\mathbb{R}^K$, where $(h_\Theta(x))_i$ is the $i^{\text{th}}$ output
$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$
Note:
- the double sum simply adds up the logistic-regression costs calculated for each unit in the output layer
- the triple sum simply adds up the squares of all the individual $\Theta$s in the entire network
- the $i$ in the triple sum does not refer to training example $i$
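The cost above can be sketched in NumPy for a network with one hidden layer. This is a minimal sketch, not the course's reference implementation; the names (`nn_cost`, `Theta1`, `Theta2`) are illustrative.

```python
# Regularized neural-network cost J(Theta) for one hidden layer (L = 3).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """X: (m, n) inputs; Y: (m, K) one-hot labels; lam: regularization lambda."""
    m = X.shape[0]
    # Forward propagation; the column of ones is the bias unit.
    A1 = np.hstack([np.ones((m, 1)), X])
    A2 = np.hstack([np.ones((m, 1)), sigmoid(A1 @ Theta1.T)])
    H = sigmoid(A2 @ Theta2.T)                     # h_Theta(x), shape (m, K)
    # The double sum over examples i and output units k.
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # The triple sum over all Theta entries, skipping the bias column j = 0.
    reg = (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2)
                             + np.sum(Theta2[:, 1:] ** 2))
    return J + reg
```

With all parameters zero, every output is $0.5$, so the unregularized cost reduces to $K\log 2$, which is a quick sanity check.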
Gradient computation: Backpropagation algorithm
$\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
Backpropagation algorithm
Training set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$) — used as accumulators for computing the partial derivatives
For $i = 1$ to $m$ (training example $(x^{(i)}, y^{(i)})$):
Set $a^{(1)} = x^{(i)}$
Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \;.\!*\; a^{(l)} \;.\!*\; (1 - a^{(l)})$
$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ (with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)
Then, outside the loop, compute the matrices $D^{(l)}$ from the accumulated $\Delta^{(l)}$:
- $D_{i,j}^{(l)} := \dfrac{1}{m}\left(\Delta_{i,j}^{(l)} + \lambda\Theta_{i,j}^{(l)}\right)$, if $j \neq 0$
- $D_{i,j}^{(l)} := \dfrac{1}{m}\Delta_{i,j}^{(l)}$, if $j = 0$
This gives $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
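The accumulation loop above can be sketched in NumPy for a one-hidden-layer network ($L = 3$). The per-example loop mirrors the notes (a fully vectorized version is usual in practice); the names (`backprop_gradients`, `Delta1`, `Delta2`) are illustrative.

```python
# Backpropagation: accumulate Delta over training examples, then compute D.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """X: (m, n) inputs; Y: (m, K) one-hot labels. Returns D1, D2."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)   # accumulators, one per Theta matrix
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation: a^(1) = x^(i), with the bias unit prepended.
        a1 = np.concatenate([[1.0], X[i]])
        a2 = np.concatenate([[1.0], sigmoid(Theta1 @ a1)])
        a3 = sigmoid(Theta2 @ a2)                 # output layer, L = 3
        # delta^(L) = a^(L) - y^(i)
        d3 = a3 - Y[i]
        # delta^(2) = ((Theta^(2))^T d3) .* a^(2) .* (1 - a^(2)); drop bias entry
        d2 = ((Theta2.T @ d3) * a2 * (1 - a2))[1:]
        # Delta^(l) := Delta^(l) + delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # D^(l): divide by m; regularize every column except the bias column j = 0.
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```

With zero weights and a single example, $\delta^{(2)}$ vanishes and $D^{(2)}$ is just $\delta^{(3)}(a^{(2)})^T$, which makes the routine easy to sanity-check by hand.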
Backpropagation Intuition
Fixed procedure:
Forward propagation:
The first layer takes $(x^{(i)}, y^{(i)})$; when propagating to the first hidden layer, compute the weighted sum $z^{(2)}$ of the input units, then apply the sigmoid activation function to obtain the activations $a^{(2)}$, and continue forward propagation in the same way.
The weights from the first hidden layer to the second hidden layer are $\Theta^{(2)}$, which gives the relation:
Third layer: $z^{(3)} = \Theta_{10}^{(2)}\times 1 + \Theta_{11}^{(2)}\times a_1^{(2)} + \dots$
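This relation is just a dot product of $\Theta^{(2)}$ with the bias-augmented activations. A tiny numeric check, with made-up shapes and values:

```python
# z^(3) = Theta^(2)_{10}*1 + Theta^(2)_{11}*a^(2)_1 + Theta^(2)_{12}*a^(2)_2
import numpy as np

Theta2 = np.array([[0.1, 0.2, 0.3]])      # Theta^(2): 1 unit, 2 inputs + bias
a2 = np.array([0.5, 0.8])                 # activations a^(2)_1, a^(2)_2
z3 = Theta2 @ np.concatenate([[1.0], a2])  # prepend the bias unit's 1
```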
Backpropagation runs in the opposite direction.
In the cost function:
$J(\theta) = -\frac{1}{m}\sum_{t=1}^{m}\left[ y_k^{(t)}\log(h_\theta(x^{(t)}))_k + (1-y_k^{(t)})\log(1-(h_\theta(x^{(t)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\theta_{j,i}^2$
$cost(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1-y^{(t)})\log(1-h_\Theta(x^{(t)}))$ can be viewed as a kind of squared-error term ($cost(t) \approx (h_\Theta(x^{(t)}) - y^{(t)})^2$).
Intuitively:
$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$ (for $j \geq 0$)
That is, $\delta_j^{(l)}$ is the error in the activation of unit $j$ in layer $l$.
Computation: start from the last layer; for a four-layer network, $\delta_1^{(4)} = a^{(4)} - y$ (predicted value minus actual value).
Then propagate backward to compute $\delta^{(3)}$ and so on, using the same recurrence as above:
$\delta^{(3)} = ((\Theta^{(3)})^T \delta^{(4)}) \;.\!*\; a^{(3)} \;.\!*\; (1 - a^{(3)})$
Note: $\delta$ values are computed only for the output and hidden units, excluding the bias units.