Cost Function and Backpropagation

Cost Function

Neural Network (Classification)

Given a training set of $m$ examples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$

$L$ = total number of layers in the network

$s_l$ = number of units (not counting the bias unit) in layer $l$

$\underline{\text{Binary classification}}$

$y = 0$ or $1$

1 output unit

$\underline{\text{Multi-class classification}}$ ($K$ classes)

$y \in \mathbb{R}^K$

$K$ output units
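
For example, with $K = 4$ classes the label $y^{(i)}$ is represented as a one-hot vector, so an example belonging to class 3 has

$$y^{(i)} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

and each of the $K$ output units is trained to predict one of the classes.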

Cost function

Logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Neural network:

$h_\Theta(x) \in \mathbb{R}^K$, $(h_\Theta(x))_i = i^{\text{th}}$ output

$$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log\left(1-(h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

Note:

  • The double sum simply adds up the logistic regression costs calculated for each cell in the output layer.
  • The triple sum simply adds up the squares of all the individual $\Theta$s in the entire network.
  • The $i$ in the triple sum does not refer to training example $i$.
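
As an illustration of how this cost can be evaluated, here is a minimal NumPy sketch for a 3-layer network (input, one hidden layer, output). The function name `nn_cost`, the shapes, and the variable names are assumptions made for this example, not part of the course material:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost J(Theta) for an input -> hidden -> output network.

    X: (m, n) inputs without the bias column, Y: (m, K) one-hot labels,
    Theta1: (s2, n+1), Theta2: (K, s2+1). Names and shapes are illustrative.
    """
    m = X.shape[0]

    # Forward propagation
    a1 = np.hstack([np.ones((m, 1)), X])                      # add bias unit
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ Theta1.T)]) # add bias unit
    h = sigmoid(a2 @ Theta2.T)                                # h_Theta(x), shape (m, K)

    # Double sum over the m examples and the K output units
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m

    # Triple sum over all Theta entries, excluding the bias columns (j = 0)
    reg = (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))

    return cost + reg
```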

Gradient computation: Backpropagation algorithm

$\delta_j^{(l)}$ = “error” of node $j$ in layer $l$

Backpropagation algorithm

Training set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$

Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$) (these serve as accumulators for computing the partial derivatives)

For $i = 1$ to $m$ (i.e., for each training example $(x^{(i)}, y^{(i)})$):

Set $a^{(1)} = x^{(i)}$

Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$

Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) \;.*\; a^{(l)} \;.*\; (1 - a^{(l)})$, where $a^{(l)} .* (1 - a^{(l)})$ equals $g'(z^{(l)})$ for the sigmoid activation

$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_{j}^{(l)}\,\delta_i^{(l+1)}$ (with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)

After the loop, compute the $D$ matrices from the accumulated $\Delta$ values:

  • $D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\,\Theta^{(l)}_{i,j}\right)$, if $j \neq 0$
  • $D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$, if $j = 0$

This gives the partial derivatives: $\dfrac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
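
Putting the steps above together, here is a minimal NumPy sketch of the accumulation loop for the same illustrative 3-layer network as before (function and variable names are assumptions made for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Compute D1, D2 (gradients of J) for an input -> hidden -> output network.

    X: (m, n) inputs without bias, Y: (m, K) one-hot labels,
    Theta1: (s2, n+1), Theta2: (K, s2+1). Names and shapes are illustrative.
    """
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)   # accumulators, set to 0 for all l, i, j
    Delta2 = np.zeros_like(Theta2)

    for t in range(m):
        # Forward propagation for one example
        a1 = np.concatenate(([1.0], X[t]))          # a(1) = x(t), plus bias
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))   # a(2), plus bias
        a3 = sigmoid(Theta2 @ a2)                   # a(3) = h_Theta(x)

        # Backward pass
        d3 = a3 - Y[t]                                     # delta(L) = a(L) - y
        d2 = (Theta2.T @ d3)[1:] * a2[1:] * (1 - a2[1:])   # drop the bias term

        # Delta(l) := Delta(l) + delta(l+1) * a(l)^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)

    # D matrices: regularize every column except the bias column (j = 0)
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```

The returned `D1` and `D2` hold the partial derivatives $\partial J / \partial \Theta^{(l)}_{ij}$, which can then be fed to a gradient-based optimizer or compared against numerical gradient checking.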

Backpropagation Intuition

The procedure is fixed.

Forward propagation:

Starting from the first layer with $(x^{(i)}, y^{(i)})$, when propagating to the first hidden layer we compute the weighted sum of the input units, $z^{(2)}$, then pass it through the sigmoid activation function to obtain the activations $a^{(2)}$, and continue forward propagation in the same way.

The weights from the first hidden layer to the second hidden layer are $\Theta^{(2)}$, which gives the relation for the third layer:

$z^{(3)}_1 = \Theta^{(2)}_{10} \times 1 + \Theta^{(2)}_{11} \times a^{(2)}_1 + \dots$
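
As a concrete sketch, if layer 2 has two units plus the bias (an assumption made here just for illustration), the expanded expression and the corresponding activation are:

$$z^{(3)}_1 = \Theta^{(2)}_{10}\cdot 1 + \Theta^{(2)}_{11}\,a^{(2)}_1 + \Theta^{(2)}_{12}\,a^{(2)}_2, \qquad a^{(3)}_1 = g\left(z^{(3)}_1\right) = \frac{1}{1 + e^{-z^{(3)}_1}}$$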

Backpropagation works in the opposite direction.

In the cost function expression:

$$J(\Theta) = -\frac{1}{m}\sum_{t=1}^{m}\sum_{k=1}^{K}\left[y^{(t)}_k\log(h_\Theta(x^{(t)}))_k + (1-y^{(t)}_k)\log\left(1-(h_\Theta(x^{(t)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta^{(l)}_{j,i}\right)^2$$

the per-example term $cost(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1-y^{(t)})\log(1-h_\Theta(x^{(t)}))$ can be viewed as playing a role similar to a squared-error term ($cost(t) \approx (h_\Theta(x^{(t)}) - y^{(t)})^2$), i.e., it measures how well the network fits example $t$.

Intuitively, we can see that:

$\delta^{(l)}_j = \dfrac{\partial}{\partial z^{(l)}_j} cost(t)$ (for $j \geq 0$)

That is, $\delta^{(l)}_j$ is the error associated with the activation of unit $j$ in layer $l$.

Computation: start from the last layer; assuming layer 4 is the output layer, $\delta_1^{(4)} = a_1^{(4)} - y$ (the predicted value minus the actual value).

Then propagate back toward the earlier layers to compute $\delta^{(3)}$ and so on. The computation is:

$\delta^{(3)} = (\Theta^{(3)})^T\,\delta^{(4)}$ (in this intuition, each $\delta^{(l)}_j$ is a weighted sum of the $\delta^{(l+1)}$ values it feeds into; the full formula in the algorithm above also multiplies by $g'(z^{(3)})$)

Notice: the $\delta$ values are computed only for the hidden units, excluding the bias units.
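
As a minimal worked sketch of this backward step (assuming, for illustration, two units in layer 3 and a single output unit in layer 4, and omitting the $g'$ factor as in the intuition above), each hidden unit's error is the weighted sum of the layer-4 errors it feeds into:

$$\delta^{(3)}_1 = \Theta^{(3)}_{11}\,\delta^{(4)}_1, \qquad \delta^{(3)}_2 = \Theta^{(3)}_{12}\,\delta^{(4)}_1$$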
