Darknet Forward Prediction and Backpropagation

This article has no figures; for much of this material, figures would actually make things harder to explain.
For forward and backward propagation I only work through the computation by analogy with gradient descent and relate it to the source code. I do not analyze why this works or how well it works; that is beyond me, and I do not pretend to know what each layer of the network is actually doing. It remains something of a black art.

The directional derivative of $f$ along a unit direction with direction cosines $\beta_k$ is:
$$\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos \beta_k$$
where:
$$\sum_{k=1}^n\cos^2\beta_k=1$$
then, by the Cauchy-Schwarz inequality:
$$\Big(\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos \beta_k\Big)^2\leq \Big(\sum_{k=1}^{n}\Big(\frac{\partial f(x)}{\partial x_k}\Big)^2\Big)\Big(\sum_{k=1}^{n}\cos^2\beta_k\Big)$$
Gradient descent is simple. From the Cauchy-Schwarz inequality and the theory of total differentials we know that, for a differentiable function, the gradient direction is the direction of fastest increase. "Fastest" here is instantaneous: it only applies at the current point, and the opposite direction is then the direction of fastest decrease. Moving in a direction orthogonal to the gradient amounts to moving along a level surface, where the function value does not change.
Searching along the negative gradient direction guarantees that the function value decreases until the gradient vanishes; when it does, the algorithm stops at a local minimum. Some people claim that no optimal result exists for the training set, but that is not the right way to put it. Sometimes there is no closed-form solution, yet an optimum still exists, because the $loss$ function is bounded below.
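As a minimal sketch of the resulting update rule $w \leftarrow w - \eta\nabla f(w)$, here is a toy gradient descent on a quadratic function (my own illustration, not Darknet code):

#include <stdio.h>

/* Toy gradient descent on f(x, y) = (x - 1)^2 + (y + 2)^2.
   Only illustrates the update w <- w - lr * grad. */
int main(void)
{
    float x = 5.0f, y = 5.0f;
    float lr = 0.1f;                   /* learning rate */
    for (int i = 0; i < 100; ++i) {
        float gx = 2.0f * (x - 1.0f);  /* df/dx */
        float gy = 2.0f * (y + 2.0f);  /* df/dy */
        x -= lr * gx;                  /* step against the gradient */
        y -= lr * gy;
    }
    printf("x = %f, y = %f\n", x, y);  /* approaches the minimum (1, -2) */
    return 0;
}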

A fully connected layer can actually be viewed as a special kind of convolutional layer. You could likewise call the convolutional layer a "not-fully-connected" layer; I think that name fits rather well.

Fully connected layer

Forward pass of the fully connected layer

(If you do not know what a fully connected layer is, you can search Baidu; there are plenty of introductions.)
Let $c_t$ be the number of neurons in layer $t$. The fully connected operation, written as a matrix product:
$$\left[\begin{matrix} w_t(0,0)&w_t(0,1)&\dots&w_t(0,c_{t-1})\\ w_t(1,0)&w_t(1,1)&\dots&w_t(1,c_{t-1})\\ \vdots&\vdots&\ddots &\vdots\\ w_t(c_t,0)&w_t(c_t,1)&\dots&w_t(c_t,c_{t-1}) \end{matrix}\right]\left[\begin{matrix}y_{t-1}(0)\\ y_{t-1}(1)\\ \vdots\\ y_{t-1}(c_{t-1}) \end{matrix}\right]=v_t$$
Here $y_{t-1}$ is the output of the previous layer, $w_t$ are the synaptic weights, and $v_t$ is the not-yet-activated signal (the induced local field).

Write:
$$\varphi_t(v_t)=\left[\begin{matrix}\varphi_t(v_{t}(0))\\ \varphi_t(v_{t}(1))\\ \vdots\\ \varphi_t(v_{t}(c_t)) \end{matrix}\right]$$
where $\varphi_t$ is the activation function of layer $t$.
In compact matrix form:
$$v_t=w_ty_{t-1}\\ y_t=\varphi_t(v_t)$$

These operations correspond directly to the code:

void forward_connected_layer(connected_layer l, network_state state)
{
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);
    int m = l.batch;
    int k = l.inputs;
    int n = l.outputs;
    float *a = state.input;
    float *b = l.weights;
    float *c = l.output;
    gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n);
    activate_array(l.output, l.outputs*l.batch, l.activation);
}

The code above has the batch normalization part removed; you can ignore that part for now.
Here the gemm function performs the matrix multiplication. batch is a training parameter (look it up if needed); at prediction time it is forced to 1.
The first two arguments of gemm indicate whether each matrix should be transposed. $m$ is the number of samples, and $n, k$ are the dimensions. The result is stored in c, that is, in l.output.
The correspondence is:
$$c : v_t,\qquad \texttt{state.input} : y_{t-1},\qquad \texttt{activate\_array}() : \varphi(v_t)$$
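As a rough illustration of what that gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n) call computes, here is a simplified sketch assuming row-major storage (not Darknet's actual gemm implementation):

/* Sketch: C += A * B^T, with A of shape m x k (the batch of inputs),
   B of shape n x k (the weights), C of shape m x n (the outputs v_t),
   all row-major. Illustration only, not Darknet's gemm. */
void gemm_nt_sketch(int m, int n, int k,
                    const float *a, const float *b, float *c)
{
    for (int i = 0; i < m; ++i) {          /* each sample in the batch */
        for (int j = 0; j < n; ++j) {      /* each output neuron       */
            float sum = 0;
            for (int p = 0; p < k; ++p)    /* each input feature       */
                sum += a[i*k + p] * b[j*k + p];
            c[i*n + j] += sum;             /* v_t(j) for sample i      */
        }
    }
}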

Backpropagation through the fully connected layer

Here we take $batch=1$ and define a cost function for each layer:
$$J_t=\frac{1}{2}\sum_{k=0}^{c_t}(d_t(k)-y_t(k))^2=\frac{1}{2}[e_t,e_t]\\ e_t(k)=d_t(k)-y_t(k)$$
Here $d_t(k)$ is the desired output of layer $t$. For the final output layer, this desired output is simply our labeled data.
Let $t$ be the output layer; then we can directly obtain the error $e_t$ between the desired output and the actual output.
The weight gradient $\nabla J_t$ of this outermost layer is then easy to compute:
$$\frac{\partial J_t}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_t(a,b)}\\ =-e_t(a)\varphi'_t(v_t(a))\frac{\partial \sum_{i}w_{t}(a,i)y_{t-1}(i)}{\partial w_t(a,b)}\\ =-e_t(a)\varphi'_t(v_t(a))y_{t-1}(b)$$
But that is only the outermost layer. For the inner layers, although we assumed a desired output for every layer, the desired output of an inner layer is unknown; this is a limitation of how little we currently understand the network.
In practice the algorithm propagates the outermost error back and uses it as the error of the inner layers.
Continuing from the computation above, write: $g_t(a)=e_t(a)\varphi'_t(v_t(a))$
Then: $\frac{\partial J_t}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)$
Now consider computing:
$$\frac{\partial J_{t}}{\partial w_{t-1}(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\frac{\partial \sum_iw_t(k,i)y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}\frac{\partial w_t(k,i)y_{t-1}(i)}{\partial y_{t-1}(i)}\frac{\partial y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}w_{t}(k,i)\varphi'_{t-1}(v_{t-1}(i))\frac{\partial \sum_jw_{t-1}(i,j)y_{t-2}(j)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_{t}(k)w_t(k,a)\varphi'_{t-1}(v_{t-1}(a))y_{t-2}(b)\\ =y_{t-2}(b)\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}-g_t(k)w_t(k,a)$$

Hence:
$$g_{t-1}(a)=\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}g_t(k)w_t(k,a)\\ \frac{\partial J_t}{\partial w_{t-1}(a,b)}=-g_{t-1}(a)y_{t-2}(b)$$
This relation is not limited to layer $t-1$ with $t$ as the output layer; it keeps propagating backwards.
Consider the gradient of the $l$-th fully connected layer, taking the loss of the output layer $t$ as the objective:
$$\frac{\partial J_t}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\frac{\partial y_l(a)}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\varphi_l'(v_l(a))y_{l-1}(b)$$
Because the errors are mutually independent, and recalling the forward pass, we have (this is one of the hard-to-follow steps, explained at the end of the article):
$$\frac{\partial J_t}{\partial y_l(a)} = \sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial y_l(a)}=\sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial v_{l+1}(k)}w_{l+1}(k,a)$$
where:
$$\frac{\partial J_t}{\partial w_{l}(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}\frac{\partial v_l(k)}{\partial w_l(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}y_{l-1}(a)$$
This effectively redefines $g$, and we get:
$$\frac{\partial J_t}{\partial w_l(k,a)}=-g_l(k)y_{l-1}(a)$$
To summarize: in the derivation above, $t$ was taken as the output layer; in the summary below, $n$ denotes the output layer:
$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\ g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Note that $g_t$ here is not the error itself; it implicitly carries the propagated error.

Matrix form of backpropagation

$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\ g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Based on these relations, let:
$$g_t = \left[\begin{matrix}g_{t}(0)\\g_{t}(1)\\ \vdots\\g_{t}(c_t)\end{matrix}\right]$$
with all vectors taken as column vectors.
Then:
$$\frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T\\ \ \\ g_t=\mathrm{diag}(\varphi'_t(v_t))\,w_{t+1}^Tg_{t+1}=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)$$
The operator $\bigodot$ denotes elementwise multiplication of two matrices of the same shape: $c_{i,j}=a_{i,j}b_{i,j}$.
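These two matrix relations are easy to turn into code. Below is a rough sketch for a single fully connected layer with batch = 1 and biases omitted (my own simplified illustration; it is not Darknet's backward_connected_layer):

/* Matrix-form backward pass for one fully connected layer, batch = 1,
   biases ignored. Illustration only, not Darknet code.
     g_t      = phi'_t(v_t) (elementwise) w_{t+1}^T g_{t+1}
     dJ/dw_t  = -g_t y_{t-1}^T
   We accumulate +g_t y_{t-1}^T into weight_updates and later add
   lr * weight_updates to the weights, which steps against the gradient. */
void fc_backward_sketch(int c_t, int c_next, int c_prev,
                        const float *dphi_vt,    /* phi'_t(v_t), length c_t          */
                        const float *w_next,     /* w_{t+1}, c_next x c_t, row-major */
                        const float *g_next,     /* g_{t+1}, length c_next           */
                        const float *y_prev,     /* y_{t-1}, length c_prev           */
                        float *g_t,              /* out: g_t, length c_t             */
                        float *weight_updates)   /* out: c_t x c_prev, row-major     */
{
    for (int a = 0; a < c_t; ++a) {
        float sum = 0;
        for (int k = 0; k < c_next; ++k)         /* (w_{t+1}^T g_{t+1})(a)    */
            sum += g_next[k] * w_next[k*c_t + a];
        g_t[a] = dphi_vt[a] * sum;               /* Hadamard with phi'_t(v_t) */
    }
    for (int a = 0; a < c_t; ++a)
        for (int b = 0; b < c_prev; ++b)
            weight_updates[a*c_prev + b] += g_t[a] * y_prev[b];  /* (g_t y_{t-1}^T)(a,b) */
}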

Convolutional layer (the not-fully-connected layer)

void forward_convolutional_layer(convolutional_layer l, network_state state)
{
    int out_h = convolutional_out_height(l);
    int out_w = convolutional_out_width(l);
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);
    int m = l.n;
    int k = l.size*l.size*l.c;
    int n = out_h*out_w;
    float *a = l.weights;
    float *b = state.workspace;
    float *c = l.output;
    static int u = 0;
    u++;
    for(i = 0; i < l.batch; ++i){
        im2col_cpu_custom(state.input, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
        gemm(0, 0, m, n, k, 1, a, k, b, n, 1, c, n);
        c += n*m;
        state.input += l.c*l.h*l.w;
    }
    add_bias(l.output, l.biases, l.batch, l.n, out_h*out_w);
    activate_array_cpu_custom(l.output, m*n*l.batch, l.activation);
}

The key operation is im2col_cpu_custom.
It rearranges the previous layer's output into a matrix of shape $[(size)^2\, l.c] \times [out\_h\times out\_w]$, which serves as the current layer's input $y$.
The current layer's weights $w$ have shape $filters \times (size)^2\, l.c$.
Here $filters$ is the number of channels fed to the next layer; you can also think of it as the number of neurons or the number of convolution kernels.
$size$ is the side length of the convolution kernel.
With this arrangement the picture is simple: convolution becomes a matrix multiplication, and $w\times y$ produces an output of shape $filters\times out\_h\times out\_w$.
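For intuition, here is a naive sketch of what an im2col transform does (simplified, no padding; this is not the actual im2col_cpu_custom):

/* Naive im2col sketch: turn an input of shape channels x height x width
   into a matrix of shape (channels*ksize*ksize) x (out_h*out_w), so that
   the convolution becomes one matrix multiply. No padding handled;
   illustration only, not Darknet's im2col_cpu_custom. */
void im2col_sketch(const float *im, int channels, int height, int width,
                   int ksize, int stride, float *col)
{
    int out_h = (height - ksize) / stride + 1;
    int out_w = (width  - ksize) / stride + 1;
    for (int c = 0; c < channels; ++c)
        for (int kh = 0; kh < ksize; ++kh)
            for (int kw = 0; kw < ksize; ++kw) {
                int row = (c*ksize + kh)*ksize + kw;      /* row in the col matrix */
                for (int oh = 0; oh < out_h; ++oh)
                    for (int ow = 0; ow < out_w; ++ow) {
                        int ih = oh*stride + kh;          /* source pixel          */
                        int iw = ow*stride + kw;
                        col[row*out_h*out_w + oh*out_w + ow] =
                            im[(c*height + ih)*width + iw];
                    }
            }
}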
Now consider the outermost weight gradient when the convolutional layer is the output layer:
$$\frac{\partial J_n}{\partial w_n(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_n(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_n(k,m)}\frac{\partial v_n(k,m)}{\partial w_n(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_{i}w_n(k,i)y_{n-1}(i,m)}{\partial w_n(a,b)}\\ =\sum_{m}-e_n(a,m)\varphi'_n(v_n(a,m))y_{n-1}(b,m)$$

For layer $n-1$, still taking the outermost loss as the objective function:
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_{n}(k,m)}\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_iw_n(k,i)y_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_i\frac{\partial w_n(k,i)y_{n-1}(i,m)}{\partial v_{n-1}(i,m)}\frac{\partial v_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_iw_n(k,i)\varphi'_{n-1}(v_{n-1}(i,m))\frac{\partial \sum_{j}w_{n-1}(i,j)y_{n-2}(j,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Based on the computation above, let:
$$g_n(a,m)=e_n(a,m)\varphi'_n(v_n(a,m))$$
Then:
$$\frac{\partial J_n}{\partial w_n(a,b)}=\sum_{m}-g_n(a,m)y_{n-1}(b,m)$$
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{k}\sum_{m}-g_n(k,m)w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)\\ =\sum_{m}\Big(\sum_{k}-g_n(k,m)w_n(k,a)\Big)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Here we let:
$$g_{n-1}(a,m)=\varphi'_{n-1}(v_{n-1}(a,m))\sum_{k}g_n(k,m)w_{n}(k,a)$$
so that:
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{m}-g_{n-1}(a,m)y_{n-2}(b,m)$$
In fact the propagating property of $g$ still holds. Redefine $g$ as follows:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial w_{t}(a,b)}\\ =\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial v_{t}(k,m)}\frac{\partial v_t(k,m)}{\partial w_t(a,b)}\\ =\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
Here, let $g_t(a,m)=-\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}$ (the minus sign keeps the sign convention used above).
Clearly this is consistent with the cases $t=n$ and $t=n-1$.
By induction:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)=\sum_{m}-g_t(a,m)y_{t-1}(b,m)$$
For the following (the other hard-to-follow step; again see the explanation at the end):
$$\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}=\frac{\partial J_n}{\partial y_{t}(a,m)}\varphi_{t}'(v_{t}(a,m))\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}\frac{\partial v_{t+1}(i,m)}{\partial y_t(a,m)}\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}w_{t+1}(i,a)$$
Hence: $g_t(a,m)=\varphi'_t(v_t(a,m))\sum_{k}g_{t+1}(k,m)w_{t+1}(k,a)$
The convolutional backward pass can also be written in matrix form:
$$g_t=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)\\ \frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T$$
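Plugging in the im2col matrix from the forward pass, the weight gradient is again a single matrix product. A rough sketch of that accumulation follows (batch = 1, illustration only; not Darknet's backward_convolutional_layer):

/* Convolutional weight gradient in matrix form: dJ/dw_t = -g_t * y_{t-1}^T,
   where y_{t-1} is the im2col'd input ((ksize*ksize*channels) x positions)
   and g_t has shape (filters x positions). As in the FC sketch, we accumulate
   +g_t * col^T and later step by +lr * updates, which descends the gradient.
   Illustration only, not Darknet code. */
void conv_weight_grad_sketch(int filters, int ksq_c, int positions,
                             const float *g,        /* filters x positions        */
                             const float *col,      /* ksq_c x positions (im2col) */
                             float *weight_updates) /* filters x ksq_c            */
{
    for (int f = 0; f < filters; ++f)
        for (int r = 0; r < ksq_c; ++r) {
            float sum = 0;
            for (int m = 0; m < positions; ++m)
                sum += g[f*positions + m] * col[r*positions + m];
            weight_updates[f*ksq_c + r] += sum;      /* (g_t y_{t-1}^T)(f, r) */
        }
}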

Convolutional and fully connected layers are updated in the same way.

Here is an explanation for the parts flagged above as hard to follow:
You can think of them in terms of a continuous function $h(x_1,x_2,\dots,x_n)$.

Let $f(t)=h(t,t,\dots,t)$. Then:
$$\frac{df(t)}{dt}=\sum_{k=1}^n\frac{\partial h}{\partial x_k}\Big|_{x_1=\dots=x_n=t}$$
The output $y_t(a)$ of the $a$-th neuron in layer $t$ influences the loss of the output layer $n$ along many dimensions; each of those dimensions just happens to take the single variable $y_t(a)$ as its input, and this influence is again differentiable. Hence:
$$\frac{\partial J_n}{\partial y_t(a)}=\sum_{k=0}^{c_{t+1}}\frac{\partial J_n}{\partial y_{t+1}(k)}\frac{\partial y_{t+1}(k)}{\partial y_t(a)}$$
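As a tiny concrete instance of this rule (my own example, not part of the original derivation): take $h(x_1,x_2)=x_1x_2$, so that $f(t)=h(t,t)=t^2$ and
$$\frac{df(t)}{dt}=2t=\underbrace{x_2}_{\partial h/\partial x_1}\Big|_{x_1=x_2=t}+\underbrace{x_1}_{\partial h/\partial x_2}\Big|_{x_1=x_2=t},$$
which is exactly the sum of the partial derivatives along each path.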
