Basics
In general we adopt a mixed-layout convention: when a vector or matrix is differentiated with respect to a scalar, the numerator layout is used; when a scalar is differentiated with respect to a vector or matrix, the denominator layout is used. For vector-by-vector derivatives conventions diverge; this article mainly uses the numerator-layout Jacobian matrix.
- Derivative of a scalar with respect to a matrix (denominator layout):
$$df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial X_{ij}}\,dX_{ij} = \mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{X}}\right)^T d\mathbf{X}\right)$$
Because matrix multiplication pairs rows with columns, $\frac{\partial f}{\partial \mathbf{X}}$ has to be transposed here so that $\frac{\partial f}{\partial X_{ij}}$ meets $dX_{ij}$ with matching subscripts inside the trace.
So when computing the derivative of a scalar with respect to some matrix, first rewrite $df$ as the trace of a matrix expression.
- Common formulas:
$$d(X+Y)=dX+dY,\qquad d(X-Y)=dX-dY$$
$$d(XY)=(dX)Y+X(dY)$$
$$d(X^T)=(dX)^T$$
$$d\,\mathrm{tr}(X)=\mathrm{tr}(dX)$$
$$d(X\odot Y)=X\odot dY+dX\odot Y$$
$$d\,\sigma(X)=\sigma'(X)\odot dX\quad\text{(elementwise scalar function)}$$
$$dX^{-1}=-X^{-1}\,dX\,X^{-1}$$
$$d|X|=|X|\,\mathrm{tr}(X^{-1}dX)$$
$$\mathrm{tr}(x)=x\quad\text{(for scalar }x\text{)}$$
$$\mathrm{tr}(A^T)=\mathrm{tr}(A)$$
$$\mathrm{tr}(AB)=\mathrm{tr}(BA)$$
$$\mathrm{tr}(X+Y)=\mathrm{tr}(X)+\mathrm{tr}(Y),\qquad \mathrm{tr}(X-Y)=\mathrm{tr}(X)-\mathrm{tr}(Y)$$
$$\mathrm{tr}((A\odot B)^TC)=\mathrm{tr}(A^T(B\odot C))$$
- Example:
$$dy=\mathrm{tr}(dy)=\mathrm{tr}(\mathbf{a}^Td\exp(\mathbf{X}\mathbf{b}))=\mathrm{tr}(\mathbf{a}^T(\exp(\mathbf{X}\mathbf{b})\odot d(\mathbf{X}\mathbf{b})))=\mathrm{tr}((\mathbf{a}\odot \exp(\mathbf{X}\mathbf{b}))^T d\mathbf{X}\,\mathbf{b})=\mathrm{tr}(\mathbf{b}(\mathbf{a}\odot \exp(\mathbf{X}\mathbf{b}))^T d\mathbf{X})$$
Hence, by the trace identity above, $\frac{\partial y}{\partial \mathbf{X}}=(\mathbf{a}\odot\exp(\mathbf{X}\mathbf{b}))\,\mathbf{b}^T$.
- Chain rule for a scalar with respect to vectors (mixed layout):
$$\frac{\partial z}{\partial \mathbf{y}_1} = \left(\frac{\partial \mathbf{y}_n}{\partial \mathbf{y}_{n-1}} \frac{\partial \mathbf{y}_{n-1}}{\partial \mathbf{y}_{n-2}} \cdots\frac{\partial \mathbf{y}_2}{\partial \mathbf{y}_1}\right)^T\frac{\partial z}{\partial \mathbf{y}_n}$$
Each $\mathbf{y}_i$ here is a column vector.
- Chain rule for a scalar with respect to matrices:
$$z= f(Y),\quad Y=AX+B \;\to\; \frac{\partial z}{\partial X} = A^T\frac{\partial z}{\partial Y}$$
$$z= f(Y),\quad Y=XA+B \;\to\; \frac{\partial z}{\partial X} = \frac{\partial z}{\partial Y}A^T$$
Proof:
$$dz=\mathrm{tr}(df(\mathbf{Y}))=\mathrm{tr}(f'(\mathbf{Y})\odot(\mathbf{A}d\mathbf{X}))=\mathrm{tr}((f'(\mathbf{Y})\odot\mathbf{A}d\mathbf{X})^{T}\mathbf{E})=\mathrm{tr}(f'(\mathbf{Y})^{T}(\mathbf{A}d\mathbf{X}\odot\mathbf{E}))=\mathrm{tr}((\mathbf{A}^{T}f'(\mathbf{Y}))^{T}d\mathbf{X})$$
where $\mathbf{E}$ denotes the all-ones matrix, so that $\mathbf{A}d\mathbf{X}\odot\mathbf{E}=\mathbf{A}d\mathbf{X}$ and the Hadamard trace identity above can be applied.
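These trace manipulations are easy to sanity-check numerically. Below is a minimal C++ finite-difference sketch for the earlier example $y=\mathbf{a}^T\exp(\mathbf{X}\mathbf{b})$, comparing the analytic gradient $(\mathbf{a}\odot\exp(\mathbf{X}\mathbf{b}))\,\mathbf{b}^T$ against central differences (the $2\times2$ size and all values are made up for illustration):

```cpp
#include <cstdio>
#include <cmath>

// y = a^T exp(X b) for a 2x2 X; exp is applied elementwise
double y(const double X[2][2], const double a[2], const double b[2]) {
  double out = 0;
  for (int i = 0; i < 2; ++i)
    out += a[i] * std::exp(X[i][0] * b[0] + X[i][1] * b[1]);
  return out;
}

int main() {
  double X[2][2] = {{0.1, -0.3}, {0.7, 0.2}};
  double a[2] = {1.5, -0.5}, b[2] = {0.4, 0.9};
  const double h = 1e-6;
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j) {
      // analytic: dy/dX_ij = a_i * exp((Xb)_i) * b_j
      double Xb = X[i][0] * b[0] + X[i][1] * b[1];
      double analytic = a[i] * std::exp(Xb) * b[j];
      // numeric central difference in the (i, j) entry
      double saved = X[i][j];
      X[i][j] = saved + h; double yp = y(X, a, b);
      X[i][j] = saved - h; double ym = y(X, a, b);
      X[i][j] = saved;
      std::printf("dy/dX[%d][%d]: analytic %.6f numeric %.6f\n",
                  i, j, analytic, (yp - ym) / (2 * h));
    }
}
```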
- DNN backpropagation, derivative method (mixed layout)
$$a^l = \sigma(z^l) = \sigma(W^la^{l-1} + b^l)$$
First compute the error of the output layer, $\delta^L = \frac{\partial J}{\partial z^L} = \frac{\partial J}{\partial a^L}\odot\sigma'(z^L)$, then recurse backwards layer by layer:
$$\delta^{l} =(W^{l+1})^T\delta^{l+1}\odot \sigma'(z^l)$$
where, by the rules for vector-by-vector derivatives (numerator layout),
$$\frac{\partial z^{l+1}}{\partial z^{l}} = W^{l+1}\,\mathrm{diag}(\sigma'(z^l))$$
The derivatives with respect to the parameters are then
$$\frac{\partial J}{\partial W^l} = \delta^{l}(a^{l-1})^T,\qquad \frac{\partial J}{\partial b^l} = \delta^{l}$$
- Differential method
$$da^{l}=g'(\mathbf{z}^{l})\odot d\mathbf{z}^{l}$$
$$dz^{l+1}=\mathbf{W}^{l+1}da^{l}=\mathbf{W}^{l+1}(g'(\mathbf{z}^{l})\odot d\mathbf{z}^{l})$$
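To make the recursion and the parameter gradients concrete, here is a minimal C++ sketch of one backward step for a fully connected layer, assuming sigmoid activations; the `Vec`/`Mat` aliases and function names are illustrative, not from any library:

```cpp
#include <vector>
#include <cmath>
using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;  // row-major

// sigma'(z) for the sigmoid, evaluated elementwise
double dsigmoid(double z) {
  double s = 1.0 / (1.0 + std::exp(-z));
  return s * (1 - s);
}

// delta^l = (W^{l+1})^T delta^{l+1}  (Hadamard)  sigma'(z^l)
Vec backward_delta(const Mat& W_next, const Vec& delta_next, const Vec& z) {
  Vec delta(z.size(), 0.0);
  for (size_t j = 0; j < z.size(); ++j) {             // j indexes layer l
    for (size_t i = 0; i < delta_next.size(); ++i)    // i indexes layer l+1
      delta[j] += W_next[i][j] * delta_next[i];       // (W^T delta)_j
    delta[j] *= dsigmoid(z[j]);                       // elementwise product
  }
  return delta;
}

// dJ/dW^l = delta^l (a^{l-1})^T,  dJ/db^l = delta^l
void param_grads(const Vec& delta, const Vec& a_prev, Mat& dW, Vec& db) {
  dW.assign(delta.size(), Vec(a_prev.size()));
  for (size_t i = 0; i < delta.size(); ++i)
    for (size_t j = 0; j < a_prev.size(); ++j)
      dW[i][j] = delta[i] * a_prev[j];                // outer product
  db = delta;
}
```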
CNN backpropagation
- Error propagation through pooling (a C++ sketch follows this list)
For the error
$$\delta_k^l = \begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}$$
if max pooling was used, upsampling places each value back at the position where the forward pass recorded the maximum (here assumed to be the interior positions):
$$\begin{pmatrix} 0&0&0&0 \\ 0&2&8&0 \\ 0&4&6&0 \\ 0&0&0&0 \end{pmatrix}$$
if average pooling was used, upsampling spreads each error entry evenly over its $2\times2$ window:
$$\begin{pmatrix} 0.5&0.5&2&2 \\ 0.5&0.5&2&2 \\ 1&1&1.5&1.5 \\ 1&1&1.5&1.5 \end{pmatrix}$$
$$\delta_k^{l-1} = \left(\frac{\partial a_k^{l-1}}{\partial z_k^{l-1}}\right)^T\frac{\partial J(W,b)}{\partial a_k^{l-1}} = \mathrm{upsample}(\delta_k^l) \odot \sigma'(z_k^{l-1})$$
- Error propagation through convolution (see the sketch after this list)
$$z^l = a^{l-1}*W^l +b$$
$$\delta^{l-1} = \left(\frac{\partial z^{l}}{\partial z^{l-1}}\right)^T\delta^{l} = \delta^{l}*\mathrm{rot180}(W^{l}) \odot \sigma'(z^{l-1})$$
If the stride is not 1, zeros are inserted between the entries of $\delta^{l}$ to restore the unit-stride spacing. If the padding is not 0, the border that was added in the forward pass is cropped from the recovered gradient.
- Gradients of the convolution layer's $W$ and $b$
$$\frac{\partial J(W,b)}{\partial W^{l}}=a^{l-1} *\delta^l$$
If the stride is not 1, gaps are likewise filled into $\delta^{l}$.
$$\frac{\partial J(W,b)}{\partial b^{l}} = \sum\limits_{u,v}(\delta^l)_{u,v}$$
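As promised above, here is a minimal single-channel C++ sketch of both propagation rules (stride 1, no padding; all names are illustrative): `upsample_avg` spreads each pooled error evenly over its $k\times k$ window, and `conv_backward_delta` computes the full convolution $\delta^{l}*\mathrm{rot180}(W^{l})$, leaving the elementwise $\odot\,\sigma'(z^{l-1})$ factor to the caller:

```cpp
#include <vector>
using Mat = std::vector<std::vector<double>>;

// Average-pooling upsample: spread each error over its k x k window.
Mat upsample_avg(const Mat& delta, int k) {
  Mat out(delta.size() * k, std::vector<double>(delta[0].size() * k));
  for (size_t i = 0; i < delta.size(); ++i)
    for (size_t j = 0; j < delta[0].size(); ++j)
      for (int di = 0; di < k; ++di)
        for (int dj = 0; dj < k; ++dj)
          out[i * k + di][j * k + dj] = delta[i][j] / (k * k);
  return out;
}

// delta^{l-1} (before the sigma' factor): full convolution of delta^l with
// rot180(W), i.e. delta^l implicitly zero-padded by (kH-1, kW-1) on each side.
Mat conv_backward_delta(const Mat& delta, const Mat& W) {
  int H = delta.size(), Wd = delta[0].size();
  int kH = W.size(), kW = W[0].size();
  Mat out(H + kH - 1, std::vector<double>(Wd + kW - 1, 0.0));
  for (int i = 0; i < (int)out.size(); ++i)
    for (int j = 0; j < (int)out[0].size(); ++j)
      for (int u = 0; u < kH; ++u)
        for (int v = 0; v < kW; ++v) {
          int di = i - (kH - 1) + u, dj = j - (kW - 1) + v;  // position in delta
          // rot180(W)[u][v] == W[kH-1-u][kW-1-v]
          if (di >= 0 && di < H && dj >= 0 && dj < Wd)
            out[i][j] += delta[di][dj] * W[kH - 1 - u][kW - 1 - v];
        }
  return out;
}
```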
Activation functions
For the softmax activation used in classification, the loss function is usually the log-likelihood; for class $i$ the loss is:
$$a_i^L = \frac{e^{z_i^L}}{\sum\limits_{j=1}^{n_L}e^{z_j^L}}$$
$$J(W,b,a^L,y) = -\ln a_i^L$$
The gradient with respect to the output-layer parameters, written in vector form, is:
$$\frac{\partial L}{\partial b} = a^{L} -a^{*}$$
$$\frac{\partial L}{\partial W} =(a^{L} - a^{*}) (a^{L-1})^T$$
Note that this gradient does not involve the derivative of the activation function (here $a^{*}$ is the one-hot label vector), so the softmax-plus-log-likelihood pairing avoids the saturation effects the activation would otherwise introduce.
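This is easy to verify with finite differences. The following self-contained C++ sketch (all values illustrative) checks $\delta^L = \partial J/\partial z^L = a^L - a^{*}$, from which the $W$ and $b$ gradients above follow:

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

// softmax followed by log-likelihood loss J = -ln a[target]
double loss(const std::vector<double>& z, int target) {
  double denom = 0;
  for (double zj : z) denom += std::exp(zj);
  return -(z[target] - std::log(denom));  // -ln softmax(z)[target]
}

int main() {
  std::vector<double> z = {0.3, -1.2, 2.0};
  int target = 1;  // a* is one-hot at index 1
  double denom = 0;
  for (double zj : z) denom += std::exp(zj);
  for (size_t k = 0; k < z.size(); ++k) {
    double analytic = std::exp(z[k]) / denom - (k == (size_t)target);  // a^L - a*
    std::vector<double> zp = z, zm = z;
    zp[k] += 1e-6; zm[k] -= 1e-6;
    double numeric = (loss(zp, target) - loss(zm, target)) / 2e-6;
    std::printf("dJ/dz_%zu: analytic %.6f numeric %.6f\n", k, analytic, numeric);
  }
}
```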
Derivative of tanh:
$$[\tanh(x)]'=\frac{\cosh^2(x)-\sinh^2(x)}{\cosh^2(x)} =1-\tanh^2(x)$$
Initialization
Saturated activations: according to
$$\delta^{l} =(W^{l+1})^T\delta^{l+1}\odot \sigma'(z^l),$$
when using the sigmoid function
$$\sigma(z) = \frac{1}{1+e^{-z}},$$
a $z$ that is too large or too small drives the activation's gradient to 0, which in turn drives the previous layer's error gradient to 0.
According to
$$\frac{\partial J(W,b,x,y)}{\partial W^l} = \delta^{l}(a^{l-1})^T,$$
if the previous layer's activations are 0 (i.e. $z \to -\infty$), the parameter gradients of this layer are directly driven to 0.
A good initialization should keep the variance of the activations $a$ and of the state gradients $\delta$ constant across layers during the forward and backward passes.
Xavier initialization makes the Glorot assumptions: the activation function is symmetric, and at initialization the pre-activations $z$ fall in the activation's linear region, where the local slope is 1.
So Xavier initialization is only suitable for activations such as tanh,
while He initialization was proposed specifically for ReLU.
Xavier initialization:
$$\mathcal{N}\left(0,\ \frac{2}{\mathrm{fan}_{in}+\mathrm{fan}_{out}}\right)$$
He initialization:
since ReLU passes only half of the variance on to the next layer, the variance is doubled relative to Xavier, giving $\mathcal{N}\left(0,\ \frac{2}{\mathrm{fan}_{in}}\right)$.
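Both schemes are a single draw from a normal distribution; a minimal C++ sketch using `<random>` (the helper name is illustrative):

```cpp
#include <random>
#include <vector>
#include <cmath>

// Xavier/Glorot: variance 2 / (fan_in + fan_out); He: variance 2 / fan_in.
std::vector<double> init_weights(int fan_in, int fan_out, bool relu,
                                 std::mt19937& gen) {
  double var = relu ? 2.0 / fan_in : 2.0 / (fan_in + fan_out);
  std::normal_distribution<double> dist(0.0, std::sqrt(var));  // takes stddev
  std::vector<double> w(fan_in * fan_out);
  for (double& x : w) x = dist(gen);
  return w;
}
```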
CNN implementation
im2col
The purpose of im2col is to speed up the convolution operation. To see how, first look at what the function does.
Suppose the kernel size is 2×2 and the input image is 3×3. For each window the kernel visits, im2col unrolls that window into one column (or row) of a new matrix; the number of columns (or rows) of the new matrix equals the number of convolution steps over one input image (the number of kernel placements), as illustrated below.
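Since the original figure is not reproduced here, the following worked example stands in for it, assuming a hypothetical 3×3 input with entries 1 through 9 and a 2×2 kernel at stride 1; each window becomes one column of the unrolled matrix:

$$X=\begin{pmatrix}1&2&3\\4&5&6\\7&8&9\end{pmatrix}\;\xrightarrow{\ \text{im2col, }2\times2\text{, stride }1\ }\;\begin{pmatrix}1&2&4&5\\2&3&5&6\\4&5&7&8\\5&6&8&9\end{pmatrix}$$

The convolution then becomes a single matrix product: the kernel flattened to a $1\times4$ row vector times this $4\times4$ matrix yields all four outputs at once.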
C++ implementation
```cpp
__global__ void im2col_h(const int n, const float *data_im, const int height,
                         const int width, const int kernel_h,
                         const int kernel_w, const int pad_h, const int pad_w,
                         const int stride_h, const int stride_w,
                         const int height_col, const int width_col,
                         float *data_col, int im_stride, int col_stride) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < n) {
    const int batch_idx = blockIdx.y;
    data_im += batch_idx * im_stride;    // raw image data, laid out C.H.W
    data_col += batch_idx * col_stride;  // unrolled data, laid out (C.H_k.W_k).(H_col.W_col)
    const int h_index = index / width_col;
    const int h_col = h_index % height_col;
    const int w_col = index % width_col;
    const int c_im = h_index / height_col;
    const int c_col = c_im * kernel_h * kernel_w;
    const int h_offset = h_col * stride_h - pad_h;
    const int w_offset = w_col * stride_w - pad_w;
    // channel offset
    float *data_col_ptr = data_col;
    data_col_ptr += (c_col * height_col + h_col) * width_col + w_col;
    const float *data_im_ptr = data_im;
    data_im_ptr += (c_im * height + h_offset) * width + w_offset;
    // copy to col
    for (int i = 0; i < kernel_h; ++i) {
      for (int j = 0; j < kernel_w; ++j) {
        int h_im = h_offset + i;
        int w_im = w_offset + j;
        *data_col_ptr =
            (h_im >= 0 && w_im >= 0 && h_im < height && w_im < width)
                ? data_im_ptr[i * width + j]
                : 0;
        data_col_ptr += height_col * width_col;
      }
    }
  }
}
```
Here the layout of gridDim.x · blockDim.x is one thread per element of the unrolled output, i.e. n = channels · height_col · width_col threads per image, with blockIdx.y indexing the batch.
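A plausible host-side launch under these assumptions (the 256-thread block size and the `d_im`/`d_col` buffer names are illustrative):

```cpp
// n threads: one per (channel, output row, output col) triple
const int n = channels * height_col * width_col;
const int threads = 256;
dim3 grid((n + threads - 1) / threads, batch_size);  // blockIdx.y = batch index
im2col_h<<<grid, threads>>>(n, d_im, height, width, kernel_h, kernel_w,
                            pad_h, pad_w, stride_h, stride_w,
                            height_col, width_col, d_col, im_stride, col_stride);
```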
col2im
```cpp
__global__ void col2im_h(const int n, const float *data_col, const int height,
                         const int width, const int channels,
                         const int kernel_h, const int kernel_w,
                         const int pad_h, const int pad_w, const int stride_h,
                         const int stride_w, const int height_col,
                         const int width_col, float *data_im,
                         const int im_stride, const int col_stride) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < n) {
    const int batch_idx = blockIdx.y;
    data_im += batch_idx * im_stride;
    data_col += batch_idx * col_stride;
    float val = 0;
    const int w_im = index % width + pad_w;
    const int h_im = (index / width) % height + pad_h;
    const int c_im = index / (width * height);
    // compute the start and end of the col
    const int w_col_start =
        (w_im < kernel_w) ? 0 : (w_im - kernel_w) / stride_w + 1;  // leftmost window that contains w_im
    const int w_col_end = fminf(w_im / stride_w + 1, width_col);   // exclusive bound: one past the rightmost window containing w_im
    const int h_col_start =
        (h_im < kernel_h) ? 0 : (h_im - kernel_h) / stride_h + 1;
    const int h_col_end = fminf(h_im / stride_h + 1, height_col);
    // copy to im
    for (int h_col = h_col_start; h_col < h_col_end; h_col += 1) {
      for (int w_col = w_col_start; w_col < w_col_end; w_col += 1) {
        int h_k = (h_im - h_col * stride_h);
        int w_k = (w_im - w_col * stride_w);  // position of w_im inside the window represented by w_col
        int data_col_index =
            (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + h_col) *
                width_col +
            w_col;
        val += data_col[data_col_index];
      }
    }
    data_im[index] = val;
  }
}
```
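Note the `val +=` accumulation: overlapping windows all contribute to the same image pixel, which is why col2im is the adjoint (backward pass) of im2col rather than a simple inverse.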
Deformable convolution and deformable RoI pooling
First, an ordinary convolution with 72 extra output channels learns the offsets: for a 3×3 kernel, each of the 3×3 = 9 sampling points needs an offset in both the X and Y directions ((x, y) together form a displacement vector), so 18 output channels per offset group; with 4 offset groups to predict, the final output has 72 channels. The input feature maps and these offsets are then fed together into the deformable conv layer, which shifts the sampling points by the offsets before performing the convolution.
deconvolution (transposed convolution)
It can be understood through col2im and im2col: viewing the convolution as the matrix relation
$$z= f(Y),\quad Y=AX+B \;\to\; \frac{\partial z}{\partial X} = A^T\frac{\partial z}{\partial Y},$$
the im2col/col2im reshaping does not affect this result, so the backward pass (and hence the deconvolution) is just multiplication by $A^T$.