cs231n assignment1 asks us to compute the gradient of the SVM loss. This requires some matrix calculus that the official course notes cover only briefly, and the material is not easy to grasp on first contact, so I collected the following notes to deepen my understanding:
First steps: the derivative of a scalar with respect to a matrix
The derivative of a scalar $f$ with respect to a matrix $X$ is defined as

$$\frac{\partial f}{\partial X} = \left[\frac{\partial f}{\partial X_{ij}}\right],$$

i.e. $f$ is differentiated with respect to each element of $X$, and the results are arranged into a matrix of the same shape as $X$. Consider an example:
Suppose $L = f(Y)$ with $Y = XW$, where $X$ is $2\times 2$, $W$ is $2\times 3$, $Y$ is $2\times 3$, and $L$ is a scalar.
From the definition of matrix multiplication:
$$Y_{i,j}=\sum_{k=1}^{D}X_{i,k}W_{k,j},$$

so
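This elementwise definition can be checked directly against numpy's matrix multiply; the small shapes below match the $2\times 2$ and $2\times 3$ example, with illustrative values:

```python
import numpy as np

# Y[i, j] = sum_k X[i, k] * W[k, j], written out as explicit loops,
# should agree with the built-in matrix product X @ W.
X = np.arange(4.0).reshape(2, 2)   # (2, 2)
W = np.arange(6.0).reshape(2, 3)   # (2, 3)

Y = X @ W
Y_loop = np.array([[sum(X[i, k] * W[k, j] for k in range(2))
                    for j in range(3)] for i in range(2)])
print(np.array_equal(Y, Y_loop))  # True
```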
$$\frac{\partial Y_{i,j}}{\partial X_{m,k}} = \begin{cases} 0 & i \neq m \\ W_{k,j} & i = m \end{cases}$$
Similarly,
$$\frac{\partial Y_{i,j}}{\partial W_{k,m}} = \begin{cases} 0 & j \neq m \\ X_{i,k} & j = m \end{cases} \quad (1)$$

See Vector, Matrix, and Tensor Derivatives for details.
By the chain rule, the derivative of the scalar $L$ with respect to $W_{1,1}$ is

$$\frac{\partial L}{\partial W_{1,1}}=\frac{\partial L}{\partial Y}\cdot \frac{\partial Y}{\partial W_{1,1}} \quad (2)$$
where

$$\frac{\partial L}{\partial Y}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix} \quad (3)$$

and, by formula (1),

$$\frac{\partial Y}{\partial W_{1,1}}=\begin{pmatrix} X_{1,1} & 0 & 0\\ X_{2,1} & 0 & 0 \end{pmatrix} \quad (4)$$
Note that since $L$ is a scalar and $W_{1,1}$ is also a scalar, the product in formula (2) is an elementwise (Frobenius) inner product of the two matrices, not a matrix multiplication.
Formula (2) can be expanded further:

$$\frac{\partial L}{\partial W_{1,1}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} X_{1,1} & 0 & 0\\ X_{2,1} & 0 & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,1}}X_{1,1}+\frac{\partial L}{\partial Y_{2,1}}X_{2,1} \quad (5)$$

Similarly,

$$\frac{\partial L}{\partial W_{1,2}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} 0 & X_{1,1} & 0\\ 0 & X_{2,1} & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,2}}X_{1,1}+\frac{\partial L}{\partial Y_{2,2}}X_{2,1} \quad (6)$$

$$\frac{\partial L}{\partial W_{1,3}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} 0 & 0 & X_{1,1}\\ 0 & 0 & X_{2,1} \end{pmatrix} = \frac{\partial L}{\partial Y_{1,3}}X_{1,1}+\frac{\partial L}{\partial Y_{2,3}}X_{2,1} \quad (7)$$

$$\frac{\partial L}{\partial W_{2,1}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} X_{1,2} & 0 & 0\\ X_{2,2} & 0 & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,1}}X_{1,2}+\frac{\partial L}{\partial Y_{2,1}}X_{2,2} \quad (8)$$

$$\frac{\partial L}{\partial W_{2,2}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} 0 & X_{1,2} & 0\\ 0 & X_{2,2} & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,2}}X_{1,2}+\frac{\partial L}{\partial Y_{2,2}}X_{2,2} \quad (9)$$

$$\frac{\partial L}{\partial W_{2,3}}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix}\cdot\begin{pmatrix} 0 & 0 & X_{1,2}\\ 0 & 0 & X_{2,2} \end{pmatrix} = \frac{\partial L}{\partial Y_{1,3}}X_{1,2}+\frac{\partial L}{\partial Y_{2,3}}X_{2,2} \quad (10)$$
Combining formulas (5)–(10), we obtain the derivative of the scalar $L$ with respect to the matrix $W$:
$$\frac{\partial L}{\partial W}=\begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}}X_{1,1}+\frac{\partial L}{\partial Y_{2,1}}X_{2,1} & \frac{\partial L}{\partial Y_{1,2}}X_{1,1}+\frac{\partial L}{\partial Y_{2,2}}X_{2,1} & \frac{\partial L}{\partial Y_{1,3}}X_{1,1}+\frac{\partial L}{\partial Y_{2,3}}X_{2,1}\\ \frac{\partial L}{\partial Y_{1,1}}X_{1,2}+\frac{\partial L}{\partial Y_{2,1}}X_{2,2} & \frac{\partial L}{\partial Y_{1,2}}X_{1,2}+\frac{\partial L}{\partial Y_{2,2}}X_{2,2} & \frac{\partial L}{\partial Y_{1,3}}X_{1,2}+\frac{\partial L}{\partial Y_{2,3}}X_{2,2} \end{pmatrix} \\ = \begin{pmatrix} X_{1,1} & X_{2,1}\\ X_{1,2} & X_{2,2} \end{pmatrix} \begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}}\\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix} =X^T\frac{\partial L}{\partial Y} \quad (11)$$
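The identity $\frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y}$ can be verified numerically. The sketch below picks a hypothetical loss $f(Y)=\sum Y^2$ (so $\frac{\partial L}{\partial Y}=2Y$) purely for illustration, and compares the analytic gradient against centered finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
W = rng.standard_normal((2, 3))

def loss(W):
    # L = f(XW) with f(Y) = sum(Y**2), an illustrative choice
    Y = X @ W
    return np.sum(Y ** 2)

Y = X @ W
dY = 2 * Y                 # dL/dY for f(Y) = sum(Y**2)
dW_analytic = X.T @ dY     # formula (11): dL/dW = X^T dL/dY

# Centered finite differences over every entry of W.
h = 1e-5
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += h
        Wm = W.copy(); Wm[i, j] -= h
        dW_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(dW_analytic - dW_numeric)))  # close to 0
```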
*In the cs231n SVM assignment, $L$ is the loss function, i.e.:
$$L =\frac{1}{N}\sum_i L_i+\lambda\sum_k\sum_l W_{k,l}^2 = \frac{1}{N}\sum_i\sum_{j\neq y_i}\left[\max\left(0, S_{ij}-S_{iy_i}+\Delta\right)\right]+\lambda\sum_k\sum_l W_{k,l}^2 \quad (12)$$
Since $L$ contains sums, for convenience we expand it and differentiate each term separately. The regularization term is straightforward, so here we only consider the derivative of

$$L_i = \sum_{j\neq y_i}\left[\max\left(0, S_{ij}-S_{iy_i}+\Delta\right)\right]$$

with respect to the matrix $W$.
By formula (11), $\frac{\partial L_i}{\partial W} = X^T\frac{\partial L_i}{\partial S}$, where $S = XW$ is the score matrix. From formula (12), $L_i$ depends only on row $i$ of $S$, so

$$\frac{\partial L_i}{\partial S_{mj}} = \begin{cases}\frac{\partial L_i}{\partial S_{ij}} & m = i\\ 0 & m \neq i\end{cases}$$
Going further, the elementwise derivatives are:
$$\frac{\partial L_i}{\partial W} = X^T\frac{\partial L_i}{\partial S} \;\Longrightarrow\; \frac{\partial L_i}{\partial S_{ij}} = \begin{cases} 0 & j \neq y_i \;\&\; \left(S_{ij}-S_{iy_i}+\Delta\right) < 0\\ 1 & j \neq y_i \;\&\; \left(S_{ij}-S_{iy_i}+\Delta\right) > 0\\ -1\cdot num & j = y_i \end{cases}$$
where $num$ is the number of classes $j \neq y_i$ with $S_{ij}-S_{iy_i}+\Delta > 0$.
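For a single example $i$, this piecewise rule can be written as a few lines of numpy. The scores, correct class, and $\Delta = 1$ below are illustrative values, not taken from the assignment:

```python
import numpy as np

s = np.array([3.0, 1.5, 2.8])   # scores S_{i,:} for one example (illustrative)
y = 0                            # correct class y_i
delta = 1.0

margins = s - s[y] + delta
margins[y] = 0                   # the j = y_i term is excluded from L_i
positive = margins > 0

ds = positive.astype(float)      # +1 where the margin is active and j != y_i
ds[y] = -np.sum(positive)        # -num at the correct class
print(ds)                        # [-1.  0.  1.]
```

Here class 2 has margin $2.8 - 3.0 + 1 = 0.8 > 0$, so it gets $+1$, and the correct class gets $-1$ (one active margin).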
Since each $\frac{\partial L_i}{\partial S}$ is nonzero only in row $i$, the sum over $i$ collapses into a single matrix product:

$$\frac{\partial L}{\partial W} =\frac{1}{N}\sum_i \frac{\partial L_i}{\partial W} =\frac{1}{N}X^T \begin{bmatrix} \frac{\partial L_0}{\partial S_{00}} & \frac{\partial L_0}{\partial S_{01}} & \cdots & \frac{\partial L_0}{\partial S_{0C}}\\ \frac{\partial L_1}{\partial S_{10}} & \frac{\partial L_1}{\partial S_{11}} & \cdots & \frac{\partial L_1}{\partial S_{1C}}\\ \vdots & \vdots & \frac{\partial L_i}{\partial S_{ij}} & \vdots\\ \frac{\partial L_N}{\partial S_{N0}} & \frac{\partial L_N}{\partial S_{N1}} & \cdots & \frac{\partial L_N}{\partial S_{NC}} \end{bmatrix}$$
In loop form:

$$\frac{\partial L_i}{\partial W_{m,j}} =\sum_k X^T_{m,k}\frac{\partial L_i}{\partial S_{k,j}} =\begin{cases} 0 & k \neq i\\ X^T_{m,i}\frac{\partial L_i}{\partial S_{i,j}} & k = i \end{cases}$$
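Putting the pieces together, the data-loss gradient $\frac{\partial L}{\partial W} = \frac{1}{N}X^T\frac{\partial L}{\partial S}$ can be sketched in vectorized numpy and checked against finite differences. Shapes, the random seed, and $\Delta = 1$ are illustrative, and regularization is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 5, 4, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, C))
y = rng.integers(0, C, size=N)   # correct class per example
delta = 1.0

def svm_loss(W):
    # Data loss only; the regularization term is omitted for brevity.
    S = X @ W                                         # scores, (N, C)
    margins = S - S[np.arange(N), y][:, None] + delta
    margins[np.arange(N), y] = 0                      # drop j = y_i terms
    return np.sum(np.maximum(0, margins)) / N

# Analytic gradient via the mask dL/dS described above.
S = X @ W
margins = S - S[np.arange(N), y][:, None] + delta
margins[np.arange(N), y] = 0
dS = (margins > 0).astype(float)                      # +1 for active margins
dS[np.arange(N), y] = -np.sum(dS, axis=1)             # -num at correct classes
dW = X.T @ dS / N

# Centered finite differences over every entry of W.
h = 1e-5
dW_numeric = np.zeros_like(W)
for m in range(D):
    for j in range(C):
        Wp = W.copy(); Wp[m, j] += h
        Wm = W.copy(); Wm[m, j] -= h
        dW_numeric[m, j] = (svm_loss(Wp) - svm_loss(Wm)) / (2 * h)

print(np.max(np.abs(dW - dW_numeric)))
```

The mask `dS` is exactly the bracketed matrix above, and the `X.T @ dS` product is the loop form written as a single matrix multiply.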