cs231n - assignment1 Supplement: Matrix Derivatives

cs231n assignment1 asks us to compute the gradient of the SVM loss. This involves some matrix calculus that the official course notes cover only briefly, and the topic itself is not easy to grasp on first contact, so I collected the following material to deepen my understanding.

First Steps: Differentiating a Scalar with Respect to a Matrix

The derivative of a scalar $f$ with respect to a matrix $X$ is defined as
$$\frac{\partial f}{\partial X} = \left[\frac{\partial f}{\partial X_{ij}}\right],$$
i.e., $f$ is differentiated with respect to each element of $X$ and the results are arranged into a matrix of the same shape as $X$. Consider the following example.

Suppose $L = f(Y)$ and $Y = XW$, where $X$ is $2 \times 2$, $W$ is $2 \times 3$, $Y$ is $2 \times 3$, and $L$ is a scalar.
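To make the shapes concrete, here is a minimal NumPy sketch of this setup. The text leaves $f$ abstract; below, $f(Y) = \sum_{ij} Y_{ij}^2$ is picked purely as a stand-in (this choice, and all variable names, are mine), because its gradient $\partial L/\partial Y = 2Y$ is easy to write down and will be handy for checking later formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))   # X is (2, 2)
W = rng.standard_normal((2, 3))   # W is (2, 3)
Y = X @ W                         # Y[i, j] = sum_k X[i, k] * W[k, j], shape (2, 3)

# Stand-in scalar function: f(Y) = sum of squared entries, so L is a scalar.
L = (Y ** 2).sum()
print(Y.shape, L)                 # (2, 3) and a single float
```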

By the definition of matrix multiplication,
$$Y_{i,j} = \sum_{k=1}^{D} X_{i,k} W_{k,j}$$
(here the inner dimension $D = 2$), so
$$\frac{\partial Y_{i,j}}{\partial X_{m,k}} = \begin{cases} 0 & i \neq m \\ W_{k,j} & i = m \end{cases}$$
and similarly
$$\frac{\partial Y_{i,j}}{\partial W_{k,m}} = \begin{cases} 0 & j \neq m \\ X_{i,k} & j = m \end{cases} \quad (1)$$
See Vector, Matrix, and Tensor Derivatives for details.

By the chain rule, the derivative of the scalar $L$ with respect to $W_{1,1}$ is
$$\frac{\partial L}{\partial W_{1,1}} = \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial W_{1,1}} \quad (2)$$

where
$$\frac{\partial L}{\partial Y} = \begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}} \\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix} \quad (3)$$
and, by equation (1),
$$\frac{\partial Y}{\partial W_{1,1}} = \begin{pmatrix} X_{1,1} & 0 & 0 \\ X_{2,1} & 0 & 0 \end{pmatrix} \quad (4)$$

Note that since both $L$ and $W_{1,1}$ are scalars, the product in equation (2) is an elementwise (Frobenius) inner product of the two matrices, i.e., the sum of all elementwise products, not a matrix multiplication.
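The sketch below makes this point explicit using the $f(Y) = \sum Y^2$ stand-in from above: the analytic value of equation (2) is `np.sum(dLdY * dYdW11)` (elementwise product, then a sum over all entries), and it matches a finite-difference estimate obtained by perturbing only $W_{1,1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
W = rng.standard_normal((2, 3))
Y = X @ W
dLdY = 2 * Y                       # gradient of the stand-in f(Y) = sum(Y**2)

# dY/dW_{1,1} from equation (4): the first column holds X[:, 0], the rest is zero.
dYdW11 = np.zeros_like(Y)
dYdW11[:, 0] = X[:, 0]

# Equation (2) as an elementwise (Frobenius) inner product, not a matmul.
analytic = np.sum(dLdY * dYdW11)

# Finite-difference check: perturb the single entry W[0, 0].
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
numeric = (((X @ Wp) ** 2).sum() - (Y ** 2).sum()) / eps
print(analytic, numeric)           # agree to ~6 significant digits
```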

Equation (2) can then be expanded:
$$\frac{\partial L}{\partial W_{1,1}} = \begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}} \\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix} \cdot \begin{pmatrix} X_{1,1} & 0 & 0 \\ X_{2,1} & 0 & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,1}} X_{1,1} + \frac{\partial L}{\partial Y_{2,1}} X_{2,1} \quad (5)$$

Similarly, writing $\frac{\partial L}{\partial Y}$ for the matrix in equation (3):
$$\frac{\partial L}{\partial W_{1,2}} = \frac{\partial L}{\partial Y} \cdot \begin{pmatrix} 0 & X_{1,1} & 0 \\ 0 & X_{2,1} & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,2}} X_{1,1} + \frac{\partial L}{\partial Y_{2,2}} X_{2,1} \quad (6)$$

$$\frac{\partial L}{\partial W_{1,3}} = \frac{\partial L}{\partial Y} \cdot \begin{pmatrix} 0 & 0 & X_{1,1} \\ 0 & 0 & X_{2,1} \end{pmatrix} = \frac{\partial L}{\partial Y_{1,3}} X_{1,1} + \frac{\partial L}{\partial Y_{2,3}} X_{2,1} \quad (7)$$

$$\frac{\partial L}{\partial W_{2,1}} = \frac{\partial L}{\partial Y} \cdot \begin{pmatrix} X_{1,2} & 0 & 0 \\ X_{2,2} & 0 & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,1}} X_{1,2} + \frac{\partial L}{\partial Y_{2,1}} X_{2,2} \quad (8)$$

$$\frac{\partial L}{\partial W_{2,2}} = \frac{\partial L}{\partial Y} \cdot \begin{pmatrix} 0 & X_{1,2} & 0 \\ 0 & X_{2,2} & 0 \end{pmatrix} = \frac{\partial L}{\partial Y_{1,2}} X_{1,2} + \frac{\partial L}{\partial Y_{2,2}} X_{2,2} \quad (9)$$

$$\frac{\partial L}{\partial W_{2,3}} = \frac{\partial L}{\partial Y} \cdot \begin{pmatrix} 0 & 0 & X_{1,2} \\ 0 & 0 & X_{2,2} \end{pmatrix} = \frac{\partial L}{\partial Y_{1,3}} X_{1,2} + \frac{\partial L}{\partial Y_{2,3}} X_{2,2} \quad (10)$$

Combining equations (5) through (10), we finally obtain the derivative of the scalar $L$ with respect to the matrix $W$:

$$\frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} X_{1,1} + \frac{\partial L}{\partial Y_{2,1}} X_{2,1} & \frac{\partial L}{\partial Y_{1,2}} X_{1,1} + \frac{\partial L}{\partial Y_{2,2}} X_{2,1} & \frac{\partial L}{\partial Y_{1,3}} X_{1,1} + \frac{\partial L}{\partial Y_{2,3}} X_{2,1} \\ \frac{\partial L}{\partial Y_{1,1}} X_{1,2} + \frac{\partial L}{\partial Y_{2,1}} X_{2,2} & \frac{\partial L}{\partial Y_{1,2}} X_{1,2} + \frac{\partial L}{\partial Y_{2,2}} X_{2,2} & \frac{\partial L}{\partial Y_{1,3}} X_{1,2} + \frac{\partial L}{\partial Y_{2,3}} X_{2,2} \end{pmatrix} \\ = \begin{pmatrix} X_{1,1} & X_{2,1} \\ X_{1,2} & X_{2,2} \end{pmatrix} \begin{pmatrix} \frac{\partial L}{\partial Y_{1,1}} & \frac{\partial L}{\partial Y_{1,2}} & \frac{\partial L}{\partial Y_{1,3}} \\ \frac{\partial L}{\partial Y_{2,1}} & \frac{\partial L}{\partial Y_{2,2}} & \frac{\partial L}{\partial Y_{2,3}} \end{pmatrix} = X^T \frac{\partial L}{\partial Y} \quad (11)$$
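Equation (11) is easy to check numerically. A sketch, again using the $f(Y) = \sum Y^2$ stand-in: compare $X^T \frac{\partial L}{\partial Y}$ against an entry-by-entry finite-difference gradient of $L$ with respect to $W$.

```python
import numpy as np

def f(Y):
    return (Y ** 2).sum()          # stand-in scalar function; dL/dY = 2Y

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2))
W = rng.standard_normal((2, 3))
dLdY = 2 * (X @ W)

analytic = X.T @ dLdY              # equation (11): dL/dW = X^T (dL/dY)

# Entry-by-entry numerical gradient of L with respect to W.
numeric = np.zeros_like(W)
eps = 1e-6
for k in range(W.shape[0]):
    for l in range(W.shape[1]):
        Wp = W.copy()
        Wp[k, l] += eps
        numeric[k, l] = (f(X @ Wp) - f(X @ W)) / eps

print(np.abs(analytic - numeric).max())   # ~1e-5 or smaller
```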

In the cs231n SVM assignment, $L$ is the loss function:

$$L = \frac{1}{N} \sum_i L_i + \lambda \sum_k \sum_l W_{k,l}^2 = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max\left(0,\, S_{ij} - S_{iy_i} + \Delta\right) \right] + \lambda \sum_k \sum_l W_{k,l}^2 \quad (12)$$
where $S = XW$ is the $N \times C$ score matrix ($S_{ij}$ is the score of class $j$ for sample $i$).
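For reference, a direct vectorized NumPy sketch of equation (12) (the function and argument names are mine, not the assignment's exact signature; `reg` plays the role of $\lambda$ and `delta` of $\Delta$):

```python
import numpy as np

def svm_loss(W, X, y, reg, delta=1.0):
    """Equation (12): mean hinge loss over N samples plus L2 regularization.

    W: (D, C) weights; X: (N, D) data; y: (N,) correct class indices.
    """
    N = X.shape[0]
    S = X @ W                                  # scores, S[i, j]
    correct = S[np.arange(N), y][:, None]      # S_{i, y_i} as an (N, 1) column
    margins = np.maximum(0, S - correct + delta)
    margins[np.arange(N), y] = 0               # the inner sum excludes j == y_i
    return margins.sum() / N + reg * (W ** 2).sum()
```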

Because $L$ contains sums over many terms, it is convenient to expand it and differentiate term by term. The regularization term is straightforward, so we will not dwell on it here; we only consider the derivative of
$$L_i = \sum_{j \neq y_i} \left[ \max\left(0,\, S_{ij} - S_{iy_i} + \Delta\right) \right]$$
with respect to the matrix $W$.

By equation (11), with the score matrix $S$ playing the role of $Y$,
$$\frac{\partial L_i}{\partial W} = X^T \frac{\partial L_i}{\partial S}.$$
From equation (12), $L_i$ involves only row $i$ of $S$, so
$$\frac{\partial L_i}{\partial S_{mj}} = \begin{cases} \frac{\partial L_i}{\partial S_{ij}} & m = i \\ 0 & m \neq i \end{cases}$$

For that nonzero row, differentiating the hinge terms gives
$$\frac{\partial L_i}{\partial S_{ij}} = \begin{cases} 0 & j \neq y_i \ \text{and}\ S_{ij} - S_{iy_i} + \Delta < 0 \\ 1 & j \neq y_i \ \text{and}\ S_{ij} - S_{iy_i} + \Delta > 0 \\ -num & j = y_i \end{cases}$$
where $num$ is the number of classes $j \neq y_i$ with $S_{ij} - S_{iy_i} + \Delta > 0$: each such violated margin contributes a $-1$ through its $-S_{iy_i}$ term.
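A sketch of building all rows $\frac{\partial L_i}{\partial S_{ij}}$ at once as an $N \times C$ matrix (names are mine): mark every violated margin with a 1, then write $-num$ into the $y_i$ column.

```python
import numpy as np

def hinge_grad_wrt_scores(S, y, delta=1.0):
    """Stack the rows dL_i/dS_{ij} into one (N, C) matrix."""
    N = S.shape[0]
    margins = S - S[np.arange(N), y][:, None] + delta
    dS = (margins > 0).astype(S.dtype)   # 1 where the margin is violated
    dS[np.arange(N), y] = 0              # j == y_i is excluded from the count
    num = dS.sum(axis=1)                 # "num": violated margins per sample
    dS[np.arange(N), y] = -num           # the -num entry at j == y_i
    return dS
```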

Summing over all samples, and noting that $\frac{\partial L_i}{\partial S}$ has only its $i$-th row nonzero, those rows stack into a single $N \times C$ matrix:
$$\frac{\partial L}{\partial W} = \frac{1}{N} \sum_i \frac{\partial L_i}{\partial W} = \frac{1}{N}\, X^T \begin{bmatrix} \frac{\partial L_0}{\partial S_{0,0}} & \frac{\partial L_0}{\partial S_{0,1}} & \cdots & \frac{\partial L_0}{\partial S_{0,C-1}} \\ \frac{\partial L_1}{\partial S_{1,0}} & \frac{\partial L_1}{\partial S_{1,1}} & \cdots & \frac{\partial L_1}{\partial S_{1,C-1}} \\ \vdots & \vdots & \frac{\partial L_i}{\partial S_{i,j}} & \vdots \\ \frac{\partial L_{N-1}}{\partial S_{N-1,0}} & \frac{\partial L_{N-1}}{\partial S_{N-1,1}} & \cdots & \frac{\partial L_{N-1}}{\partial S_{N-1,C-1}} \end{bmatrix}$$
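Putting the pieces together gives a fully vectorized gradient: build the stacked matrix as in the sketch above, left-multiply by $X^T$, and divide by $N$. The regularization term $\lambda \sum_k \sum_l W_{k,l}^2$ contributes $2\lambda W$ (again, names are mine):

```python
import numpy as np

def svm_grad(W, X, y, reg, delta=1.0):
    """Vectorized dL/dW = (1/N) X^T dS + 2 * reg * W."""
    N = X.shape[0]
    S = X @ W
    margins = S - S[np.arange(N), y][:, None] + delta
    dS = (margins > 0).astype(S.dtype)
    dS[np.arange(N), y] = 0
    dS[np.arange(N), y] = -dS.sum(axis=1)    # -num in the y_i column
    return X.T @ dS / N + 2 * reg * W
```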

In loop form, for a single sample $i$:
$$\frac{\partial L_i}{\partial W_{m,j}} = \sum_k X^T_{m,k} \frac{\partial L_i}{\partial S_{k,j}} = X^T_{m,i} \frac{\partial L_i}{\partial S_{i,j}} = X_{i,m} \frac{\partial L_i}{\partial S_{i,j}},$$
since every term with $k \neq i$ vanishes.
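This is exactly what the classic two-loop implementation does: each sample $i$ adds multiples of the row $X_{i,:}$ into columns of $dW$. A naive sketch consistent with the formula above (names are mine):

```python
import numpy as np

def svm_grad_naive(W, X, y, reg, delta=1.0):
    N, C = X.shape[0], W.shape[1]
    dW = np.zeros_like(W)
    for i in range(N):
        scores = X[i] @ W
        for j in range(C):
            if j == y[i]:
                continue
            if scores[j] - scores[y[i]] + delta > 0:
                dW[:, j] += X[i]         # dL_i/dS_{i,j} = 1
                dW[:, y[i]] -= X[i]      # accumulates to -num in column y_i
    return dW / N + 2 * reg * W
```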
