Matrix Derivative

1.1 Definition

For a function $f(X)$, where $X\in R^{m\times n}$ is a matrix and $f(X)\in R$ is a scalar, we define the derivative of the function as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$, where $\frac{\partial f}{\partial X}$ is a matrix with the same shape as $X$. However, it is sometimes tedious to compute this directly, since we would have to calculate the derivative with respect to each element one by one. Next we will see how to use the trace trick to simplify the calculation.

1.2 Trace of Matrix

We define the trace of a square matrix $A\in R^{n\times n}$ (one with the same number of rows as columns) as $tr(A) = \sum_{i=1}^n A_{ii}$, where $A_{ii}, i=1,\dots,n$ are the diagonal elements of $A$: the trace of a square matrix is the sum of its diagonal elements. For a scalar $a$, we regard it as a $1\times 1$ matrix, and then $tr(a)=a$.
So far the trace may look trivial. It becomes useful when we consider the product of two matrices:
$$tr(A^T B) =\sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \tag{1}$$
where $A, B \in R^{m\times n}$. This is the sum of the products of corresponding elements; in fact, it is analogous to the inner product of two vectors.
Verifying equation $(1)$ is not difficult: let $C = A^T B$; then $c_{ii} = a_i^T b_i$, where $c_{ii}$ is the entry in the $i$th row and $i$th column of $C$, $a_i^T$ is the $i$th row of $A^T$, i.e. the $i$th column of $A$, and $b_i$ is the $i$th column of $B$. So $c_{ii}$ is the inner product of the $i$th column of $A$ and the $i$th column of $B$.
Then $tr(A^T B) = \sum_{i=1}^n c_{ii}$ is the "inner product" of $A$ and $B$.
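
To make the "inner product" interpretation concrete, here is a small numerical check (a minimal sketch using NumPy; the shapes $m=4$, $n=3$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # A in R^{m x n}
B = rng.standard_normal((4, 3))  # B with the same shape

lhs = np.trace(A.T @ B)          # tr(A^T B)
rhs = np.sum(A * B)              # sum of element-wise products

print(np.isclose(lhs, rhs))      # True: tr(A^T B) is the "inner product" of A and B
```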

Here we list some useful trace identities; a quick numerical check follows the list.

  • $a = tr(a)$, where $a$ is a scalar.
  • $tr(A^T) = tr(A)$.
  • $tr(A\pm B) = tr(A)\pm tr(B)$.
  • $tr(A^T B) = tr(BA^T)=\sum_{i,j}A_{ij}B_{ij}$, where $A, B\in R^{n\times n}$.
  • $tr(ABC)=tr(BCA)=tr(CAB)$, where $A\in R^{r\times m}, B\in R^{m\times n}, C\in R^{n\times r}$. In particular, if $a\in R^m$ and $c\in R^n$ are both column vectors, then $tr(a^T Bc)=tr(ca^TB)=tr(Bca^T)$.
  • $tr(A^T(B\odot C)) = tr((A\odot B)^T C)$, where $A, B, C \in R^{m\times n}$ and $\odot$ denotes element-wise multiplication.
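
As mentioned above, here is a quick numerical check of the cyclic property and the element-wise identity (a sketch with arbitrarily chosen shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))   # A in R^{r x m}
B = rng.standard_normal((3, 4))   # B in R^{m x n}
C = rng.standard_normal((4, 2))   # C in R^{n x r}

# Cyclic property: tr(ABC) = tr(BCA) = tr(CAB)
print(np.allclose([np.trace(A @ B @ C), np.trace(B @ C @ A)], np.trace(C @ A @ B)))

# Element-wise identity: tr(A^T (B o C)) = tr((A o B)^T C), all matrices m x n
A2, B2, C2 = (rng.standard_normal((3, 5)) for _ in range(3))
print(np.isclose(np.trace(A2.T @ (B2 * C2)), np.trace((A2 * B2).T @ C2)))
```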

1.3 Trace and Derivative

Note that in this article $f$ is a function which maps a scalar, vector, or matrix to a scalar. Besides, vectors are column vectors by default.
First we show the total differential of a scalar function. If there is a function $f(x)$ with $x\in R$, then $df(x) = f'(x)dx$.
Second, if $x\in R^n$ is a vector, then $df(x) = \nabla f^T dx$, where $\nabla f \in R^n$ is the gradient of $f(x)$, whose $i$th component is $(\nabla f)_i=\frac{\partial f}{\partial x_i}$. Here $dx\in R^n$ is also a vector with $(dx)_i = dx_i$. We can also rewrite this as $df(x) = \sum_i \frac{\partial f}{\partial x_i}dx_i$.
At last, similar to the vector case, the total differential of $f(X)$ can be written as
$$df(X) = \sum_{i,j} \frac{\partial f}{\partial X_{ij}}dX_{ij} \tag{2}$$
where $X\in R^{m\times n}$ and $f(X)$ is a scalar. $\frac{\partial f}{\partial X}$ and $dX$ are both matrices of shape $m\times n$, with $(\frac{\partial f}{\partial X})_{ij}=\frac{\partial f}{\partial X_{ij}}$ and $(dX)_{ij}=dX_{ij}$.

Comparing equation $(2)$ with equation $(1)$ (take $A = \frac{\partial f}{\partial X}$ and $B = dX$), we can see that
$$df(X) = tr\left(\left( \frac{\partial f}{\partial X} \right)^T dX \right) \tag{3}$$
Equation $(3)$ tells us that if we can write the total differential of $f$ in the form of $(3)$, then we can read off the derivative of $f$, namely $\frac{\partial f}{\partial X}$. The trick is to use the trace identities to move $dX$ to the end of the expression, just like in equation $(3)$.

1.4 Total Differential

If $f=uv$, where $u, v$ are scalars, then $df=v\,du + u\,dv$.
If $u,v\in R^n$ and $f=u^T v = \sum_{i}u_iv_i$, where $u_i$ and $v_i$ are the $i$th elements of $u$ and $v$ respectively, then
$$\begin{aligned} df &=d\Big(\sum_{i}u_iv_i\Big) \\ &= \sum_{i}d(u_iv_i) \\ &= \sum_{i} (u_i dv_i + v_i du_i)\\ &= \sum_{i}u_i dv_i + \sum_{i}v_i du_i \\ &= u^T dv + v^T du \end{aligned}$$
Similarly, if $U, V\in R^{m\times n}$ and $f=tr(U^T V)$, then
$$df = tr(V dU^T) + tr(U^T dV)$$
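
These product rules can be checked with small random perturbations standing in for the differentials (a minimal sketch; the step size `eps` and the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 1e-6

# f(u, v) = u^T v
u, v = rng.standard_normal(5), rng.standard_normal(5)
du, dv = eps * rng.standard_normal(5), eps * rng.standard_normal(5)
df_exact = (u + du) @ (v + dv) - u @ v
df_rule = u @ dv + v @ du                        # u^T dv + v^T du
print(np.isclose(df_exact, df_rule, atol=1e-9))  # equal up to second-order terms

# f(U, V) = tr(U^T V)
U, V = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
dU, dV = eps * rng.standard_normal((3, 4)), eps * rng.standard_normal((3, 4))
df_exact = np.trace((U + dU).T @ (V + dV)) - np.trace(U.T @ V)
df_rule = np.trace(V @ dU.T) + np.trace(U.T @ dV)
print(np.isclose(df_exact, df_rule, atol=1e-9))
```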

1.5 Examples

(1) Linear Combination

$f(X)=a^T X b$, where $a\in R^m, b\in R^n, X\in R^{m\times n}$. We need to calculate $\frac{\partial f}{\partial X}$. Hint: $f$ is a scalar and $df$ is a scalar as well, so $df = tr(df)$.
Solution:
$$df = tr(df) = tr(d(a^T X b))=tr(a^T dX\, b)=tr(ba^T dX)$$
Comparing with equation $(3)$, we have $\left(\frac{\partial f}{\partial X}\right)^T=ba^T$, which means $\frac{\partial f}{\partial X}=ab^T$.
We can check this by the shapes: since $X\in R^{m\times n}$, we must have $\frac{\partial f}{\partial X}\in R^{m\times n}$ as well, and $ab^T\in R^{m\times n}$ is easily verified, which shows that the result is reasonable.
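
Beyond the shape check, we can verify $\frac{\partial f}{\partial X}=ab^T$ numerically with element-wise finite differences (a sketch; the sizes $m=3$, $n=4$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
a, b, X = rng.standard_normal(m), rng.standard_normal(n), rng.standard_normal((m, n))

f = lambda X: a @ X @ b          # f(X) = a^T X b, a scalar

# Finite-difference derivative with respect to each element X_ij
eps = 1e-6
grad_num = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        grad_num[i, j] = (f(Xp) - f(X)) / eps

print(np.allclose(grad_num, np.outer(a, b), atol=1e-5))  # df/dX = a b^T
```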

(2) Least Square

$l(w)=\lVert Xw - y\rVert^2$; find $\frac{\partial l}{\partial w}$. In the least squares method, $X\in R^{m\times n}$ is a data matrix, $w\in R^n$ is the weight vector, and $y\in R^m$ is the target label vector.
Solution:
$$l=\lVert Xw - y\rVert^2 = (Xw - y)^T (Xw - y)$$
Notice that $Xw - y$ is a vector now.
$$\begin{aligned} tr(dl) &= dl \\ &=(Xdw)^T(Xw-y)+(Xw-y)^T(Xdw)\\ &= 2(Xw-y)^T X dw \end{aligned}$$
Then $\frac{\partial l}{\partial w} =2X^T (Xw-y)$.
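
A quick numerical sketch: compare the formula $2X^T(Xw-y)$ with a finite-difference gradient, and note that setting the gradient to zero yields the normal equations $X^TXw = X^Ty$ (the data here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 20, 5
X, y, w = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)

l = lambda w: np.sum((X @ w - y) ** 2)     # l(w) = ||Xw - y||^2
grad_formula = 2 * X.T @ (X @ w - y)

# Central finite-difference gradient
eps = 1e-6
grad_num = np.array([(l(w + eps * e) - l(w - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(grad_num, grad_formula, atol=1e-4))

# Setting the gradient to zero gives the normal equations: X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(2 * X.T @ (X @ w_star - y), 0, atol=1e-8))
```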

(3) PCA

$l(w) = w^T \Sigma w + \lambda (1 - w^T w)$, where $w\in R^n$, $\Sigma\in R^{n\times n}$ is symmetric (i.e. $\Sigma^T = \Sigma$), and $\lambda$ is a scalar. Find $\frac{\partial l}{\partial w}$.
Solution:
$$\begin{aligned} dl &= tr(dl) \\ &= tr(dw^T \Sigma w) + tr(w^T\Sigma\, dw) - \lambda\big(tr(dw^T w) + tr(w^T dw)\big) \\ &= tr(w^T\Sigma^T dw) + tr(w^T\Sigma\, dw) - \lambda\big(tr(w^T dw) + tr(w^T dw)\big)\\ &= tr\big(2(\Sigma w - \lambda w)^T dw\big) \end{aligned}$$
Then we have $\frac{\partial l}{\partial w} = 2(\Sigma w-\lambda w)$.
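
Setting $\frac{\partial l}{\partial w}=0$ gives $\Sigma w = \lambda w$, i.e. the stationary points are eigenvectors of $\Sigma$, which is exactly what PCA looks for. A small numerical sketch (random data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.standard_normal((100, 4))
Sigma = np.cov(data, rowvar=False)         # symmetric covariance matrix

# Gradient of l(w) = w^T Sigma w + lambda (1 - w^T w)
grad = lambda w, lam: 2 * (Sigma @ w - lam * w)

# Eigenvectors of Sigma make the gradient vanish
eigvals, eigvecs = np.linalg.eigh(Sigma)
w, lam = eigvecs[:, -1], eigvals[-1]       # top eigenpair = first principal direction
print(np.allclose(grad(w, lam), 0))
```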

(4) SVM

What the SVM needs to do is find the proper weight $w$ which solves
$$\min_{w} \quad \frac{1}{2}\lVert w \rVert ^2 \qquad s.t.\quad 1 - y_i(w^T x_i) \le 0, \quad i=1,2,3,\cdots, M$$
where $w,x_i\in R^n$, $y_i\in \{-1,1\}$, and $M$ is the number of data points. The Lagrangian is
$$L(w,\lambda) = \frac{1}{2}\lVert w \rVert ^2 +\sum_{i=1}^M \lambda_i\big(1 - y_i(w^T x_i)\big)$$
where $\lambda\in R^M$ and $\lambda_i$ is the $i$th component of $\lambda$; here $\lambda \succcurlyeq 0$, which means $\lambda_i\ge 0$.
Find $\frac{\partial L}{\partial w}$.
Solution:
$$\begin{aligned} dL &= tr(w^T dw) - \sum_{i=1}^M \lambda_i y_i\, tr(x_i^T dw) = tr\Big(\big(w - \sum_{i=1}^M \lambda_i y_i x_i\big)^T dw\Big)\\ \frac{\partial L}{\partial w} &= w - \sum_{i=1}^M \lambda_i y_i x_i \end{aligned}$$
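
As a final sanity check, the gradient formula can be compared against finite differences on random data (a sketch; this only verifies the formula, not the full SVM solution):

```python
import numpy as np

rng = np.random.default_rng(6)
M, n = 10, 3
x = rng.standard_normal((M, n))             # rows are the data points x_i
y = rng.choice([-1.0, 1.0], size=M)         # labels y_i
lam = rng.uniform(0, 1, size=M)             # lambda_i >= 0
w = rng.standard_normal(n)

L = lambda w: 0.5 * w @ w + np.sum(lam * (1 - y * (x @ w)))
grad_formula = w - np.sum((lam * y)[:, None] * x, axis=0)   # w - sum_i lambda_i y_i x_i

eps = 1e-6
grad_num = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(grad_num, grad_formula, atol=1e-5))
```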
I find the trace tricks make all of these calculations simple enough.
