Matrix Derivative
1.1 Definition
For a function $f(X)$, where $X\in R^{m\times n}$ is a matrix and $f(X)\in R$ is a scalar, we define the derivative of the function as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$, where $\frac{\partial f}{\partial X}$ is a matrix with the same shape as $X$. However, computing this directly is often tedious, since we have to calculate the derivative with respect to each element one by one. Next we will see how to use the trace trick to simplify the computation.
1.2 Trace of Matrix
We define the trace of a square matrix $A\in R^{n\times n}$ (one with the same number of rows as columns) as $tr(A) = \sum_{i=1}^n A_{ii}$, where $A_{ii}, i=1,\dots,n$ are the diagonal elements of $A$. In other words, the trace of a square matrix is the sum of its diagonal elements. For a scalar $a$, we regard it as a matrix of shape $1\times 1$, so that $tr(a)=a$.
So far the trace may seem trivial, but it becomes useful for the product of two matrices:
$$tr(A^T B) =\sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \tag{1}$$
where $A, B \in R^{m\times n}$. This is the sum of the products of corresponding elements; in fact, it is analogous to the inner product of two vectors.
Verifying equation $(1)$ is not difficult: let $C = A^T B$, so that $c_{ii} = a_i^T b_i$, where $c_{ii}$ is the entry in the $i$th row and $i$th column of $C$, $a_i^T$ is the $i$th row of $A^T$, i.e. the $i$th column of $A$, and $b_i$ is the $i$th column of $B$. So $c_{ii}$ is the inner product of the $i$th column of $A$ and the $i$th column of $B$. Then $tr(A^T B) = \sum_{i=1}^n c_{ii}$ is the “inner product” of $A$ and $B$.
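As a quick sanity check (not part of the original derivation), here is a small numpy sketch verifying equation $(1)$; the shapes and random seed are arbitrary:

```python
import numpy as np

# Check equation (1): tr(A^T B) equals the sum of
# element-wise products of A and B.
rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

lhs = np.trace(A.T @ B)   # tr(A^T B)
rhs = np.sum(A * B)       # sum_{ij} A_ij B_ij
print(np.isclose(lhs, rhs))  # True
```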
Here we list some useful trace tricks (a quick numerical spot-check follows the list).
- $a = tr(a)$, where $a$ is a scalar.
- $tr(A^T) = tr(A)$.
- $tr(A\pm B) = tr(A)\pm tr(B)$.
- $tr(A^T B) = tr(BA^T)=\sum_{i,j}A_{ij}B_{ij}$, where $A, B\in R^{m\times n}$.
- $tr(ABC)=tr(BCA)=tr(CAB)$, where $A\in R^{r\times m}, B\in R^{m\times n}, C\in R^{n\times r}$. In particular, if $a\in R^m$ and $c\in R^n$ are both column vectors and $B\in R^{m\times n}$, then $tr(a^T Bc)=tr(ca^TB)=tr(Bca^T)$.
- $tr(A^T(B\odot C)) = tr((A\odot B)^T C)$, where $A, B, C \in R^{m\times n}$, and $\odot$ denotes element-wise multiplication.
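The cyclic and element-wise identities are the ones used most often below, so here is a minimal numpy spot-check of them (arbitrary shapes, not from the original text):

```python
import numpy as np

# Spot-check tr(ABC) = tr(BCA) = tr(CAB) and the element-wise
# identity tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C).
rng = np.random.default_rng(1)
r, m, n = 2, 4, 3
A = rng.standard_normal((r, m))
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, r))
t = np.trace(A @ B @ C)
print(np.isclose(t, np.trace(B @ C @ A)), np.isclose(t, np.trace(C @ A @ B)))

X, Y, Z = (rng.standard_normal((m, n)) for _ in range(3))
print(np.isclose(np.trace(X.T @ (Y * Z)), np.trace((X * Y).T @ Z)))
```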
1.3 Trace and Derivative
Note that throughout this article, $f$ is a function that maps a scalar, vector, or matrix to a scalar. In addition, vectors are column vectors by default.
First we show the total differential of a scalar function. For a function $f(x)$ with $x\in R$, we have $df(x) = f'(x)dx$.
Second, if $x\in R^n$ is a vector, then $df(x) = \nabla f^T dx$, where $\nabla f \in R^n$ is the gradient of $f(x)$, whose $i$th component is $(\nabla f)_i=\frac{\partial f}{\partial x_i}$. Here $dx\in R^n$ is also a vector, with $(dx)_i = dx_i$. We can also rewrite this as $df(x) = \sum_i \frac{\partial f}{\partial x_i}dx_i$.
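A finite-difference check of this relation, sketched on the toy function $f(x)=x^T x$ (chosen here for illustration; its gradient is $2x$):

```python
import numpy as np

# Check df(x) ≈ ∇f^T dx for f(x) = x^T x, whose gradient is 2x.
rng = np.random.default_rng(2)
x = rng.standard_normal(5)
dx = 1e-6 * rng.standard_normal(5)

f = lambda v: v @ v
grad = 2 * x
print(np.isclose(f(x + dx) - f(x), grad @ dx))  # True up to O(||dx||^2)
```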
At last, similar to the vector case, the total differential of $f(X)$ can be written as
$$df(X) = \sum_{i,j} \frac{\partial f}{\partial X_{ij}}dX_{ij} \tag{2}$$
where $X\in R^{m\times n}$ and $f(X)$ is a scalar. Here $\frac{\partial f}{\partial X}$ and $dX$ are both matrices of shape $m\times n$, with $(\frac{\partial f}{\partial X})_{ij}=\frac{\partial f}{\partial X_{ij}}$ and $(dX)_{ij}=dX_{ij}$.
Comparing equation $(1)$ with equation $(2)$, we can see that
$$df(X) = tr\left(\left( \frac{\partial f}{\partial X} \right)^T dX \right) \tag{3}$$
Equation $(3)$ suggests that if we can express the total differential of $f$ in the form of $(3)$, then we can read off the derivative of $f$, namely $\frac{\partial f}{\partial X}$. The trick is to use the trace identities to move $dX$ to the end of the expression, just like in equation (3).
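To illustrate equation $(3)$ numerically, take the toy function $f(X)=tr(X^T X)$ (an assumed example, not one worked in the text); the same trick gives $df = tr(dX^T X)+tr(X^T dX)=tr((2X)^T dX)$, so $\frac{\partial f}{\partial X}=2X$:

```python
import numpy as np

# Check equation (3) on f(X) = tr(X^T X), where ∂f/∂X = 2X.
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 3))
dX = 1e-6 * rng.standard_normal((4, 3))

f = lambda M: np.trace(M.T @ M)
dfdX = 2 * X
print(np.isclose(f(X + dX) - f(X), np.trace(dfdX.T @ dX)))  # True
```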
1.4 Total Differential
If $f=uv$, then $df=vdu + udv$, where $u, v$ are scalars.
If $u,v\in R^n$ and $f=u^T v = \sum_{i}u_iv_i$, where $u_i$ and $v_i$ are the $i$th elements of $u$ and $v$ respectively, then
$$\begin{aligned} df &=d\left(\sum_{i}u_iv_i\right) \\ &= \sum_{i}d(u_iv_i) \\ &= \sum_{i} (u_i dv_i + v_i du_i)\\ &= \sum_{i}u_i dv_i + \sum_{i}v_i du_i \\ &= u^T dv + v^T du \end{aligned}$$
Similarly, if $U, V\in R^{m\times n}$ and $f=tr(U^T V)$, then
$$df = tr(VdU^T) + tr(U^T dV)$$
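A minimal finite-difference sketch of this matrix product rule (arbitrary shapes, for illustration only):

```python
import numpy as np

# Check df = tr(V dU^T) + tr(U^T dV) for f = tr(U^T V).
rng = np.random.default_rng(4)
U, V = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
dU, dV = 1e-6 * rng.standard_normal((4, 3)), 1e-6 * rng.standard_normal((4, 3))

df_exact = np.trace((U + dU).T @ (V + dV)) - np.trace(U.T @ V)
df_rule = np.trace(V @ dU.T) + np.trace(U.T @ dV)
print(np.isclose(df_exact, df_rule))  # True up to second-order terms
```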
1.5 Examples
(1) Linear Combination
$f(X)=a^T X b$, where $a\in R^m, b\in R^n, X\in R^{m\times n}$. We need to calculate $\frac{\partial f}{\partial X}$. Hint: $f$ is a scalar and $df$ is a scalar as well, so $df = tr(df)$.
Solution:
$$df = tr(df) = tr(d(a^T X b))=tr(a^T dX\, b)=tr(ba^T dX)$$
Comparing with equation (3), we have $(\frac{\partial f}{\partial X})^T=ba^T$, which means $\frac{\partial f}{\partial X}=ab^T$.
We can check this by the shape: since $X\in R^{m\times n}$, $\frac{\partial f}{\partial X}\in R^{m\times n}$ as well, and $ab^T\in R^{m\times n}$ can be verified easily. This shows that the result is reasonable.
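We can also check the result numerically; a short sketch with random data (shapes arbitrary):

```python
import numpy as np

# Check Example (1): ∂(a^T X b)/∂X = a b^T, via equation (3).
rng = np.random.default_rng(5)
m, n = 4, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))
dX = 1e-6 * rng.standard_normal((m, n))

df = a @ (X + dX) @ b - a @ X @ b
print(np.isclose(df, np.trace(np.outer(a, b).T @ dX)))  # True
```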
(2) Least Squares
$l(w)=\lVert Xw - y\rVert^2$; find $\frac{\partial l}{\partial w}$. In the least squares method, $X\in R^{m\times n}$ is the data matrix, $w\in R^n$ is the weight vector, and $y\in R^m$ is the target label vector.
Solution:
$$l=\lVert Xw - y\rVert^2 = (Xw - y)^T (Xw - y)$$
Notice that $Xw - y$ is a vector now.
$$\begin{aligned} tr(dl) &= dl \\ &=(Xdw)^T(Xw-y)+(Xw-y)^T(Xdw)\\ &= 2(Xw-y)^T X dw \end{aligned}$$
where the two terms combine because a scalar equals its own transpose.
Then $\frac{\partial l}{\partial w} =2X^T (Xw-y)$.
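A finite-difference check of this gradient on random data (a sanity-check sketch, not part of the derivation):

```python
import numpy as np

# Check the least-squares gradient 2 X^T (Xw - y).
rng = np.random.default_rng(6)
m, n = 6, 3
X = rng.standard_normal((m, n))
w = rng.standard_normal(n)
y = rng.standard_normal(m)
dw = 1e-6 * rng.standard_normal(n)

l = lambda v: np.sum((X @ v - y) ** 2)
grad = 2 * X.T @ (X @ w - y)
print(np.isclose(l(w + dw) - l(w), grad @ dw))  # True
```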
(3) PCA
$l(w) = w^T \Sigma w + \lambda (1 - w^T w)$, where $w\in R^n$, $\Sigma\in R^{n\times n}$ is symmetric (i.e. $\Sigma^T = \Sigma$), and $\lambda$ is a scalar. Find $\frac{\partial l}{\partial w}$.
Solution:
$$\begin{aligned} dl &= tr(dl) \\ &= tr(dw^T \Sigma w) + tr(w^T\Sigma dw) - \lambda(tr(dw^T w) + tr(w^T dw)) \\ &= tr(w^T\Sigma^T dw) + tr(w^T\Sigma dw) - \lambda(tr(w^T dw) + tr(w^T dw))\\ &= tr(2(\Sigma w - \lambda w)^T dw) \end{aligned}$$
where the last step uses $\Sigma^T = \Sigma$.
Then we have $\frac{\partial l}{\partial w} = 2(\Sigma w-\lambda w)$.
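Again, a quick numerical sketch with a random symmetric $\Sigma$ (constructed here as $AA^T$ purely for illustration):

```python
import numpy as np

# Check the PCA gradient 2(Σw - λw).
rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n))
Sigma = A @ A.T                 # symmetric, as required
w = rng.standard_normal(n)
lam = 0.5
dw = 1e-6 * rng.standard_normal(n)

l = lambda v: v @ Sigma @ v + lam * (1 - v @ v)
grad = 2 * (Sigma @ w - lam * w)
print(np.isclose(l(w + dw) - l(w), grad @ dw))  # True
```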
(4) SVM
What the SVM needs to do is find the proper weight $w$ that solves
$$\min_{w} \quad \frac{1}{2}\lVert w \rVert ^2 \\ s.t.\quad 1 - y_i(w^T x_i) \le 0,\quad i=1,2,3,\cdots, M$$
where $w,x_i\in R^n$, $y_i\in \{-1,1\}$, and $M$ is the number of data points. The Lagrangian is
$$L(w,\lambda) = \frac{1}{2}\lVert w \rVert ^2 +\sum_{i=1}^M \lambda_i(1 - y_i(w^T x_i))$$
where $\lambda\in R^M$ and $\lambda_i$ is the $i$th component of $\lambda$; here $\lambda \succcurlyeq 0$, which means $\lambda_i\ge 0$.
Find $\frac{\partial L}{\partial w}$.
Solution:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^M \lambda_i y_i x_i$$
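And a last finite-difference sketch on toy data (the data, labels, and multipliers below are assumed for illustration):

```python
import numpy as np

# Check ∂L/∂w = w - Σ_i λ_i y_i x_i for the SVM Lagrangian.
rng = np.random.default_rng(8)
M, n = 5, 3
Xs = rng.standard_normal((M, n))      # rows are the x_i
ys = rng.choice([-1.0, 1.0], size=M)
lams = rng.uniform(0, 1, size=M)      # λ_i ≥ 0
w = rng.standard_normal(n)
dw = 1e-6 * rng.standard_normal(n)

L = lambda v: 0.5 * v @ v + np.sum(lams * (1 - ys * (Xs @ v)))
grad = w - (lams * ys) @ Xs
print(np.isclose(L(w + dw) - L(w), grad @ dw))  # True
```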
I find it simple enough with the trace tricks…