Matrix Derivative
1.1 Definition
For a function $f(X)$, where $X\in R^{m\times n}$ is a matrix and $f(X)\in R$ is a scalar, we define the derivative of the function as $\frac{\partial f}{\partial X}=\left[\frac{\partial f}{\partial X_{ij}}\right]$, where $\frac{\partial f}{\partial X}$ is a matrix with the same shape as $X$. However, computing this directly is often tedious, since we have to calculate the derivative with respect to each element one by one. Next we will see how to use the trace trick to simplify the computation.
1.2 Trace of Matrix
We define the trace of a square matrix $A\in R^{n\times n}$ (one with the same number of rows as columns) as $tr(A) = \sum_{i=1}^n A_{ii}$, where $A_{ii}, i=1,\dots,n$ are the diagonal elements of $A$. In other words, the trace of a square matrix is the sum of its diagonal elements. For a scalar $a$, we regard it as a matrix of shape $1\times 1$, so that $tr(a)=a$.
So far the trace may seem trivial, but it becomes useful for the product of two matrices:
$$tr(A^T B) =\sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \tag{1}$$
where $A, B \in R^{m\times n}$. This is the sum of the products of corresponding elements; in fact, it is analogous to the inner product of two vectors.
Verifying equation $(1)$ is not difficult: let $C = A^T B$, so that $c_{ii} = a_i^T b_i$, where $c_{ii}$ is the entry in the $i$th row and $i$th column of $C$, $a_i^T$ is the $i$th row of $A^T$, i.e. the $i$th column of $A$, and $b_i$ is the $i$th column of $B$. So $c_{ii}$ is the inner product of the $i$th column of $A$ and the $i$th column of $B$. Then $tr(A^T B) = \sum_{i=1}^n c_{ii}$ is the “inner product” of $A$ and $B$.
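As a quick sanity check (not part of the original derivation), here is a small numpy sketch verifying equation $(1)$; the shapes and random seed are arbitrary:

```python
import numpy as np

# Check equation (1): tr(A^T B) equals the sum of
# element-wise products of A and B.
rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

lhs = np.trace(A.T @ B)   # tr(A^T B)
rhs = np.sum(A * B)       # sum_{ij} A_ij B_ij
print(np.isclose(lhs, rhs))  # True
```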
Here we list some useful trace tricks (a quick numerical spot-check follows the list).
- $a = tr(a)$, where $a$ is a scalar.
- $tr(A^T) = tr(A)$.
- $tr(A\pm B) = tr(A)\pm tr(B)$.
- $tr(A^T B) = tr(BA^T)=\sum_{i,j}A_{ij}B_{ij}$, where $A, B\in R^{m\times n}$.
- $tr(ABC)=tr(BCA)=tr(CAB)$, where $A\in R^{r\times m}, B\in R^{m\times n}, C\in R^{n\times r}$. In particular, if $a\in R^m$ and $c\in R^n$ are both column vectors and $B\in R^{m\times n}$, then $tr(a^T Bc)=tr(ca^TB)=tr(Bca^T)$.
- $tr(A^T(B\odot C)) = tr((A\odot B)^T C)$, where $A, B, C \in R^{m\times n}$, and $\odot$ denotes element-wise multiplication.
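The cyclic and element-wise identities are the ones used most often below, so here is a minimal numpy spot-check of them (arbitrary shapes, not from the original text):

```python
import numpy as np

# Spot-check tr(ABC) = tr(BCA) = tr(CAB) and the element-wise
# identity tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C).
rng = np.random.default_rng(1)
r, m, n = 2, 4, 3
A = rng.standard_normal((r, m))
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, r))
t = np.trace(A @ B @ C)
print(np.isclose(t, np.trace(B @ C @ A)), np.isclose(t, np.trace(C @ A @ B)))

X, Y, Z = (rng.standard_normal((m, n)) for _ in range(3))
print(np.isclose(np.trace(X.T @ (Y * Z)), np.trace((X * Y).T @ Z)))
```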
1.3 Trace and Derivative
Note that throughout this article, $f$ is a function that maps a scalar, vector, or matrix to a scalar. In addition, vectors are column vectors by default.
First we show the total differential of a scalar function. For a function $f(x)$ with $x\in R$, we have $df(x) = f'(x)dx$.
Second, if $x\in R^n$ is a vector, then $df(x) = \nabla f^T dx$, where $\nabla f \in R^n$ is the gradient of $f(x)$, whose $i$th component is $(\nabla f)_i=\frac{\partial f}{\partial x_i}$. Here $dx\in R^n$ is also a vector, with $(dx)_i = dx_i$. We can also rewrite this as $df(x) = \sum_i \frac{\partial f}{\partial x_i}dx_i$.
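A finite-difference check of this relation, sketched on the toy function $f(x)=x^T x$ (chosen here for illustration; its gradient is $2x$):

```python
import numpy as np

# Check df(x) ≈ ∇f^T dx for f(x) = x^T x, whose gradient is 2x.
rng = np.random.default_rng(2)
x = rng.standard_normal(5)
dx = 1e-6 * rng.standard_normal(5)

f = lambda v: v @ v
grad = 2 * x
print(np.isclose(f(x + dx) - f(x), grad @ dx))  # True up to O(||dx||^2)
```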
At last, similar to the vector case, the total differential of $f(X)$ can be written as
$$df(X) = \sum_{i,j} \frac{\partial f}{\partial X_{ij}}dX_{ij} \tag{2}$$
where $X\in R^{m\times n}$ and $f(X)$ is a scalar. Here $\frac{\partial f}{\partial X}$ and $dX$ are both matrices of shape $m\times n$, with $(\frac{\partial f}{\partial X})_{ij}=\frac{\partial f}{\partial X_{ij}}$ and $(dX)_{ij}=dX_{ij}$.
Comparing equation $(1)$ with equation $(2)$, we can see that
$$df(X) = tr\left(\left( \frac{\partial f}{\partial X} \right)^T dX \right) \tag{3}$$
Equation $(3)$ suggests that if we can express the total differential of $f$ in the form of $(3)$, then we can read off the derivative of $f$, namely $\frac{\partial f}{\partial X}$. The trick is to use the trace identities to move $dX$ to the end of the expression, just like in equation (3).
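To illustrate equation $(3)$ numerically, take the toy function $f(X)=tr(X^T X)$ (an assumed example, not one worked in the text); the same trick gives $df = tr(dX^T X)+tr(X^T dX)=tr((2X)^T dX)$, so $\frac{\partial f}{\partial X}=2X$:

```python
import numpy as np

# Check equation (3) on f(X) = tr(X^T X), where ∂f/∂X = 2X.
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 3))
dX = 1e-6 * rng.standard_normal((4, 3))

f = lambda M: np.trace(M.T @ M)
dfdX = 2 * X
print(np.isclose(f(X + dX) - f(X), np.trace(dfdX.T @ dX)))  # True
```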
1.4 Total Differential
If $f=uv$, then $df=vdu + udv$, where $u, v$ are scalars.
If $u,v\in R^n$ and $f=u^T v = \sum_{i}u_iv_i$, where $u_i$ and $v_i$ are the $i$th elements of $u$ and $v$ respectively, then
$$\begin{aligned} df &=d\left(\sum_{i}u_iv_i\right) \\ &= \sum_{i}d(u_iv_i) \\ &= \sum_{i} (u_i dv_i + v_i du_i)\\ &= \sum_{i}u_i dv_i + \sum_{i}v_i du_i \\ &= u^T dv + v^T du \end{aligned}$$
Similarly, if $U, V\in R^{m\times n}$ and $f=tr(U^T V)$, then
$$df = tr(VdU^T) + tr(U^T dV)$$
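A minimal finite-difference sketch of this matrix product rule (arbitrary shapes, for illustration only):

```python
import numpy as np

# Check df = tr(V dU^T) + tr(U^T dV) for f = tr(U^T V).
rng = np.random.default_rng(4)
U, V = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
dU, dV = 1e-6 * rng.standard_normal((4, 3)), 1e-6 * rng.standard_normal((4, 3))

df_exact = np.trace((U + dU).T @ (V + dV)) - np.trace(U.T @ V)
df_rule = np.trace(V @ dU.T) + np.trace(U.T @ dV)
print(np.isclose(df_exact, df_rule))  # True up to second-order terms
```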
1.5 Examples
(1) Linear Combination
$f(X)=a^T X b$, where $a\in R^m, b\in R^n, X\in R^{m\times n}$. We need to calculate $\frac{\partial f}{\partial X}$. Hint: $f$ is a scalar and $df$ is a scalar as well, so $df = tr(df)$.
Solution:
$$df = tr(df) = tr(d(a^T X b))=tr(a^T dX\, b)=tr(ba^T dX)$$
Comparing with equation (3), we have $(\frac{\partial f}{\partial X})^T=ba^T$, which means $\frac{\partial f}{\partial X}=ab^T$.
We can check this by the shape: since $X\in R^{m\times n}$, $\frac{\partial f}{\partial X}\in R^{m\times n}$ as well, and $ab^T\in R^{m\times n}$ can be verified easily. This shows that the result is reasonable.
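We can also check the result numerically; a short sketch with random data (shapes arbitrary):

```python
import numpy as np

# Check Example (1): ∂(a^T X b)/∂X = a b^T, via equation (3).
rng = np.random.default_rng(5)
m, n = 4, 3
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))
dX = 1e-6 * rng.standard_normal((m, n))

df = a @ (X + dX) @ b - a @ X @ b
print(np.isclose(df, np.trace(np.outer(a, b).T @ dX)))  # True
```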
(2) Least Squares
$l(w)=\lVert Xw - y\rVert^2$; find $\frac{\partial l}{\partial w}$. In the least squares method, $X\in R^{m\times n}$ is the data matrix, $w\in R^n$ is the weight vector, and $y\in R^m$ is the target label vector.
Solution:
$$l=\lVert Xw - y\rVert^2 = (Xw - y)^T (Xw - y)$$
Notice that $Xw - y$ is a vector now.
$$\begin{aligned} tr(dl) &= dl \\ &=(Xdw)^T(Xw-y)+(Xw-y)^T(Xdw)\\ &= 2(Xw-y)^T X dw \end{aligned}$$
where the two terms combine because a scalar equals its own transpose.
Then $\frac{\partial l}{\partial w} =2X^T (Xw-y)$.
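A finite-difference check of this gradient on random data (a sanity-check sketch, not part of the derivation):

```python
import numpy as np

# Check the least-squares gradient 2 X^T (Xw - y).
rng = np.random.default_rng(6)
m, n = 6, 3
X = rng.standard_normal((m, n))
w = rng.standard_normal(n)
y = rng.standard_normal(m)
dw = 1e-6 * rng.standard_normal(n)

l = lambda v: np.sum((X @ v - y) ** 2)
grad = 2 * X.T @ (X @ w - y)
print(np.isclose(l(w + dw) - l(w), grad @ dw))  # True
```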
(3) PCA
$l(w) = w^T \Sigma w + \lambda (1 - w^T w)$, where $w\in R^n$, $\Sigma\in R^{n\times n}$ is symmetric (i.e. $\Sigma^T = \Sigma$), and $\lambda$ is a scalar. Find $\frac{\partial l}{\partial w}$.
Solution:
$$\begin{aligned} dl &= tr(dl) \\ &= tr(dw^T \Sigma w) + tr(w^T\Sigma dw) - \lambda(tr(dw^T w) + tr(w^T dw)) \\ &= tr(w^T\Sigma^T dw) + tr(w^T\Sigma dw) - \lambda(tr(w^T dw) + tr(w^T dw))\\ &= tr(2(\Sigma w - \lambda w)^T dw) \end{aligned}$$
where the last step uses $\Sigma^T = \Sigma$.
Then we have $\frac{\partial l}{\partial w} = 2(\Sigma w-\lambda w)$.
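Again, a quick numerical sketch with a random symmetric $\Sigma$ (constructed here as $AA^T$ purely for illustration):

```python
import numpy as np

# Check the PCA gradient 2(Σw - λw).
rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n))
Sigma = A @ A.T                 # symmetric, as required
w = rng.standard_normal(n)
lam = 0.5
dw = 1e-6 * rng.standard_normal(n)

l = lambda v: v @ Sigma @ v + lam * (1 - v @ v)
grad = 2 * (Sigma @ w - lam * w)
print(np.isclose(l(w + dw) - l(w), grad @ dw))  # True
```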
(4) SVM
What the SVM needs to do is find the proper weight $w$ that solves
$$\min_{w} \quad \frac{1}{2}\lVert w \rVert ^2 \\ s.t.\quad 1 - y_i(w^T x_i) \le 0,\quad i=1,2,3,\cdots, M$$
where $w,x_i\in R^n$, $y_i\in \{-1,1\}$, and $M$ is the number of data points. The Lagrangian is
$$L(w,\lambda) = \frac{1}{2}\lVert w \rVert ^2 +\sum_{i=1}^M \lambda_i(1 - y_i(w^T x_i))$$
where $\lambda\in R^M$ and $\lambda_i$ is the $i$th component of $\lambda$; here $\lambda \succcurlyeq 0$, which means $\lambda_i\ge 0$.
Find $\frac{\partial L}{\partial w}$.
Solution:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^M \lambda_i y_i x_i$$
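And a last finite-difference sketch on toy data (the data, labels, and multipliers below are assumed for illustration):

```python
import numpy as np

# Check ∂L/∂w = w - Σ_i λ_i y_i x_i for the SVM Lagrangian.
rng = np.random.default_rng(8)
M, n = 5, 3
Xs = rng.standard_normal((M, n))      # rows are the x_i
ys = rng.choice([-1.0, 1.0], size=M)
lams = rng.uniform(0, 1, size=M)      # λ_i ≥ 0
w = rng.standard_normal(n)
dw = 1e-6 * rng.standard_normal(n)

L = lambda v: 0.5 * v @ v + np.sum(lams * (1 - ys * (Xs @ v)))
grad = w - (lams * ys) @ Xs
print(np.isclose(L(w + dw) - L(w), grad @ dw))  # True
```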
I find it simple enough with the trace tricks…