Neural Machine Translation Notes 3, Extension a: Matrix Calculus Foundations for Deep Learning
Preface: Matrix calculus is one of the mathematical foundations of deep learning, yet it is barely covered in undergraduate computer science (and related non-math) programs, so it must largely be self-taught. The three best resources I had seen before were the Wikipedia article Matrix Calculus, The Matrix Cookbook, and Towser's 《机器学习中的矩阵/向量求导》 (Matrix/Vector Derivatives in Machine Learning). The first two are English references that emphasize results; they work well as lookup tables, but reading them left me knowing the "what" without the "why" (the Wikipedia article does sketch some derivations, but only briefly). Towser's article is excellent, but my math is weak and the tensor-related parts still made my head spin.

A few days ago, while tidying my Weibo bookmarks, I stumbled on an article once recommended by Prof. Chen Guang of BUPT (爱可可爱生活@微博): The Matrix Calculus You Need For Deep Learning, by Terence Parr and Jeremy Howard. It fits my needs well (proving once again that bookmarks you never revisit are bookmarks wasted) and is somewhat more elementary than Towser's article. This post collects my notes from reading that paper.
Prerequisites
For the derivative of a univariate function, the following rules hold (throughout, $x$ is the independent variable):
- Constant rule: $f(x) = c \rightarrow df/dx = 0$
- Constant multiple rule: $(cf(x))' = c\frac{df}{dx}$
- Power rule: $f(x) = x^n \rightarrow \frac{df}{dx} = nx^{n-1}$
- Sum rule: $\frac{d}{dx}(f(x) + g(x)) = \frac{df}{dx} + \frac{dg}{dx}$
- Difference rule: $\frac{d}{dx}(f(x) - g(x)) = \frac{df}{dx} - \frac{dg}{dx}$
- Product rule: $\frac{d}{dx}(f(x)\cdot g(x)) = f(x)\cdot \frac{dg}{dx} + \frac{df}{dx}\cdot g(x)$
- Chain rule: letting $u = g(x)$, $\frac{d}{dx}(f(g(x))) = \frac{df(u)}{du}\cdot \frac{du}{dx}$
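The product and chain rules above can be sanity-checked numerically. This is a minimal sketch of my own (not from the paper); the helper `num_deriv` is a hypothetical central finite-difference approximation:

```python
import math

def num_deriv(f, x, h=1e-6):
    # Central finite-difference approximation of df/dx.
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**3          # f(x) = x^3,   f'(x) = 3x^2
g = lambda x: math.sin(x)   # g(x) = sin x, g'(x) = cos x
x0 = 1.3

# Product rule: (f*g)' = f*g' + f'*g
lhs = num_deriv(lambda x: f(x) * g(x), x0)
rhs = f(x0) * math.cos(x0) + 3 * x0**2 * g(x0)
assert abs(lhs - rhs) < 1e-5

# Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x)
lhs = num_deriv(lambda x: f(g(x)), x0)
rhs = 3 * g(x0)**2 * math.cos(x0)
assert abs(lhs - rhs) < 1e-5
```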
For bivariate functions we need the notion of a partial derivative. Given a function $f(x,y)$, the partial derivative with respect to $x$ or $y$ treats the other variable as a constant (for multivariate functions in general, when differentiating with respect to one variable, treat all the others as constants). The partial derivatives can be collected into the gradient, written $\nabla f(x,y)$:
$$\nabla f(x,y) = \left[\begin{matrix}\frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y}\end{matrix}\right]$$
(Note: the original paper writes the gradient as a row vector and explains that it uses numerator layout. That convention, however, makes the derivative of a scalar with respect to a vector differ in shape from the vector itself, and it also runs against the mainstream notation. These notes therefore convert everything to the mainstream denominator layout.)
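A small numerical sketch of the gradient as a column of partial derivatives, using $f(x,y) = 3x^2y$ as an illustration (the helper `grad` is my own, not from the paper):

```python
def grad(f, x, y, h=1e-6):
    # Each partial derivative treats the other variable as a constant.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [dfdx, dfdy]  # denominator layout: [df/dx, df/dy]^T

f = lambda x, y: 3 * x**2 * y
g = grad(f, 2.0, 5.0)
# Analytic gradient: [6xy, 3x^2]^T = [60, 12]^T at (2, 5)
assert abs(g[0] - 60.0) < 1e-4 and abs(g[1] - 12.0) < 1e-4
```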
Matrix Calculus
The partial derivatives of one function with respect to its different variables form the gradient; the partial derivatives of several functions can then be stacked into a matrix, called the Jacobian matrix. For example, given functions $f$ and $g$, their gradients can be combined as
$$J = \left[\begin{matrix}\nabla^\mathsf{T} f(x,y) \\ \nabla^\mathsf{T} g(x, y)\end{matrix}\right]$$
Generalizing the Jacobian
The variables can be collected into a single vector: $f(x_1, x_2, \ldots, x_n) = f(\boldsymbol{x})$. (Throughout these notes, every vector is taken to be $n \times 1$, i.e.

$$\boldsymbol{x} = \left[\begin{matrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{matrix}\right]$$

) Suppose there are $m$ functions, each mapping the vector $\boldsymbol{x}$ to a scalar:
$$\begin{aligned} y_1 &= f_1(\boldsymbol{x}) \\ y_2 &= f_2(\boldsymbol{x}) \\ &\vdots \\ y_m &= f_m(\boldsymbol{x}) \end{aligned}$$
This can be written compactly as
$$\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x})$$
The derivative of $\boldsymbol{y}$ with respect to $\boldsymbol{x}$ is the Jacobian matrix obtained by stacking each function's derivative with respect to $\boldsymbol{x}$:
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \left[\begin{matrix}\frac{\partial}{\partial x_1}f_1(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_1(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_1(\boldsymbol{x})\\ \frac{\partial}{\partial x_1}f_2(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_2(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_2(\boldsymbol{x})\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1}f_m(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_m(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_m(\boldsymbol{x})\\ \end{matrix}\right]$$
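The $m \times n$ Jacobian above can be approximated column by column with finite differences. This is a sketch of my own for checking hand-derived Jacobians, not a routine from the paper:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """m x n Jacobian of f: R^n -> R^m, by central differences.
    Column j holds the partial derivatives with respect to x_j."""
    x = np.asarray(x, dtype=float)
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

# Example: f(x) = [x1*x2, x1 + x2^2]; Jacobian = [[x2, x1], [1, 2*x2]]
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]**2])
J = jacobian(f, [2.0, 3.0])
assert np.allclose(J, [[3.0, 2.0], [1.0, 6.0]], atol=1e-4)
```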
Derivatives of Element-wise Operations on Two Vectors
Let $\bigcirc$ denote an operator applied element-wise to two vectors (for example, $\bigoplus$ for vector addition, which adds the two vectors element by element). For $\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{w}) \bigcirc \boldsymbol{g}(\boldsymbol{x})$, assuming $n = m = |\boldsymbol{y}| = |\boldsymbol{w}| = |\boldsymbol{x}|$, we can expand:
$$\left[\begin{matrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{matrix}\right] = \left[\begin{matrix} f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x}) \\ f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x}) \\ \vdots \\ f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})\end{matrix}\right]$$
Differentiating $\boldsymbol{y}$ with respect to $\boldsymbol{w}$ and $\boldsymbol{x}$ yields two square matrices:
$$J_{\boldsymbol{w}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{w}} = \left[\begin{matrix}\frac{\partial }{\partial w_1}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) \\ \frac{\partial }{\partial w_1}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial }{\partial w_1}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) \\ \end{matrix}\right]$$
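A quick numerical illustration (my own sketch, not from the paper): when $\boldsymbol{f}$ and $\boldsymbol{g}$ are the identity and $\bigcirc$ is the element-wise product, $y_i = w_i x_i$ depends only on $w_i$, so the Jacobian $\partial\boldsymbol{y}/\partial\boldsymbol{w}$ comes out diagonal, equal to $\operatorname{diag}(\boldsymbol{x})$:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    # Central-difference Jacobian of f: R^n -> R^m.
    x = np.asarray(x, dtype=float)
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# y = w (element-wise *) x; the Jacobian dy/dw is diag(x)
J_w = jacobian(lambda w_: w_ * x, w)
assert np.allclose(J_w, np.diag(x), atol=1e-4)
```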