Pytorch,Tensorflow Autograd/AutoDiff nutshells: Jacobian,Gradient,Hessian,JVP,VJP,etc

apche CN

于 2021-02-21 23:10:40 发布

阅读量2k

点赞数 2

分类专栏： 00.Tensorflow2

版权声明：本文为博主原创文章，遵循 CC 4.0 BY-SA 版权协议，转载请附上原文出处链接和本声明。

本文链接：https://blog.csdn.net/apache/article/details/113925886

版权

00.Tensorflow2 专栏收录该内容

11 篇文章

订阅专栏

Pytorch,Tensorflow Autograd/AutoDiff nutshells: Jacobian,Gradient,Hessian,JVP,VJP,etc

Pytorch，MXNet,Tensorflow,JAX框架Auograd实现原理。

1.Jacobian matrix

Jacobian matrix 矩阵,简称Jacobian,简写J,是一阶偏导矩阵。

Jacobian来自德国数学家Carl Jacobi.

Jacobian是本文的核心内容，主要推理和计算都围绕Jacobian展开。

Given：

，f称为vector-vector function，

-

-性质：

1.对于 $\vec{y}=W\vec{x}$ ， $\vec{x},\vec{y}$ 均为列向量，W为矩阵，

有 $\frac{d\vec{y}}{d\vec{x}}=W$ .

即, J=W

2. $\vec{x},\vec{y}$ 为行向量时，W为矩阵, 1也成立。

特别是：

（1）当m=1,时，f称为 vector-scalar function，Jacobian退化为行向量，也就是gradient向量。

gradient定义：

-按定义有：

-

-这个标准基向量e在求解VJP会用到。

（2）当m = n = 1，，f称为 scalar-scalar function , Jacobian退化为单一量（single entry）,即函数f的导数。

顺便：Hessian matrix

Hessian matrix 矩阵，简称Hessian，简写H, 是二阶偏导矩阵.

Hessian matrix 来自德国数学家Otto Hesse.

请注意Hessian是Jacobian函数f的m=1的特殊情形。

Hessian函数f是vector-scalar function, Jacobian对应gradient的情形,因此Hessian称为梯度的梯度。

=梯度Gradient和Hessian 关系：

=Hessian矩阵与Jacobian矩阵的关系：

函数f的Hessian矩阵是函数梯度的Jacobian矩阵。

即：H(f(x)) = J(∇f(x))

2.深度神经网络，复合函数，Autograd, 计算图CG, forward & reverse mode

深度神经网络求解过程可表示为一系列复合函数的正向传播计算函数值和反向传播计算微分.

In a Nutshell, Gradient-Based Parameter Optimization.

重要的，这些复合函数可形式化为DAG形式的计算图（CG，Computational Graph）(F. L. Bauer，1974),由此可以借助计算机自动求解微分，

这就是Automatic Differentiation，简称AutoDiff （请你记住Autograd是AutoDiff的一种流行实现, 由此导致普遍混用）。

Autograd被今天流行的深度学习框架内核广泛采用，例如Pytorch,MXNet,Tensorflow,JAX,etc.

Autograd通常由机器学习框架实现，在模型训练阶段运行，机器学习框架的用户可以看做黑箱。

为何reverse mode autodiff更常见：

autodiff ：事实上，由于矩阵乘法的结合性，计算微分可以选择正向或反向。

对于计算图CG，输入为叶节点，设维度为n；输出为根节点，设维度为m。

当输入维度远大于输出维度，如n>>m,甚至m=1，采用reverse mode高效，遍历CG的次数为m；

否则采用forward mode高效,遍历CG的次数为n。

常规的计算函数，通常满足n>>m,reverse mode更为实用。

另，没有免费午餐，采用reverse mode模式需要申请额外内存存储中间变量。

3.Autograd/AutoDiff算法实现, JVP ，VJP，HVP

AutoDiff 基于两点认识：

一是无论多复杂的函数，都可以被分解为一系列单自变量或双自变量的基本函数；二是chain rule，微分链式法则。

准备：

为计算方便，避免直接实例化Jacobian矩阵。

先回忆Jacobin矩阵。

定义 Jacobian-vector product function,JVP

-

定义 vector-Jacobian product function,VJP

-

3.1 forward mode：

前向传播时，采用Jacobian vector products (JVPs)方式计算。

以pytorch Autograd为例：

【图片来源,PyTorch Autograd:Understanding the heart of PyTorch’s magic】

J-Jacobian如前所述，V-vector为Loss损失函数的梯度，

-

则JVP为

-

容易知道，避免直接计算J,V，通常转而计算JVP。

3.2 reverse mode：

reverse mode有另外一个名字，adjoint mode;

变量的微分记作 $\bar{x}$ ,也有另外一个名字adjoint( of x).

反向传播（避免使用后向传播这个词），采用vector Jacobian products (VJPs)方式计算。

例子：

函数f及其计算图

-

-求 O w.r.t. X 的梯度

-

-

$\mathbf{e}_{i}\in \mathbb{R}^{m}$ 和J的VJP，Vector Jacobian product ，对应 Jacobian 矩阵的第i行。

这可以看做一个方向导数(梯度/法线，切线均为方向导数)，e是标准基向量。

-

因此，O w.r.t. X 梯度 可解。

求解过程可理解为计算图反向传播构造VJP序列的过程。

当m=1时，计算代价为O(n^2).

reverse-mode AutoDiff 反向微分算法：

1.Forward/building the graph: 跟踪函数,将输入函数解析为单独的算子(OP)，并建立一个计算图，其中每个节点都是一个OP。
2.Topological sort： 根据OP在计算图中的依存关系将其拓扑排序。
3.为每个Op实现VJP： 在计算图的上下文中有效计算op的导数的一种方法是使用它的VJP。
4.Backward： 使用计算图和每个节点/op的VJP，在计算图中反向传播构造VJP序列以获得任意点的导数。

以pytorch Autograd为例：

pytorch在利用计算图CG求导的过程中根节点是一个标量（否则老版本会报错，新版已改进），遍历一次CG即可求出梯度。

当根节点为一个向量的时候，会构建多个计算图CG对该向量中的每一个元素分别进行求导。

pytorch计算框架采用动态计算图DCG，一次backward计算之后DCG立刻被删除,下次重建。

特定情况下，如果3.3双反向传播，框架支持保留CG，在调用backward设置参数retain_graph=True，默认为Flase，CG被保存不会立即释放。

3.3 double backpropagation： HVP，Hessian Vector Product

双反向传播，采用HVP方式计算。

4.工程实现

HIPS autograd

https://github.com/hips/autograd

mattjj autodidact

https://github.com/mattjj/autodidact

AutoDiff-from-scratch

https://github.com/titaneric/AutoDiff-from-scratch

Pytorch

https://github.com/pytorch/pytorch/tree/master/torch/autograd

twitter / torch-autograd

https://github.com/twitter/torch-autograd

google / jax

https://github.com/google/jax/blob/master/jax/core.py

5.实验和验证

参考pytorch autograd的各种使用案例。不赘。

=============================================

Reference：

=========================

HIPS autograd https://github.com/hips/autograd

mattjj autodidact https://github.com/mattjj/autodidact

AutoDiff-from-scratch https://github.com/titaneric/AutoDiff-from-scratch

Building A Mental Model for Backpropagation
https://towardsdatascience.com/building-a-mental-model-for-backpropagation-987ac74d1821

Building A Basic Computational Graph Engine

http://alexminnaar.com/2018/07/14/simple-computational-graph-engine.html

Pytorch 相关

Autograd source code pytorch/torch/autograd/

https://github.com/pytorch/pytorch/tree/master/torch/autograd

https://www.pytorchtutorial.com/pytorch-backward/

Understanding Autograd: 5 Pytorch tensor functions

https://medium.com/@namanphy/understanding-autograd-5-pytorch-tensor-functions-8f47c27dc38

A Gentle Introduction to torch.autograd

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

PyTorch Autograd：Understanding the heart of PyTorch’s magic

https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95

中文，https://blog.csdn.net/OrdinaryMatthew/article/details/113728497

--

wiki, Jacobian matrix

https://math.wikia.org/wiki/Jacobian_matrix

What is the Jacobian matrix?

https://math.stackexchange.com/questions/14952/what-is-the-jacobian-matrix

Difference between gradient and Jacobian

https://math.stackexchange.com/questions/1519367/difference-between-gradient-and-jacobian

==

矩阵求导术(上)(下)

https://zhuanlan.zhihu.com/p/24709748

https://zhuanlan.zhihu.com/p/24863977

如何理解矩阵对矩阵求导？

https://www.zhihu.com/question/39523290

矩阵求导、几种重要的矩阵及常用的矩阵求导公式

https://blog.csdn.net/daaikuaichuan/article/details/80620518

==

pytorch中 backward 机制理解

https://blog.csdn.net/baidu_36161077/article/details/81435627

https://blog.csdn.net/qq_27825451/article/details/89393332

PyTorch中关于backward、grad、autograd的计算原理的深度剖析

https://blog.csdn.net/weixin_42782150/article/details/106116082

Computational Graph（计算图）

https://blog.csdn.net/zxl55/article/details/83537144

pytorch中backward()函数详解

https://blog.csdn.net/sinat_28731575/article/details/90342082

PyTorch 的 backward 为什么有一个 grad_variables 参数？

https://zhuanlan.zhihu.com/p/29923090

Ref：

F. L. Bauer,Computational Graphs and Rounding Error, 1974, SIAM J. Numer. Anal., 11(1), 87–96. (10 pages)

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。