Differentiable and Non-Differentiable Operations in Neural Networks

hnshahao

Category: Deep Learning Basics

Article link: https://blog.csdn.net/hnshahao/article/details/80937453

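So far I have run into the question of differentiable versus non-differentiable operations in neural networks in two places. The first is Karpathy's article on Policy Gradients, which devotes a section to it.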

【1: Non-differentiable computation in Neural Networks - Andrej Karpathy】

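The title states the point clearly: it is about non-differentiable computation in neural networks. Two things follow directly from it:

(1) Non-differentiability is a property of a computation (an operation) in the network, not of the data flowing through it. A neural network consists of operations and data, where the data includes both the values being computed and the network's weights.

(2) There are many kinds of computation, such as dot product, max pooling, sampling, and ReLU. The non-differentiable one discussed here is mainly sampling.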

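Sampling cannot be differentiated because the operation is random. The intuitive explanation given in the article is that "we don't know what would have happened if we sampled a different location". In other words, if the sample had come out as a different value, we would have no way to evaluate the effect of that value, so we cannot tell whether a different sample would have been better or worse.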

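By contrast, the effect of operations such as dot product and max pooling is fully determined by their inputs, whereas the result of sampling is not. That is why Policy Gradient is used: several samples are drawn and evaluated, and the policy gradient is then used to drive the parameter update.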

Here is an illustration from the article:

Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.

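In the figure, the blue arrows are differentiable operations and the red arrow is a non-differentiable one. We can backpropagate through the blue arrows as usual, but we cannot backpropagate directly through the red arrow.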

Policy gradients to the rescue! We’ll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we’ll encourage samples that eventually led to good outcomes (in this case for example measured by the loss at the end). In other words we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was also recently formalized nicely in Gradient Estimation Using Stochastic Computation Graphs.

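The solution is policy gradients. During training we draw several samples (the branches in the figure) and encourage the samples that led to good outcomes. The parameters on the blue arrows are trained with ordinary backprop, while the parameters on the red arrow are updated independently of the backward pass, using policy gradients, so that samples with low loss are encouraged.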
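The update for the red arrow can be sketched with the score-function (REINFORCE) estimator. This is a minimal pure-Python sketch and not code from Karpathy's article: the function names, the three-way categorical choice, and the fixed loss attached to each choice are all illustrative assumptions. It draws several samples, weights the gradient of each sample's log-probability by the loss that sample received, and compares the average with the exact gradient of the expected loss, which is only computable here because the toy problem is so small.

```python
import math
import random


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, losses):
    """Gradient of E[loss] = sum_k p_k * loss_k with respect to each logit."""
    probs = softmax(logits)
    expected = sum(p * l for p, l in zip(probs, losses))
    return [p * (l - expected) for p, l in zip(probs, losses)]


def score_function_gradient(logits, losses, num_samples, rng):
    """Estimate the same gradient from samples only, without differentiating the sampler."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        # The non-differentiable step: draw one category k.
        k = rng.choices(range(len(probs)), weights=probs)[0]
        # d log p_k / d logit_j = 1[j == k] - p_j, weighted by the loss of this sample.
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            grad[j] += losses[k] * (indicator - probs[j])
    return [g / num_samples for g in grad]


# Hypothetical toy setup: three choices, each with a fixed loss.
logits = [0.0, 1.0, -1.0]
losses = [1.0, 0.2, 3.0]
estimate = score_function_gradient(logits, losses, 20000, random.Random(0))
exact = exact_gradient(logits, losses)
```

Stepping the logits against `estimate` lowers the probability of high-loss choices and raises that of low-loss ones, which is the "encourage samples that led to low loss" behaviour described above. The estimate is noisy, so in practice many samples (or a baseline) are needed.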

【2: Analysis in the VAE tutorial】

The second place is the VAE article, Tutorial on Variational Autoencoders.

The original wording is: "The forward pass of this network works fine and, if the output is averaged over many samples of X and z, produces the correct expected value. However, we need to back-propagate the error through a layer that samples z from Q(z|X), which is a non-continuous operation and has no gradient. Stochastic gradient descent via backpropagation can handle stochastic inputs, but not stochastic units within the network."

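To summarize this analysis: one layer of the network performs sampling, so it cannot be differentiated. The reasons given are:

(1) Sampling is a non-continuous operation. This explanation is reasonable, although "continuous" here is probably not the same notion as a continuous function in calculus.

(2) Gradient descent can handle stochastic inputs, but not stochastic units inside the network itself; "stochastic unit" here presumably means a stochastic operation. This sentence is the key point: a network can still be differentiated when its input data is random, but not when one of its operations is random.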
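The difference between a stochastic input and a stochastic unit can be made concrete with a small pure-Python sketch. It is an illustration under assumptions, not code from the tutorial: the function names and the Gaussian example are hypothetical, and derivatives are taken by finite differences. The second function moves the randomness into an input `eps`, which is the rewriting the cited tutorial goes on to call the reparameterization trick, a topic this post does not otherwise cover.

```python
import random


def sample_unit(mu, sigma, rng):
    """Stochastic unit: the randomness is generated inside the operation."""
    return rng.gauss(mu, sigma)


def noise_as_input(mu, sigma, eps):
    """Deterministic operation: the randomness arrives as an input eps ~ N(0, 1)."""
    return mu + sigma * eps


def finite_difference(f, x, h=1e-6):
    """Central finite-difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # drawn once, then treated as ordinary input data
mu, sigma = 0.5, 2.0

# With the noise held fixed as an input, the output is a smooth function of mu and sigma.
dz_dmu = finite_difference(lambda m: noise_as_input(m, sigma, eps), mu)
dz_dsigma = finite_difference(lambda s: noise_as_input(mu, s, eps), sigma)

# With the noise generated inside the unit, each call draws fresh randomness,
# so the same finite difference just measures the gap between two unrelated draws.
noisy = finite_difference(lambda m: sample_unit(m, sigma, rng), mu)
```

Here `dz_dmu` comes out close to 1 and `dz_dsigma` close to `eps`, while `noisy` is dominated by the sampling noise divided by a tiny step and says nothing about how the output depends on `mu`.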

【3: Personal summary】

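In summary, the operation in a network that cannot be differentiated is sampling. The sample may be drawn from an explicit probability distribution, such as the probabilities of the individual actions, or from a distribution described by parameters, as in the VAE. What makes it non-differentiable is that the direct link between input and output is broken: a change in the input is not immediately reflected in the output. In that sense it is not a continuous operation.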

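As long as the operation is fixed, as with max pooling, there is always a deterministic selection process no matter how randomly the input data varies. Sampling is different: it draws from a probability distribution, and the drawn result has no direct correspondence with the distribution or with the parameters that describe it. My impression is that this lack of a direct correspondence is what makes it non-differentiable; in other words, the input and the output are not directly related. Max pooling, by contrast, selects among its input values, so input and output are directly related: changing the input value immediately affects the output. That is not the case for sampling.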
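The contrast between max pooling and sampling can be checked numerically. The following is a minimal pure-Python sketch with hypothetical names and values: it nudges each input by a small amount and measures how the output moves, once for a max over three values and once for a categorical draw made by inverse-CDF lookup with the random number `u` held fixed (an assumption made so that the draw is repeatable).

```python
import math


def max_pool(values):
    """Max pooling over a single window."""
    return max(values)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, u):
    """Draw a category by inverse-CDF lookup with a fixed uniform number u in [0, 1)."""
    cumulative = 0.0
    probs = softmax(logits)
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1


def finite_difference(f, xs, i, h=1e-4):
    """Central finite difference of f with respect to the i-th input."""
    up = list(xs)
    up[i] += h
    down = list(xs)
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)


values = [0.3, 2.0, -1.0]

# Max pooling: the output follows the selected (largest) input one-for-one.
max_grad = [finite_difference(max_pool, values, i) for i in range(len(values))]

# Sampling: the drawn index does not move when the logits are nudged.
sample_grad = [
    finite_difference(lambda l: sample_index(l, 0.5), values, i)
    for i in range(len(values))
]
```

For max pooling the slope is about 1 for the input that was selected and 0 for the others, so a gradient does flow back to the winning input. For the sampled index the slope is 0 for every logit: the index is piecewise constant in the distribution's parameters and only jumps when `u` crosses a boundary, so it carries no usable gradient.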

[References]

https://arxiv.org/abs/1606.05908

http://karpathy.github.io/2016/05/31/rl/

The article also mentions Hard attention; to be looked at later, together with attention.