Sequence Modeling: Recurrent and Recursive Nets

IgorW

Category: Deep Learning. Tags: Deep Learning

Article link: https://blog.csdn.net/github_29374279/article/details/52071044

@(Deep Learning)

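Recurrent neural networks (RNNs) are used to process sequential data.
Sharing parameters across different parts of a model lets the model apply to examples of different sizes.
Parameter sharing lets the model extend to, and generalize across, examples of different forms (lengths).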

Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.

Unfolding

$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta)$
The current state contains information about the entire past sequence.

$h^{(t)} = f(h^{(t-1)},x^{(t)};\theta)$

The input features $x^{(t)}$ and the hidden units from the previous step, $h^{(t-1)}$, are transformed by the function $f$ into the new hidden state $h^{(t)}$.

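The unfolded recurrence can be written as:

$h^{(t)} = g^{(t)}(x^{(t)},x^{(t-1)},\dots,x^{(2)},x^{(1)}) = f(h^{(t-1)},x^{(t)};\theta)$
The function $g^{(t)}$ takes the whole past sequence as input, but it can be rewritten in the recursive form $f$, which handles inputs $x$ of any length. The network learns $f$ rather than $g^{(t)}$. Each $g^{(t)}$ applies only to inputs of one fixed length, so a separate model would be needed for every sequence length, each with its own parameters to estimate. The function $f$, in contrast, handles sequences of different lengths with shared weights, which speeds up training and makes it easier.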
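As a minimal sketch of this idea, the snippet below applies one transition function with one parameter set to sequences of different lengths. The tanh-affine form of `f`, the sizes, and the names `unfold` and `theta` are assumptions chosen for illustration, not something fixed by the text above.

```python
import numpy as np

def f(h_prev, x_t, theta):
    """One transition step. The tanh-affine form is an illustrative assumption."""
    W, U, b = theta
    return np.tanh(W @ h_prev + U @ x_t + b)

def unfold(xs, h0, theta):
    """Apply the same f with the same theta at every time step."""
    h = h0
    for x_t in xs:  # the loop length follows the sequence length
        h = f(h, x_t, theta)
    return h

rng = np.random.default_rng(0)
theta = (rng.normal(scale=0.5, size=(4, 4)),  # W: hidden-to-hidden
         rng.normal(scale=0.5, size=(4, 3)),  # U: input-to-hidden
         np.zeros(4))                         # b: bias
h0 = np.zeros(4)

h_short = unfold(rng.normal(size=(2, 3)), h0, theta)  # 2 time steps
h_long = unfold(rng.normal(size=(7, 3)), h0, theta)   # 7 time steps
```

Both calls reuse the same `theta`, and both return a state vector of the same size regardless of how many steps were consumed.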

The unfolded graph applies the same function $f$ with the same parameters $\theta$ at every time step.

Recurrent Neural Networks

Hidden-to-hidden recurrence:

The RNN maps an input sequence $\bf x$ to an output sequence $\bf o$. The loss $L$ measures the difference between the prediction $\bf o$ and the target $\bf y$. Input-to-hidden connections are parameterized by $U$, hidden-to-hidden connections by $W$, and hidden-to-output connections by $V$.
For $t = 1$ to $t = \tau$:
$a^{(t)} = b + Wh^{(t-1)} + Ux^{(t)}$
$h^{(t)} = \tanh(a^{(t)})$
$o^{(t)} = c + Vh^{(t)}$
$\hat y^{(t)} = \mathrm{softmax}(o^{(t)})$
$b$ and $c$ are bias vectors
$U$: input-to-hidden
$W$: hidden-to-hidden
$V$: hidden-to-output
If $L^{(t)}$ is the negative log-likelihood of $y^{(t)}$, then:
$L(\{ x^{(1)},\dots,x^{(\tau)}\},\{ y^{(1)},\dots,y^{(\tau)}\}) = \sum_t L^{(t)} = -\sum_t \log \hat y_{y^{(t)}}^{(t)}$
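The forward equations and the summed negative log-likelihood can be sketched in NumPy as follows. The layer sizes, the random initialization, and the function name `rnn_forward` are assumptions for illustration; targets are assumed to be integer class indices.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, ys, params, h0):
    """Run t = 1..tau and accumulate L = sum_t -log y_hat^{(t)}[y^{(t)}]."""
    U, W, V, b, c = params
    h, hs, y_hats, loss = h0, [], [], 0.0
    for x_t, y_t in zip(xs, ys):
        a = b + W @ h + U @ x_t       # a^{(t)}
        h = np.tanh(a)                # h^{(t)}
        o = c + V @ h                 # o^{(t)}
        y_hat = softmax(o)            # y_hat^{(t)}
        loss += -np.log(y_hat[y_t])   # L^{(t)}
        hs.append(h)
        y_hats.append(y_hat)
    return loss, hs, y_hats

rng = np.random.default_rng(0)
n_in, n_h, n_out, tau = 3, 5, 4, 6  # illustrative sizes
params = (rng.normal(scale=0.5, size=(n_h, n_in)),   # U
          rng.normal(scale=0.5, size=(n_h, n_h)),    # W
          rng.normal(scale=0.5, size=(n_out, n_h)),  # V
          np.zeros(n_h), np.zeros(n_out))            # b, c
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
loss, hs, y_hats = rnn_forward(xs, ys, params, np.zeros(n_h))
```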

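Computing the gradient requires a forward pass from left to right through the unrolled graph, followed by a backward pass from right to left. Because each step depends on the previous one, the time complexity is at best $O(\tau)$. Every state computed in the forward pass must be stored for the backward pass, so the space complexity is also $O(\tau)$.

Applying back-propagation to the unrolled graph is called back-propagation through time (BPTT).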

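Log-likelihood cost function:

$C = -\ln a_y^L$
Why does this work as a cost function? An intuitive explanation: in digit recognition, if the input digit is 7, $a^L_y$ is the probability the network assigns to the input being 7. If the network works correctly, $a^L_y$ approaches 1 and the cost $C$ approaches 0. Conversely, if the network assigns the correct digit a probability close to 0, the cost becomes very large.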
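A small numeric check of this intuition, using made-up probabilities (0.99 and 0.01 are illustrative values, not results from any trained network):

```python
import math

def nll_cost(prob_of_correct_class):
    """C = -ln(a_y^L): cost given the probability assigned to the true class."""
    return -math.log(prob_of_correct_class)

confident_right = nll_cost(0.99)  # the true class gets almost all the mass
confident_wrong = nll_cost(0.01)  # the true class gets almost no mass
```

The first cost is close to 0 and the second is several hundred times larger, which is the behavior described above.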

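Output-to-hidden recurrence:

This architecture is less powerful.
The only recurrent connections run from the output to the hidden layer. Because there are no hidden-to-hidden connections, the model has less representational power.
Without hidden-to-hidden connections, the output units must carry all the information about the past that the network needs. The output units, however, are trained to match the targets, so they are unlikely to capture the full history unless the full state of the system is known and provided as part of the training targets.
The advantage of eliminating hidden-to-hidden recurrence is that all time steps are decoupled, so the gradient computation can be parallelized. (My guess: gradients propagate backward from the output layer, and because hidden layers at different time steps are not connected in this architecture, a change in the hidden layer at one step does not affect the hidden layer at the next. Once the output at a time step has been computed, its gradient can be computed independently.)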

Teacher Forcing
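During training, the correct target from the previous time step is connected to the hidden layer. At test time, the model's own prediction from the previous time step is connected instead.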
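The difference between the two phases can be sketched as below for an output-to-hidden network. The one-hot feedback encoding, the greedy `argmax` prediction, and the function name `run` are assumptions made for this illustration.

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def run(xs, params, targets=None):
    """Output-to-hidden recurrence.

    With targets: teacher forcing, the true y^{(t-1)} is fed back.
    Without targets: the model's own previous prediction is fed back.
    """
    U, W, V, b, c = params
    n_out = V.shape[0]
    prev = np.zeros(n_out)  # nothing to feed back at the first step
    preds = []
    for t, x_t in enumerate(xs):
        h = np.tanh(b + W @ prev + U @ x_t)
        o = c + V @ h
        y_pred = int(np.argmax(o))
        preds.append(y_pred)
        fed_back = targets[t] if targets is not None else y_pred
        prev = one_hot(fed_back, n_out)
    return preds

rng = np.random.default_rng(0)
n_in, n_h, n_out, tau = 3, 5, 4, 6  # illustrative sizes
params = (rng.normal(size=(n_h, n_in)),   # U: input-to-hidden
          rng.normal(size=(n_h, n_out)),  # W: output-to-hidden
          rng.normal(size=(n_out, n_h)),  # V: hidden-to-output
          np.zeros(n_h), np.zeros(n_out))
xs = rng.normal(size=(tau, n_in))
targets = [int(v) for v in rng.integers(0, n_out, size=tau)]

train_preds = run(xs, params, targets=targets)  # training phase
test_preds = run(xs, params)                    # test phase
```

With teacher forcing, each step depends only on known targets and inputs, which is why the time steps can be processed independently during training.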

A model that produces only a single output

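Computing the gradient in an RNN

For each node $N$, the gradient $\nabla_N L$ is computed recursively, following the flow of the graph from the nodes that come after $N$.
Gradient at the final loss nodes: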

$\frac{\partial L}{\partial L^{(t)}} = 1$

$(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o^{(t)}_i}=\frac{\partial L}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial o^{(t)}_i} =\hat y^{(t)}_i - \mathbf{1}_{i,y^{(t)}}$
where $L^{(t)} = -\log \hat y^{(t)}_{y^{(t)}}$ and $\hat y^{(t)} = \mathrm{softmax}(o^{(t)})$
Working backward from the end of the sequence, at the final time step $\tau$:

$\nabla_{h^{(\tau)}} L = (\nabla_{o^{(\tau)}} L) \frac{\partial o^{(\tau)}}{\partial h^{(\tau)}} = (\nabla_{o^{(\tau)}} L)V$
We then iterate backward through time. Note that for $t < \tau$, $h^{(t)}$ has both $o^{(t)}$ and $h^{(t+1)}$ as descendants, so its gradient sums both contributions (chain rule for composite functions):

$\nabla_{h^{(t)}} L = (\nabla_{h^{(t+1)}} L) \frac{\partial h^{(t+1)}}{\partial h^{(t)}} + (\nabla_{o^{(t)}} L) \frac{\partial o^{(t)}}{\partial h^{(t)}}$
$= (\nabla_{h^{(t+1)}} L)\, \mathrm{diag}(1-(h^{(t+1)})^2)\, W +(\nabla_{o^{(t)}} L)V$
Gradients with respect to the parameters:

$\nabla_c L= \sum_t(\nabla_{o^{(t)}} L) \frac{\partial o^{(t)}}{\partial c}=\sum_t \nabla_{o^{(t)}} L$
$\nabla_b L = \sum_t (\nabla_{h^{(t)}} L)\frac{\partial h^{(t)}}{\partial b} = \sum_t (\nabla_{h^{(t)}} L)\, \mathrm{diag}(1-(h^{(t)})^2)$
$\nabla_V L = \sum_t(\nabla_{o^{(t)}} L) \frac{\partial o^{(t)}}{\partial V} = \sum_t(\nabla_{o^{(t)}} L) (h^{(t)})^T$
$\nabla_W L = \sum_t (\nabla_{h^{(t)}} L)\frac{\partial h^{(t)}}{\partial W} = \sum_t (\nabla_{h^{(t)}} L)\, \mathrm{diag}(1-(h^{(t)})^2)(h^{(t-1)})^T$
$\nabla_U L = \sum_t (\nabla_{h^{(t)}} L)\, \mathrm{diag}(1-(h^{(t)})^2)(x^{(t)})^T$
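The derivation can be checked numerically with a small BPTT sketch. The code below treats gradients as column vectors, so the products appear as `V.T @ do` and `W.T @ da` rather than in the row-vector form used in the formulas; sizes, initialization, and the names `forward` and `bptt` are assumptions for illustration. The last lines compare one analytic entry of $\nabla_W L$ with a central finite difference.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, p, h0):
    hs, y_hats, loss = [h0], [], 0.0
    for x, y in zip(xs, ys):
        h = np.tanh(p["b"] + p["W"] @ hs[-1] + p["U"] @ x)
        y_hat = softmax(p["c"] + p["V"] @ h)
        loss -= np.log(y_hat[y])
        hs.append(h)
        y_hats.append(y_hat)
    return loss, hs, y_hats

def bptt(xs, ys, p, h0):
    loss, hs, y_hats = forward(xs, ys, p, h0)
    g = {k: np.zeros_like(v) for k, v in p.items()}
    dh_next = np.zeros_like(h0)        # contribution arriving from h^{(t+1)}
    for t in reversed(range(len(xs))):
        do = y_hats[t].copy()
        do[ys[t]] -= 1.0               # y_hat - one_hot(y)
        h, h_prev = hs[t + 1], hs[t]
        dh = p["V"].T @ do + dh_next   # paths through o^{(t)} and h^{(t+1)}
        da = (1.0 - h ** 2) * dh       # multiply by diag(1 - h^2)
        g["c"] += do
        g["V"] += np.outer(do, h)
        g["b"] += da
        g["W"] += np.outer(da, h_prev)
        g["U"] += np.outer(da, xs[t])
        dh_next = p["W"].T @ da        # passed on to h^{(t-1)}
    return loss, g

rng = np.random.default_rng(0)
n_in, n_h, n_out, tau = 3, 4, 5, 6     # illustrative sizes
p = {"U": rng.normal(scale=0.5, size=(n_h, n_in)),
     "W": rng.normal(scale=0.5, size=(n_h, n_h)),
     "V": rng.normal(scale=0.5, size=(n_out, n_h)),
     "b": np.zeros(n_h), "c": np.zeros(n_out)}
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
h0 = np.zeros(n_h)
loss, grads = bptt(xs, ys, p, h0)

# Central finite difference on a single entry of W
eps = 1e-5
p["W"][1, 2] += eps
loss_plus = forward(xs, ys, p, h0)[0]
p["W"][1, 2] -= 2 * eps
loss_minus = forward(xs, ys, p, h0)[0]
p["W"][1, 2] += eps                    # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
```

Note that the forward pass keeps every hidden state in `hs`, which is the $O(\tau)$ memory cost of BPTT.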

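Parameter sharing in recurrent networks relies on the assumption that the same parameters can be used at different time steps.

The sequence-length problem: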

In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence.
When this special symbol is generated, sampling stops. During training, the symbol is appended to the end of every sequence.
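A minimal sketch of the end-of-sequence approach follows. The symbol name `<eos>`, the tiny vocabulary, and `toy_next_token_probs` (a stand-in for the RNN's softmax output whose stop probability grows with length) are all hypothetical and exist only to make the example runnable.

```python
import random

EOS = "<eos>"                    # hypothetical end-of-sequence symbol
VOCAB = ["a", "b", "c", EOS]

def toy_next_token_probs(history):
    """Stand-in for the RNN output distribution (an assumption for this demo)."""
    p_eos = min(1.0, 0.1 * (len(history) + 1))
    p_other = (1.0 - p_eos) / 3
    return [p_other, p_other, p_other, p_eos]

def sample_sequence(rng, max_len=50):
    """Sample tokens until the end-of-sequence symbol is drawn."""
    history = []
    while len(history) < max_len:
        token = rng.choices(VOCAB, weights=toy_next_token_probs(history))[0]
        if token == EOS:
            break                # sampling stops at the special symbol
        history.append(token)
    return history

def add_eos(training_sequence):
    """Append the end-of-sequence symbol to a training sequence."""
    return list(training_sequence) + [EOS]

sample = sample_sequence(random.Random(0))
padded = add_eos(["a", "b"])
```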

Another option is to introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step.
In other words, the model emits a Bernoulli output that decides whether to continue or stop.

Another way to determine the sequence length $\tau$ is to add an extra output to the model that predicts the integer $\tau$ itself. The model can sample a value of $\tau$ and then sample $\tau$ steps worth of data.
The length is then fed to the RNN as an extra input.

Modeling Sequences Conditioned on Context with RNNs

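Any model representing $P(y;\theta)$ can be reinterpreted as a model representing a conditional distribution $P(y|\omega)$ with $\omega = \theta$.
When a single vector $\bf x$ is the input, it can be provided to the RNN as an extra input in one of these ways:
- as an extra input at every time step
- as the initial state $h^{(0)}$
- both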

The input is a fixed-length vector $\bf x$. In image captioning, for example, the input is an image and the output is a sequence of words describing it.
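The options listed above can be sketched as follows. The matrix names `R` (maps $\bf x$ into every step) and `H0` (maps $\bf x$ to the initial state), the one-hot previous outputs, and all sizes are assumptions introduced for this illustration.

```python
import numpy as np

def conditioned_rnn(x, ys_onehot, p, as_input=True, as_init=False):
    """Hidden states of an RNN conditioned on a fixed-length vector x."""
    n_h = p["W"].shape[0]
    # Option 2: x determines the initial state h^{(0)}
    h = np.tanh(p["H0"] @ x) if as_init else np.zeros(n_h)
    hs = []
    for y_prev in ys_onehot:
        a = p["b"] + p["W"] @ h + p["U"] @ y_prev
        if as_input:
            a = a + p["R"] @ x   # Option 1: the same x enters at every step
        h = np.tanh(a)
        hs.append(h)
    return hs

rng = np.random.default_rng(1)
n_h, n_x, n_y = 4, 3, 5          # illustrative sizes
p = {"W": rng.normal(scale=0.5, size=(n_h, n_h)),
     "U": rng.normal(scale=0.5, size=(n_h, n_y)),
     "R": rng.normal(scale=0.5, size=(n_h, n_x)),
     "H0": rng.normal(scale=0.5, size=(n_h, n_x)),
     "b": np.zeros(n_h)}
ys_onehot = np.eye(n_y)[[0, 2, 1]]   # three previous output symbols
x_a, x_b = rng.normal(size=n_x), rng.normal(size=n_x)

hs_a = conditioned_rnn(x_a, ys_onehot, p)
hs_b = conditioned_rnn(x_b, ys_onehot, p)
hs_both = conditioned_rnn(x_a, ys_onehot, p, as_input=True, as_init=True)
```

The same output history with a different context vector leads to different hidden states, which is what conditioning on $\bf x$ means here.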

Encoder-Decoder Sequence-to-Sequence Architectures

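This section covers how to train an RNN to map an input sequence to an output sequence that need not have the same length. In speech recognition, machine translation, and question answering, the input and output sequences usually differ in length.

The input is usually called the "context". We want to produce a representation of this context, $C$, which may be a vector or a sequence of vectors.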

The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence. The encoder emits the context $C$, usually as a simple function of its final hidden state. (2) a decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence $Y= (y^{(1)}, \dots , y^{(n_y)})$.
$C$ is a fixed-length context that summarizes the semantics of the input sequence.
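A minimal sketch of the two stages, assuming that $C$ is simply the encoder's final hidden state, that the decoder starts from $C$ as its initial state, and that decoding is greedy for a fixed number of steps. The parameter names and sizes are illustrative.

```python
import numpy as np

def encode(xs, W, U, b):
    """Encoder RNN: read the whole input and return the context C."""
    h = np.zeros(W.shape[0])
    for x in xs:
        h = np.tanh(b + W @ h + U @ x)
    return h  # C, taken here to be the final hidden state

def decode(C, W, U, V, c, n_steps):
    """Decoder RNN: start from C and emit n_steps output symbols greedily."""
    n_out = V.shape[0]
    h, prev, tokens = C, np.zeros(n_out), []
    for _ in range(n_steps):
        h = np.tanh(W @ h + U @ prev)
        token = int(np.argmax(c + V @ h))
        tokens.append(token)
        prev = np.eye(n_out)[token]  # feed the emitted symbol back in
    return tokens

rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 4, 6  # illustrative sizes
enc = (rng.normal(scale=0.5, size=(n_h, n_h)),
       rng.normal(scale=0.5, size=(n_h, n_in)),
       np.zeros(n_h))
dec = (rng.normal(scale=0.5, size=(n_h, n_h)),
       rng.normal(scale=0.5, size=(n_h, n_out)),
       rng.normal(scale=0.5, size=(n_out, n_h)),
       np.zeros(n_out))

C_short = encode(rng.normal(size=(3, n_in)), *enc)  # input of length 3
C_long = encode(rng.normal(size=(8, n_in)), *enc)   # input of length 8
tokens = decode(C_long, *dec, n_steps=5)            # output of length 5
```

The context has the same size for both input lengths, and the output length is chosen independently of the input length.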

Recursive Neural Networks

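Recursive neural networks are another generalization of recurrent networks with a different kind of computational graph: the structure is a deep tree, whereas a recurrent network has a chain-like structure.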
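To make the tree structure concrete, the sketch below composes leaf vectors bottom-up with one shared set of weights. The nested-tuple tree encoding, the separate left and right matrices, and the example words are assumptions for illustration.

```python
import numpy as np

def compose(tree, leaf_vectors, W_left, W_right, b):
    """tree is either a leaf name (str) or a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        return leaf_vectors[tree]
    left, right = tree
    l = compose(left, leaf_vectors, W_left, W_right, b)
    r = compose(right, leaf_vectors, W_left, W_right, b)
    # The same weights are shared by every internal node of the tree
    return np.tanh(W_left @ l + W_right @ r + b)

rng = np.random.default_rng(0)
dim = 4  # illustrative size
leaf_vectors = {w: rng.normal(size=dim) for w in ("the", "cat", "sat", "down")}
W_left = rng.normal(scale=0.5, size=(dim, dim))
W_right = rng.normal(scale=0.5, size=(dim, dim))
b = np.zeros(dim)

tree = (("the", "cat"), ("sat", "down"))  # balanced tree over four leaves
root = compose(tree, leaf_vectors, W_left, W_right, b)
```

For this balanced tree, the root is two compositions away from every leaf, whereas a chain over the same four inputs would need four steps.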

The Challenge of Long-Term Dependencies

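The basic problem: gradients propagated over many steps tend to either vanish (most of the time) or explode (rarely).

A recurrent network composes the same function with itself repeatedly, once per time step, which can produce extremely nonlinear behavior.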
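A simplified linear illustration of this effect: drop the nonlinearity and the inputs, and multiply a vector by the same matrix 50 times. Using a scaled orthogonal matrix is an assumption that makes the growth rate exact (every singular value equals the scale factor).

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
h0 = rng.normal(size=5)

def norm_after(scale, steps):
    """Norm of h after repeatedly applying W^T, where W = scale * Q."""
    W = scale * Q
    h = h0.copy()
    for _ in range(steps):
        h = W.T @ h
    return np.linalg.norm(h)

base = np.linalg.norm(h0)
shrunk = norm_after(0.9, 50) / base  # equals 0.9 ** 50, about 0.005
grown = norm_after(1.1, 50) / base   # equals 1.1 ** 50, about 117
```

A factor only slightly below or above 1 per step is enough to shrink or blow up a signal by orders of magnitude over 50 steps.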

Echo State Networks

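Consider multiplying a state vector $h$ by the weight matrix $W$: the updated state $W^Th$ may have a larger or smaller norm than the previous state $h$. If $W^T$ always shrinks $h$, the linear map $W^T$ is said to be contractive.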

Echo state networks are models whose weights are chosen to make the dynamics of forward propagation barely contractive

One option: set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This approach is known as reservoir computing.
Reservoir computing is similar to kernel machines: it maps an arbitrary-length sequence (the history of inputs up to time $t$) to a fixed-length vector (the recurrent state $h^{(t)}$).

The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state?
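The literature suggests viewing the recurrent network as a dynamical system and setting the input and recurrent weights so that the dynamical system is close to the edge of stability.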
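A rough sketch of this recipe: fix random input and recurrent weights, rescale the recurrent matrix to a chosen spectral radius, and fit only the linear readout by least squares. The spectral radius 0.9, the reservoir size, the sine-wave next-step prediction task, and the washout length are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout = 50, 50  # illustrative reservoir size and warm-up length

# Fixed (untrained) recurrent weights, rescaled to spectral radius 0.9
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
U = rng.uniform(-0.5, 0.5, size=n_res)  # fixed input weights

signal = np.sin(0.2 * np.arange(400))
h, states = np.zeros(n_res), []
for x in signal[:-1]:
    h = np.tanh(W @ h + U * x)  # the reservoir state summarizes the input history
    states.append(h)
states = np.array(states)[washout:]  # drop the initial transient
targets = signal[1:][washout:]       # task: predict the next value

# Only the output weights are learned, here by ordinary least squares
V, *_ = np.linalg.lstsq(states, targets, rcond=None)
mse = np.mean((states @ V - targets) ** 2)
radius = np.max(np.abs(np.linalg.eigvals(W)))
```

Because the recurrent part is never trained, learning reduces to a linear regression on the collected states.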

Leaky Units and Other Strategies for Multiple Time Scales

Strategies for handling multiple time scales include “leaky units” that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.
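A leaky unit can be sketched as a running average $\mu^{(t)} = \alpha \mu^{(t-1)} + (1-\alpha) v^{(t)}$, where $\alpha$ sets the time constant. The impulse input and the two values of $\alpha$ below are assumptions chosen to show the contrast.

```python
def leaky_unit(values, alpha, mu0=0.0):
    """Running average: mu_t = alpha * mu_{t-1} + (1 - alpha) * v_t."""
    mu, trace = mu0, []
    for v in values:
        mu = alpha * mu + (1.0 - alpha) * v
        trace.append(mu)
    return trace

impulse = [1.0] + [0.0] * 19              # one input spike, then silence
slow = leaky_unit(impulse, alpha=0.95)    # alpha near 1: long memory
fast = leaky_unit(impulse, alpha=0.5)     # smaller alpha: quick forgetting

slow_retained = slow[-1] / slow[0]        # fraction left after 19 more steps
fast_retained = fast[-1] / fast[0]
```

After 19 further steps the slow unit still holds more than a third of its initial response, while the fast unit has essentially forgotten the spike.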

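An ordinary recurrent network connects the unit at time $t$ to the unit at time $t+1$. It is also possible to add connections that span a longer interval, that is, connections with a time delay.

As the number of time steps grows, gradients may vanish or explode.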

The Long Short-Term Memory and Other Gated RNNs

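The most effective sequence models in practical applications are gated RNNs, which include the long short-term memory (LSTM) and networks based on the gated recurrent unit (GRU).

The idea is to create paths through time along which the gradient neither vanishes nor explodes. Gated units can accumulate information over a long duration. Once that information has been used, it may be useful for the network to forget the old state. Rather than deciding by hand when to forget, gated RNNs learn when to do so.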

LSTM

Understanding LSTM Networks

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

LSTM block diagram:

Each cell has the same inputs and outputs as an ordinary recurrent network, but has more parameters and a system of gating units that controls the flow of information.
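One step of an LSTM cell can be sketched as below, using the commonly presented formulation with forget, input, and output gates acting on the concatenation of the previous hidden state and the current input. The parameter names, sizes, and initialization are assumptions. The second call forces the forget gate toward 1 and the input gate toward 0 to show that the cell state is then carried over almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    z = np.concatenate([h_prev, x])
    f = sigmoid(P["Wf"] @ z + P["bf"])        # forget gate
    i = sigmoid(P["Wi"] @ z + P["bi"])        # input gate
    o = sigmoid(P["Wo"] @ z + P["bo"])        # output gate
    c_tilde = np.tanh(P["Wc"] @ z + P["bc"])  # candidate cell state
    c = f * c_prev + i * c_tilde              # gated update of the cell state
    h = o * np.tanh(c)                        # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_h = 3, 4  # illustrative sizes
P = {name: rng.normal(scale=0.1, size=(n_h, n_h + n_in))
     for name in ("Wf", "Wi", "Wo", "Wc")}
P.update({name: np.zeros(n_h) for name in ("bf", "bi", "bo", "bc")})
x, h_prev, c_prev = rng.normal(size=n_in), np.zeros(n_h), np.ones(n_h)

h, c = lstm_step(x, h_prev, c_prev, P)

# Large forget bias and very negative input bias: the cell keeps its old state
P_keep = dict(P, bf=np.full(n_h, 20.0), bi=np.full(n_h, -20.0))
_, c_kept = lstm_step(x, h_prev, c_prev, P_keep)
```

The additive update `c = f * c_prev + i * c_tilde` is what gives the gradient a path through time that is not repeatedly squashed by a nonlinearity.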