TensorFlow in Action: Chapter 7, Part 1 (Introduction to RNNs and RNN Applications in NLP)

Article link: https://blog.csdn.net/u011974639/article/details/77377784

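Introduction to RNNs
Natural language modeling
RNN APIs in TensorFlow
References

Introduction to RNNs

Recurrent neural networks (RNNs) are a family of neural networks for processing sequential data. Much as a convolutional network is specialized for processing a grid of values $X$ (such as an image), a recurrent network is specialized for processing a sequence $x^{(1)},...,x^{(\tau)}$. Just as convolutional networks scale readily to images with large width and height and can process images of variable size, recurrent networks scale to much longer sequences, and most of them can process sequences of variable length.

To go from multilayer networks to recurrent networks, we take advantage of one of the early ideas from 1980s machine learning and statistical modeling: sharing parameters across different parts of a model. Parameter sharing lets the model extend to examples of different forms (here, different lengths) and generalize across them. If we had separate parameters for each time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time.

For simplicity, we describe RNNs as operating on a sequence that contains vectors $x^{(t)}$, with the time index $t$ ranging from $1$ to $\tau$. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length $\tau$ for each member of the minibatch. RNNs can also be applied to data spanning two spatial dimensions, such as images. Even when the data involves time, the network may have connections that go backward in time, provided the entire sequence is observed before it is fed to the network.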

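Sequence modeling: unfolding computational graphs

A computational graph is a way to formalize the structure of a set of computations, such as those that map inputs and parameters to outputs and a loss. Here we explain the idea of unfolding a recursive or recurrent computation into a computational graph with a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in parameter sharing across a deep network structure.

For example, consider the classical form of a dynamical system:

$s^{(t)}=f(s^{(t-1)};\theta)=f(f(s^{(t-2)};\theta);\theta)=...$

where $s^{(t)}$ is called the state of the system. The equation is recurrent because the definition of $s$ at time $t$ refers back to the same definition at time $t-1$.
Unfolding the equation in this way yields an expression that involves no recurrence, which can be represented by a traditional directed acyclic computational graph.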

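As another example, consider a dynamical system driven by an external signal $x^{(t)}$:

$s^{(t)}=f(s^{(t-1)},x^{(t)};\theta)$
Here the current state contains information about the whole past sequence.

Many recurrent neural networks use the following equation, or a similar one, to define the values of their hidden units. To indicate that the state is the hidden units of the network, we rewrite the equation using the variable $h$ for the state:

$h^{(t)}=f(h^{(t-1)},x^{(t)};\theta)$

A typical RNN adds extra architectural features on top of this recurrence. Drawing the recurrence as one node per time step is the operation we call unfolding.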

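When a recurrent network is trained to predict the future from the past, it typically learns to use $h^{(t)}$ as a summary of the task-relevant aspects of the past inputs up to $t$. The summary is necessarily lossy, because it maps an arbitrary-length sequence $(x^{(t)},x^{(t-1)},...,x^{(2)},x^{(1)})$ to a fixed-length vector $h^{(t)}$. Depending on the training criterion, the summary may selectively keep some aspects of the past sequence with more precision than others. For example, an RNN used in statistical language modeling, typically to predict the next word given the previous words, may not need to store all the information in the input sequence up to time $t$, only enough to predict the rest of the sentence.

We can represent the unfolded recurrence after $t$ steps with a function $g^{(t)}$:

$h^{(t)}=g^{(t)}(x^{(t)},x^{(t-1)},...,x^{(2)},x^{(1)})=f(h^{(t-1)},x^{(t)};\theta)$

The function $g^{(t)}$ takes the whole past sequence $(x^{(t)},x^{(t-1)},...,x^{(2)},x^{(1)})$ as input and produces the current state, but the unfolded recurrent structure lets us factorize $g^{(t)}$ into repeated applications of a function $f$. The unfolding process thus introduces two major advantages:

Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another rather than in terms of a variable-length history of states.
The same transition function $f$, with the same parameters, can be used at every time step.

These two factors make it possible to learn a single model $f$ that operates on all time steps and all sequence lengths, rather than a separate model $g^{(t)}$ for every possible time step. Learning a single shared model allows generalization to sequence lengths that did not appear in the training set, and lets the model be estimated from far fewer training examples than would be needed without parameter sharing.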

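Recurrent neural networks

Building on the ideas of unfolding and parameter sharing, we can design a wide variety of recurrent neural networks.

1. Recurrent networks that produce an output at each time step and have recurrent connections between hidden units

We now develop the forward propagation equations for this RNN. The figure does not specify the activation function of the hidden units; here we assume the hyperbolic tangent. Nor does it specify exactly what form the output and loss function take; we assume the output is discrete, as if the RNN were used to predict words or characters. A natural way to represent discrete variables is to regard the output $o$ as the unnormalized log-probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector $\hat{y}$ of normalized probabilities over the output. Forward propagation begins with a specified initial state $h^{(0)}$. Then, for each time step from $t=1$ to $t=\tau$, we apply the following update equations:

$a^{(t)}=b+Wh^{(t-1)}+Ux^{(t)}$ (where $a^{(t)}$ is the input to the hidden units)
$h^{(t)}=\tanh(a^{(t)})$ (the activation function is tanh)
$o^{(t)}=c+Vh^{(t)}$
$\hat{y}^{(t)}=\mathrm{softmax}(o^{(t)})$

where the parameters are the bias vectors $b$ and $c$ along with the weight matrices $U$, $V$ and $W$, for input-to-hidden, hidden-to-output and hidden-to-hidden connections respectively. This recurrent network maps an input sequence to an output sequence of the same length. The total loss for a given sequence of $x$ values paired with a sequence of $y$ values is the sum of the losses over all the time steps.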
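The four update equations can be traced in a few lines of NumPy. This is only a minimal sketch: the layer sizes, the random initialization and the helper names `softmax` and `rnn_forward` are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, h0, U, W, V, b, c):
    """Apply the update equations for t = 1..tau; return hidden states and outputs."""
    h = h0
    hs, y_hats = [], []
    for x in xs:                      # one iteration per time step
        a = b + W @ h + U @ x         # a^(t): input to the hidden units
        h = np.tanh(a)                # h^(t): hidden state
        o = c + V @ h                 # o^(t): unnormalized log-probabilities
        hs.append(h)
        y_hats.append(softmax(o))     # y_hat^(t): normalized probabilities
    return hs, y_hats

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 4, 5, 6
U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden-to-output
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = [rng.normal(size=n_in) for _ in range(tau)]

hs, y_hats = rnn_forward(xs, np.zeros(n_hidden), U, W, V, b, c)
```

The same `U`, `W`, `V`, `b` and `c` are reused at every step, which is the parameter sharing described above; the loop also makes the sequential dependence of each step on the previous one explicit.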

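Computing the gradient of this loss function with respect to the parameters is an expensive operation during training. It involves a forward propagation pass moving left to right through the unfolded graph, followed by a backward propagation pass moving right to left. The runtime is $O(\tau)$ and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step can be computed only after the previous one. States computed in the forward pass must be stored until they are reused in the backward pass, so the memory cost is also $O(\tau)$. The back-propagation algorithm applied to the unfolded graph with $O(\tau)$ cost is called back-propagation through time (BPTT).

2. Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next

A network whose only recurrence runs from the output at one time step to the hidden units at the next is less powerful. Because it lacks hidden-to-hidden recurrence, it requires the output units to capture all the information about the past that the network will use to predict the future. Since the output units are explicitly trained to match the training-set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time $t$ with the training target at time $t$, all the time steps are decoupled. Training can therefore be parallelized, with the gradient for each step $t$ computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

During training, the target output of the previous time step is fed directly into the hidden units of the next time step, which is what allows the computation to run in parallel.

3. Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output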

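Bidirectional RNNs

All the recurrent networks considered so far have a "causal" structure, meaning that the state at time $t$ captures information only from the past inputs $x^{(1)},x^{(2)},...,x^{(t-1)}$ and the present input $x^{(t)}$. Some of the models we have discussed also allow information from past $y$ values to affect the current state when the $y$ values are available.

In many applications, however, we want to output a prediction of $y^{(t)}$ that may depend on the whole input sequence. In speech recognition, for example, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and may even depend on the next few words because of the linguistic dependencies between nearby words: if there are two acoustically plausible interpretations of the current word, we may have to look far into the future (and the past) to disambiguate them. The same is true of handwriting recognition and many other sequence-to-sequence learning tasks.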

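Encoder-decoder sequence-to-sequence (Seq2Seq) architectures

Earlier figures showed how an RNN can map an input sequence to a fixed-size vector, and how it can map an input sequence to an output sequence of the same length. Here we discuss how an RNN can be trained to map an input sequence to an output sequence that is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation (Chinese and English sentences usually differ in length) and question answering, where the input and output sequences in the training set are generally not of the same length, although their lengths may be related.

We often call the input to the RNN the "context". We want to produce a representation of this context, $C$. The context $C$ might be a vector or a sequence of vectors that summarizes the input sequence $X=(x^{(1)},x^{(2)},...,x^{(n_x)})$.

The RNN architecture for mapping one variable-length sequence to another is called the encoder-decoder or sequence-to-sequence architecture. The idea is very simple:

An encoder (also called a reader or input RNN) processes the input sequence.
The encoder emits the context $C$, usually as a simple function of its final hidden state.
A decoder (also called a writer or output RNN) is conditioned on that fixed-length vector to generate the output sequence.

The innovation of this architecture over those presented earlier is that the lengths $n_x$ and $n_y$ can differ from each other. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize $\log P(y^{(1)},...,y^{(n_y)}|x^{(1)},...,x^{(n_x)})$. The last state $h_{n_x}$ of the encoder RNN is typically used as the representation $C$ of the input sequence and is provided as input to the decoder RNN.

One clear limitation of this architecture appears when the context $C$ output by the encoder RNN has a dimension too small to properly summarize a long sequence. One remedy is to make $C$ a variable-length sequence rather than a fixed-size vector, and to introduce an attention mechanism that associates elements of the sequence $C$ with elements of the output sequence.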

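Dependencies and limitations of RNNs

The challenge of long-term dependencies

When training recurrent networks, gradients propagated over many stages tend either to vanish (most of the time) or to explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (it can store memories, and its gradients do not explode), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (which involve the multiplication of many Jacobians) compared with short-term ones.

In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

$h^{(t)}=W^Th^{(t-1)}$

as a very simple recurrent neural network lacking a nonlinear activation function and lacking inputs $x$. This recurrence essentially describes the power method. It may be simplified to

$h^{(t)}=(W^t)^Th^{(0)}$

and if $W$ admits an eigendecomposition of the form

$W=Q\Lambda Q^T$

with orthogonal $Q$, the recurrence may be simplified further to

$h^{(t)}=Q^T\Lambda^tQh^{(0)}$

The eigenvalues are raised to the power of $t$, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of $h^{(0)}$ that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight $w$ by itself many times. The product $w^t$ will either vanish or explode depending on the magnitude of $w$.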
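The effect of raising eigenvalues to the power $t$ can be checked numerically. The sketch below assumes a hypothetical symmetric $2\times2$ matrix built from a rotation $Q$ and eigenvalues 1.1 and 0.9 (values chosen only for illustration), applies the linear recurrence 50 times, and then reads off the coordinates of the result along the two eigenvectors.

```python
import numpy as np

# Hypothetical example: W = Q diag(lam) Q^T with one eigenvalue above 1 and one below.
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal matrix
lam = np.array([1.1, 0.9])
W = Q @ np.diag(lam) @ Q.T

h0 = np.array([1.0, 1.0])
h = h0.copy()
steps = 50
for _ in range(steps):
    h = W.T @ h                      # h^(t) = W^T h^(t-1)

coords = Q.T @ h                     # coordinates of h^(t) along the eigenvectors
expected = lam ** steps * (Q.T @ h0) # each coordinate is scaled by its eigenvalue^t
print(coords)
```

The coordinate along the eigenvector with eigenvalue 1.1 grows by a factor of about 117, while the one with eigenvalue 0.9 shrinks by a factor of about 200, so after enough steps only the dominant direction survives.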

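Problems encountered when training RNNs

What goes wrong during RNN training?

To explain informally why RNNs are hard to train, the figure plots cost against training epoch for an RNN trained several times:

In some runs the cost fluctuates sharply or even explodes, and only with some luck does training converge. This shows that RNNs are hard to train and hard to get working in practice.

Why does RNN training run into trouble?

The recurrent structure explains why the cost is unstable. The figure shows an extremely simplified RNN:

In this simplified RNN we assume 1000 neurons connected in a chain, with an otherwise trivial structure, as in the figure.

Because the parameters are shared, setting the weight slightly above 1 or slightly below 1 makes the output change dramatically after many recurrent iterations.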
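A short experiment illustrates the point. The sketch assumes the simplest possible chain, in which an input of 1 is multiplied by the same shared weight at each of 1000 steps; the function name `repeated_weight` is hypothetical.

```python
def repeated_weight(w, steps=1000, x=1.0):
    """Multiply an input by the same shared weight `steps` times."""
    for _ in range(steps):
        x *= w
    return x

for w in (0.99, 1.0, 1.01):
    print(w, repeated_weight(w))
```

A change of one percent in the weight moves the output from roughly 4e-5 to roughly 2e4, which is why the cost surface of such a network is nearly flat in some regions and extremely steep in others.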

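Mathematically speaking, strongly nonlinear functions, such as those computed by a recurrent network over many time steps, tend to have derivatives that are either very large or very small in magnitude. As the figure shows, the objective function has a "landscape" with "cliffs": wide and rather flat regions separated by tiny regions where the objective changes quickly, forming a kind of cliff.

With several parameters, the gradient-descent landscape shows the cliff very clearly:

The difficulty is that when the parameter gradient is very large, a gradient descent update can throw the parameters very far, into a region where the objective function is larger, undoing much of the work done to reach the current solution. The gradient tells us the direction of steepest descent only within an infinitesimal region around the current parameters. Outside this region, the cost function may begin to curve back upward. The update must be chosen small enough to avoid traversing too much upward curvature. We typically use learning rates that decay slowly enough that consecutive steps have approximately the same learning rate. A step size that is appropriate for a relatively linear part of the landscape is often inappropriate and causes uphill motion when we enter a more curved part of the landscape on the next step.

Remedy: exploding gradients

A simple solution has been used by practitioners for many years: clipping the gradient. There are different instances of this idea.

One option is to clip the parameter gradient from a minibatch element-wise, just before the parameter update.

Another is to clip the norm $||g||$ of the gradient $g$ just before the parameter update:

$\text{if } ||g||>v: \quad g \leftarrow \frac{gv}{||g||}$
where $v$ is the norm threshold and $g$ is used to update the parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage of guaranteeing that each step is still in the gradient direction, but experiments suggest that both forms work similarly. Although the parameter update has the same direction as the true gradient, with gradient norm clipping the norm of the update vector is now bounded. This bounded gradient avoids taking a detrimental step when the gradient explodes.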
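Norm clipping is only a few lines of code. The sketch below is a plain-Python illustration, assuming the gradient is given as a list of parameter groups (for example weights and biases); the function name `clip_by_norm` and the example values are hypothetical.

```python
import math

def clip_by_norm(grads, v):
    """Rescale all gradient groups jointly if their global L2 norm exceeds v."""
    norm = math.sqrt(sum(g * g for group in grads for g in group))
    if norm > v:
        scale = v / norm                              # g <- g * v / ||g||
        grads = [[g * scale for g in group] for group in grads]
    return grads, norm

# Two hypothetical parameter groups whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_norm(grads, v=1.0)
clipped_norm = math.sqrt(sum(g * g for group in clipped for g in group))
```

Because a single scale factor is applied to every group, the clipped gradient points in the same direction as the original one; only its length is reduced to `v`.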

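Remedy: vanishing gradients

Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, two ideas are worth discussing:

Create paths in the computational graph of the unfolded recurrent architecture along which the product of the gradients associated with the arcs is close to 1. One way to achieve this is with LSTMs and other self-loop and gating mechanisms (LSTM is introduced below).
Regularize or constrain the parameters so as to encourage "information flow". In particular, we would like the gradient vector $\nabla_{h^{(t)}}L$ being back-propagated to maintain its magnitude, even if the loss function penalizes only the output at the end of the sequence.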

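Gated RNNs

Long short-term memory (LSTM)

The most effective sequence models used in practical applications today are called gated RNNs. These include networks based on long short-term memory (LSTM) and on the gated recurrent unit (GRU). Gated RNNs are based on the idea of creating paths through time whose derivatives neither vanish nor explode. They do this with connection weights that may change at each time step.

Why was LSTM proposed?

The key to how an RNN works is that it uses historical information (a bidirectional RNN uses the whole sequence) to help with the current decision. An RNN can exploit information that conventional neural network structures cannot model, but this also brings a bigger technical challenge: the problem of long-term dependencies.

For some problems, the model needs only recent information to perform the current task. For example, to predict "blue" in the phrase "the color of the sea is blue", the model does not need to remember any longer context before the phrase, since "sea" and "color" already carry enough information. In such cases the gap between the relevant information and the word to be predicted is small, and an RNN can readily use the earlier information.

Other contexts are more complex, however, such as reading-comprehension questions, where short-term dependencies alone are not enough. As the analysis above of RNN training problems shows, in such complex settings the gradients of a recurrent network easily explode or vanish.

Long short-term memory (LSTM) networks handle this problem comparatively well. Unlike a recurrent cell with a single tanh layer, an LSTM is a special network structure with three "gates". The gates let information selectively affect the state of the recurrent network at each time step. A gate is usually built from a sigmoid neural network layer combined with an element-wise multiplication.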

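Structure of the LSTM

The LSTM block is shown in the figure, for the case of a shallow recurrent network architecture. In addition to the outer recurrence of the RNN, LSTM recurrent networks have "LSTM cells" with an internal recurrence (a self-loop). An LSTM therefore does not simply apply an element-wise nonlinearity to the affine transformation of inputs and recurrent units. Each cell has the same inputs and outputs as an ordinary recurrent unit, but has more parameters and a system of gating units that controls the flow of information.

An LSTM block has four inputs:

Input: the input to the block
Input gate: controls the input
Forget gate: controls whether the memory cell is updated
Output gate: controls the output

An LSTM block can be viewed as a single recurrent unit of the original RNN: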

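LSTM propagation equations

The figure shows a simplified LSTM structure:

The cell memory inside each LSTM block is updated as:

$c' = g(z)f(z_i)+cf(z_f)$

and the output is:

$a = h(c')f(z_o)$

The outputs of the input gate, forget gate and output gate are $f(z_i)$, $f(z_f)$ and $f(z_o)$ respectively. The input $x^t$ is connected through separate weights to $z_f$ (forget gate), $z_i$ (input gate), $z$ (block input) and $z_o$ (output gate):

The overall architecture of an LSTM recurrent network (excerpt):

In a recurrent network built from multiple LSTM units, the gates of each unit are controlled as follows:

Forget gate: determined jointly by the current input $x_t$, the previous state $c_{t-1}$ and the previous output $h_{t-1}$.
Input gate: uses the input $x_t$, $c_{t-1}$ and $h_{t-1}$ to decide which parts enter the current state $c_t$.
Output gate: uses the current state $c_t$, the current input $x_t$ and the previous output $h_{t-1}$ to determine the output $h_t$ at this time step.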

Full LSTM equations

The most important component of the LSTM is the state unit $s_i^{(t)}$. Its self-loop weight is controlled by a forget gate unit $f_i^{(t)}$ (for time step $t$ and cell $i$), which sets this weight to a value between 0 and 1 via a sigmoid unit:

$f_i^{(t)}=\sigma(b_i^{f}+\sum_j U_{i,j}^{f}x_j^{(t)}+\sum_j W_{i,j}^{f}h_j^{(t-1)})$
where $x^{(t)}$ is the current input vector and $h^{(t)}$ is the current hidden layer vector, containing the outputs of all the LSTM cells, and $b^{f}$, $U^{f}$, $W^{f}$ are respectively the biases, input weights and recurrent weights of the forget gates. The LSTM cell internal state is then updated as follows, with a conditional self-loop weight $f_i^{(t)}$:

$s_i^{(t)}=f_i^{(t)}s_i^{(t-1)}+g_i^{(t)}\sigma(b_i+\sum_j U_{i,j}x_j^{(t)}+\sum_j W_{i,j}h_j^{(t-1)})$

where $b$, $U$ and $W$ are respectively the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit $g_i^{(t)}$ is computed in the same way as the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

$g_i^{(t)}=\sigma(b_i^{g}+\sum_j U_{i,j}^{g}x_j^{(t)}+\sum_j W_{i,j}^{g}h_j^{(t-1)})$

The output $h_i^{(t)}$ of the LSTM cell can also be shut off via the output gate $q_i^{(t)}$, which also uses a sigmoid unit for gating:

$h_i^{(t)}=\tanh(s_i^{(t)})q_i^{(t)}$
$q_i^{(t)}=\sigma(b_i^{o}+\sum_j U_{i,j}^{o}x_j^{(t)}+\sum_j W_{i,j}^{o}h_j^{(t-1)})$

where $b^{o}$, $U^{o}$, $W^{o}$ are respectively the biases, input weights and recurrent weights of the output gate. Among the variants, one can choose to use the cell state $s_i^{(t)}$ as an extra input (with its own weight) into the three gates of the $i$-th unit, which would require three additional parameters.
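The equations above translate almost line by line into NumPy. The following is a minimal sketch of a single LSTM time step in vector form; the parameter dictionary layout, the helper `init_params` and the sizes are assumptions made for illustration, and this is not how TensorFlow implements its LSTM cells.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, rng):
    """Hypothetical helper: (b, U, W) for the cell input and for each of the three gates."""
    p = {}
    for name in ("", "f", "g", "o"):   # cell input, forget gate, input gate, output gate
        p["b" + name] = np.zeros(n_hidden)
        p["U" + name] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    return p

def lstm_step(x, h_prev, s_prev, p):
    """One time step: returns the new output h^(t) and internal state s^(t)."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)      # forget gate f^(t)
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)      # external input gate g^(t)
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)      # output gate q^(t)
    cell_in = sigmoid(p["b"] + p["U"] @ x + p["W"] @ h_prev)   # cell input, sigmoid as in the text
    s = f * s_prev + g * cell_in                               # state update with self-loop weight f
    h = np.tanh(s) * q                                         # gated output
    return h, s

rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)
h, s = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):      # a hypothetical sequence of five inputs
    h, s = lstm_step(x, h, s, params)
```

When the forget gate saturates at 1 and the input gate at 0, the state `s` is copied unchanged from one step to the next, which is the path along which gradients neither vanish nor explode.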

Other gated RNNs

The main variant covered here is the gated recurrent unit (GRU). The main difference from the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equation is:

$h_i^{(t)}=u_i^{(t-1)}h_i^{(t-1)}+(1-u_i^{(t-1)})\sigma(b_i+\sum_j U_{i,j}x_j^{(t)}+\sum_j W_{i,j}r_j^{(t-1)}h_j^{(t-1)})$

where $u$ stands for the "update" gate and $r$ for the "reset" gate. Their values are defined as usual:

$u_i^{(t)}=\sigma(b_i^{u}+\sum_j U_{i,j}^{u}x_j^{(t)}+\sum_j W_{i,j}^{u}h_j^{(t)})$
and
$r_i^{(t)}=\sigma(b_i^{r}+\sum_j U_{i,j}^{r}x_j^{(t)}+\sum_j W_{i,j}^{r}h_j^{(t)})$

The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, choosing either to copy it (at one extreme of the sigmoid) or to ignore it completely (at the other extreme) by replacing it with the new "target state" value (toward which the leaky integrator converges). The reset gates control which parts of the state are used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.
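The GRU update can be sketched in the same style. The code follows the time indices of the equations as written above, so the gates applied at step $t$ are computed from the previous input and previous state; many practical implementations compute the gates from the current input instead. The parameter layout, the helper `init_gru_params` and the sizes are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru_params(n_in, n_hidden, rng):
    """Hypothetical helper: (b, U, W) for the target state, update gate and reset gate."""
    p = {}
    for name in ("", "u", "r"):
        p["b" + name] = np.zeros(n_hidden)
        p["U" + name] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W" + name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    return p

def gru_step(x, x_prev, h_prev, p):
    """One step of the update above; the gates carry the index t-1, as in the text."""
    u = sigmoid(p["bu"] + p["Uu"] @ x_prev + p["Wu"] @ h_prev)     # update gate u^(t-1)
    r = sigmoid(p["br"] + p["Ur"] @ x_prev + p["Wr"] @ h_prev)     # reset gate r^(t-1)
    target = sigmoid(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))  # new target state
    return u * h_prev + (1.0 - u) * target                         # leaky integration

rng = np.random.default_rng(0)
params = init_gru_params(n_in=3, n_hidden=4, rng=rng)
h, x_prev = np.zeros(4), np.zeros(3)
for x in rng.normal(size=(5, 3)):      # a hypothetical sequence of five inputs
    h = gru_step(x, x_prev, h, params)
    x_prev = x
```

With the update gate near 1 the previous state is copied; with the update gate near 0 it is replaced by the target state, which matches the two extremes described in the paragraph above.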

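Many more variants can be designed around this theme. For example, the reset gate (or forget gate) output could be shared across multiple hidden units. Alternatively, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global and local control. However, several investigations of architectural variations of the LSTM and GRU found no variant that clearly beats both across a wide range of tasks. A crucial ingredient is the forget gate, and adding a bias of 1 to the LSTM forget gate makes the LSTM as strong as the best of the explored architectural variants.

Natural language modeling

Natural language processing (NLP) is the use of human languages, such as Chinese or English, by a computer. Computer programs typically read and emit specialized languages designed to allow efficient and unambiguous parsing by simple programs. Naturally occurring languages, by contrast, are often ambiguous and defy formal description. NLP includes applications such as machine translation, in which the learner must read a sentence in one human language and emit an equivalent sentence in another. Many NLP applications are based on language models that define a probability distribution over sequences of words, characters or bytes in a natural language.

To build an efficient model of natural language, we must usually use techniques that are specialized for processing sequential data. In many cases we regard natural language as a sequence of words rather than a sequence of individual characters or bytes. Because the total number of possible words is so large, word-based language models must operate on an extremely high-dimensional and sparse discrete space. Several strategies have been developed to make models of such a space efficient, both computationally and statistically.

n-gram

A language model defines a probability distribution over sequences of tokens in a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.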

Note:
From equation (12.5):

$P(THE,DOG,RAN,AWAY)=P_2(THE,DOG)P(RAN|THE,DOG)P(AWAY|DOG,RAN)$
$P_2(THE,DOG)P(RAN|THE,DOG)=P_3(THE,DOG,RAN)$ (by the definition of conditional probability)
$P(THE,DOG,RAN,AWAY)=P_3(THE,DOG,RAN)P(AWAY|DOG,RAN)$
From equation (12.6):
$P(AWAY|DOG,RAN)=\frac{P_3(DOG,RAN,AWAY)}{P_2(DOG,RAN)}$

Combining these:

$P(THE,DOG,RAN,AWAY)=P_3(THE,DOG,RAN)\frac{P_3(DOG,RAN,AWAY)}{P_2(DOG,RAN)}$
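The same computation can be carried out on counts. The sketch below estimates $P_2$ and $P_3$ as relative frequencies in a tiny made-up corpus and applies the trigram formula; the corpus, the function names and the absence of smoothing are all simplifying assumptions (an unseen bigram would cause a division by zero).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def trigram_sentence_prob(sentence, corpus):
    """P(w_1..w_T) = P3(w_1,w_2,w_3) * prod_t P3(w_{t-2},w_{t-1},w_t) / P2(w_{t-2},w_{t-1})."""
    c2, c3 = ngram_counts(corpus, 2), ngram_counts(corpus, 3)
    n2, n3 = sum(c2.values()), sum(c3.values())

    def p2(gram):
        return c2[gram] / n2

    def p3(gram):
        return c3[gram] / n3

    prob = p3(tuple(sentence[:3]))
    for t in range(3, len(sentence)):
        prob *= p3(tuple(sentence[t - 2:t + 1])) / p2(tuple(sentence[t - 2:t]))
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()
p = trigram_sentence_prob("THE DOG RAN AWAY".split(), corpus)
print(p)
```

Here the corpus contains 13 trigrams and 14 bigrams, so $P_3(THE,DOG,RAN)=2/13$, $P_3(DOG,RAN,AWAY)=1/13$ and $P_2(DOG,RAN)=2/14$, giving $14/169$.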

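Neural language models

Neural language models (NLMs) are a class of language model designed to overcome the curse of dimensionality by using a distributed representation of words to model natural language sequences (Bengio et al., 2001b). Unlike class-based n-gram models, neural language models can recognize that two words are similar without losing the ability to encode each word as distinct from the other. The distributed representation the model learns for each word lets it treat words with features in common similarly, which is how statistical strength is shared between them.

Why convert words to vectors?

Before neural language models, NLP typically turned words into discrete, isolated symbols, for example mapping "中国" (China) to feature number 5178 and "北京" (Beijing) to feature number 3987. This is one-hot encoding: each word corresponds to a vector in which exactly one entry is 1 and the rest are 0. Representing a whole vocabulary this way yields very high-dimensional vectors; if every word of an article is converted to such a vector, the article becomes a sparse matrix.

One-hot encoding has a problem: the feature numbers are essentially arbitrary, so they carry no information about how words relate to one another. The relationship between "China" and "Beijing" above, for instance, is lost in the encoding. Storing words as sparse vectors also means that more data is needed for training, because training on sparse data is less efficient and the computation is more cumbersome.

A natural alternative is to represent words as vectors in a vector space model, which maps words to continuous-valued vectors such that words with similar meanings (similar attributes) land near each other in the space.

What are the advantages of vector representations?

Neural language models can be viewed as a class of models that overcome the curse of dimensionality by modeling natural language sequences with distributed word representations. They recognize that two words are similar while still encoding each word as distinct. A vector space model shares statistical strength between one word (and its context) and other similar words (and their contexts); the main assumption that vector space models rely on in NLP is the distributional hypothesis.

Such models have many benefits. For example, if the words dog and cat map to representations that share many attributes, then sentences containing cat can inform the model's predictions for sentences containing dog, and vice versa. Because there are many such attributes, there are many ways in which generalization can happen, transferring information from each training sentence to an exponentially large number of semantically related sentences. The curse of dimensionality requires the model to generalize to a number of sentences that is exponential in the sentence length. The model counters this curse by relating each training sentence to an exponential number of similar sentences.

These word representations are sometimes called word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance $\sqrt2$ from each other. In the embedding space, words that frequently appear in similar contexts (or any pair of words sharing some "features" learned by the model) are close to each other. This often results in words with similar meanings being neighbors. Figure 12.3 zooms in on specific areas of a learned word embedding space to show how semantically similar words map to representations that are close to each other.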
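The claim about one-hot distances is easy to verify. The following sketch uses a hypothetical three-word vocabulary and plain Python lists; the helper names are illustrative.

```python
import math

def one_hot(index, size):
    """Vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["cat", "dog", "car"]          # hypothetical vocabulary
vecs = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

d_cat_dog = euclidean(vecs["cat"], vecs["dog"])
d_cat_car = euclidean(vecs["cat"], vecs["car"])
print(d_cat_dog, d_cat_car, math.sqrt(2))
```

Both distances equal $\sqrt2$, so in the one-hot space "cat" is exactly as far from "dog" as it is from "car"; a learned embedding is what allows the first pair to end up closer than the second.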

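Language model evaluation metric: perplexity

Word2Vec

Recurrent neural networks are the most commonly used neural network structure in natural language processing (NLP), with a standing similar to that of CNNs in image recognition. Word2Vec converts the words of a language into dense vectors that a computer can work with, which can then be used for other NLP tasks such as text classification, part-of-speech tagging and machine translation.

Word2Vec, also referred to as word embeddings, is a model that converts the words of a language into vector representations.

Types of vector space models

They fall roughly into two classes:

Count-based models, such as Latent Semantic Analysis. These count how often words co-occur with their neighbors in a corpus and then map the count statistics to a small, dense matrix.
Predictive models, such as neural probabilistic language models. These predict a word from its neighboring words and learn its vector in the process.

Word2Vec is a computationally efficient predictive model for learning word vectors from raw text. It comes in two flavors:

CBOW (Continuous Bag of Words) predicts the target word from the surrounding context words and suits smaller datasets.
Skip-Gram does the opposite, predicting the context words from the target word, and performs better on large corpora.

Predictive models are usually trained by maximum likelihood, maximizing the probability of the target word $w_t$ given the preceding words $h$. A serious drawback is the computational cost, since the probability of every word in the vocabulary has to be computed. Word2Vec's CBOW model does not need the full probabilistic model; it only trains a binary classifier that distinguishes the real target word from made-up (noise) words.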

Skip-Gram Word2Vec

This section mainly uses the Skip-Gram flavor of Word2Vec. First consider how training examples are built, using the sentence

$the \ quick \ brown \ fox \ jumped \ over \ the \ lazy \ dog$
as an example. We want a mapping between contexts and target words, where the context consists of the words to the left and right of a word. With a window size of 1, the mappings include
$[the,brown] \rightarrow quick$, $[quick,fox] \rightarrow brown$, $[brown,jumped] \rightarrow fox$, and so on.

Because the Skip-Gram model predicts the context from the target word, the training examples are no longer

$[the,brown] \rightarrow quick$ but $quick \rightarrow the$ and $quick \rightarrow brown$.
The training set becomes
$(quick,the)$, $(quick,brown)$, $(brown,quick)$, $(brown,fox)$, and so on.
During training we want the model to predict the context word $the$ from the target word $quick$. We also need random words as negative samples (noise): the predicted probability should be as large as possible on the positive sample $the$ and as small as possible on the randomly drawn negative samples. In practice, an optimization algorithm such as SGD updates the word-embedding parameters so that the loss on the probability distribution (the NCE loss) becomes as small as possible. Each word's embedding vector is adjusted step by step over the training loop until it settles at the position that best fits the corpus.
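The construction of (target, context) pairs can be sketched in a few lines. This version enumerates every context word inside the window deterministically, whereas the TensorFlow example later in the article samples context words at random; the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every word and each neighbor within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                         # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:5])
```

Words at the two ends of the sentence have only one neighbor, so nine words with a window of 1 give 16 pairs rather than 18.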

Implementing Word2Vec in TensorFlow

Writing the code

1. Import the modules, download the dataset and read it into a list

The code uses urlretrieve to download the compressed data file and checks that the file is complete.
If the file already exists at filename, the download is skipped. The dataset text8.zip is about 31.3 MB.

    import collections
    import math
    import os
    import random
    import zipfile

    import numpy as np
    import tensorflow as tf
    from six.moves import urllib  # urllib.request on both Python 2 and Python 3

    # Step 1: download the dataset
    url = 'http://mattmahoney.net/dc/'

    def maybe_download(filename, expected_bytes):
        '''
        Download the dataset if it is not present and check that it is complete.
        :param filename: dataset file name
        :param expected_bytes: expected file size in bytes
        :return: local file name
        '''
        if not os.path.exists(filename):
            filename, _ = urllib.request.urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)  # file metadata
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            print(statinfo.st_size)
            raise Exception(
                'Failed to verify ' + filename + '. Can you get to it with a browser?')
        return filename

    filename = maybe_download('text8.zip', 31344016)

    # Read the data into a list of strings
    def read_data(filename):
        '''
        Unzip the archive and read its first member into a list of words.
        :param filename: path of the zip file
        :return: list of words
        '''
        with zipfile.ZipFile(filename) as f:
            data = tf.compat.as_str(f.read(f.namelist()[0])).split()  # split on whitespace
        return data

    words = read_data(filename)
    print('Data size', len(words))

Output:

    ('Found and verified', 'text8.zip')
    ('Data size', 17005207)

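Function reference

urlretrieve: part of the urllib package, which is organized differently in Python 2 and Python 3.
Python 2: urllib.urlretrieve(url[, filename[, reporthook[, data]]])
Python 3: urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
Copies a network object denoted by a URL to a local file. If the URL points to a local file, the object is not copied unless a filename is supplied. Returns a tuple (filename, headers).

class zipfile.ZipFile(file, mode='r', compression=ZIP_STORED, allowZip64=True): the file argument can be a path to a file (a string) or a file-like object. ZipFile is a context manager and can be used with the with statement, for example: with ZipFile('spam.zip') as myzip:

ZipFile.namelist(): returns a list of archive members by name.

tf.compat.as_str: converts bytes or unicode to str, using UTF-8 encoding for text.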

2. Build the dataset dictionary, count word frequencies and encode the data

Build the vocabulary: use collections.Counter to count the frequency of each word in the word list and keep the 50,000 most frequent entries. Then store the vocabulary in a dict for fast lookup (average O(1) per query), and count how many words fall outside it.

Next, walk through the word list. For each word, check whether it is in the vocabulary; if so, convert it to its index, otherwise map it to 0 (UNK, unknown).

    # Step 2: build the dictionary and replace rare words with the UNK token
    vocabulary_size = 50000

    def build_dataset(words):
        count = [['UNK', -1]]  # [word, frequency] entries for the most frequent words
        count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
        dictionary = dict()  # word -> frequency rank
        for word, _ in count:
            dictionary[word] = len(dictionary)  # ranks follow the frequency order of count
        data = list()  # the corpus as a list of ranks; out-of-vocabulary words become 0 (UNK)
        unk_count = 0
        for word in words:
            if word in dictionary:
                index = dictionary[word]
            else:
                index = 0  # dictionary['UNK']
                unk_count += 1
            data.append(index)
        count[0][1] = unk_count
        # Reverse mapping: rank -> word
        reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
        return data, count, dictionary, reverse_dictionary

    data, count, dictionary, reverse_dictionary = build_dataset(words)
    del words  # Hint to reduce memory.
    print('Most common words (+UNK)', count[:5])
    # Sample data:
    # 'anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used'
    print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

    data_index = 0

Output:

    ('Most common words (+UNK)', [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)])
    ('Sample data', [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156], ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against'])

Function reference

class collections.Counter([iterable-or-mapping]): a dict subclass for counting hashable items. Elements are stored as dictionary keys and their counts as dictionary values.

Counter.most_common([n]): returns a list of the n most common elements and their counts.
>>> Counter('abracadabra').most_common(3)
[('a', 5), ('b', 2), ('r', 2)]

zip([iterable, ...]): pairs up the elements of the iterables as tuples (a list in Python 2, an iterator in Python 3).
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> list(zip(x, y))
[(1, 4), (2, 5), (3, 6)]

3. Generate training examples for Word2Vec

The sample data printed above corresponds to the following text:

$sample: anarchism \ originated \ as \ a \ term \ of \ abuse \ first \ used$

The sampling procedure is:

Check that batch_size, skip_window and num_skips are consistent.
Create a deque of odd length whose middle element is the target word, with the context words on either side.
Use batch_size and num_skips to decide how many target words a batch covers, starting from the current data_index.
Sample randomly from the words on both sides of the target (a targets_to_avoid list ensures that no word is sampled twice).
After sampling for one target word, advance the deque to the next target word, until the batch is full.

For example, with batch_size=8, num_skips=2 and skip_window=1, the resulting pairs should be

$(originated,anarchism);(originated,as)$
$(as,originated);(as,a)$
$(a,as);(a,term)$
$(term,a);(term,of)$

    # Step 3: generate a training batch for the skip-gram model.
    def generate_batch(batch_size, num_skips, skip_window):
        '''
        Generate one batch of training data.
        :param batch_size: batch size
        :param num_skips: number of examples generated per target word
        :param skip_window: window size on each side of the target word
        :return: batch (target word ids) and labels (context word ids)
        '''
        # data_index is the position in the corpus; it is global because
        # generate_batch is called repeatedly and must continue where it stopped.
        global data_index
        # batch_size must be a multiple of num_skips so that a batch holds
        # all the examples of each target word it contains.
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window  # at most one example per context word
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        # span is the number of words involved for one target word:
        # the target itself plus the words before and after it.
        span = 2 * skip_window + 1  # [ skip_window target skip_window ]
        # A deque with maxlen=span keeps only the last span items appended.
        buffer = collections.deque(maxlen=span)
        # Fill the buffer; later appends push out the oldest word.
        for _ in range(span):
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        # Each iteration generates the examples for one target word.
        for i in range(batch_size // num_skips):
            target = skip_window  # target label at the center of the buffer
            targets_to_avoid = [skip_window]  # positions already used
            for j in range(num_skips):  # num_skips examples per target word
                while target in targets_to_avoid:  # draw an unused position at random
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            buffer.append(data[data_index])  # read the next word; the oldest one drops out
            data_index = (data_index + 1) % len(data)
        return batch, labels

    batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
    # 'anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used'
    # 'originated'->'anarchism', 'originated'->'as'
    # 'as'->'originated', 'as'->'a' ; 'a'->'as', 'a'->'term' ...
    for i in range(8):
        print(batch[i], reverse_dictionary[batch[i]],
              '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

Output:

    (3084, 'originated', '->', 12, 'as')
    (3084, 'originated', '->', 5239, 'anarchism')
    (12, 'as', '->', 3084, 'originated')
    (12, 'as', '->', 6, 'a')
    (6, 'a', '->', 12, 'as')
    (6, 'a', '->', 195, 'term')
    (195, 'term', '->', 6, 'a')
    (195, 'term', '->', 2, 'of')

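Function reference

class collections.deque([iterable[, maxlen]]): returns a new deque (double-ended queue) object, initialized left to right (using append()) with data from the iterable. When maxlen is set and the deque is full, appending a new item discards an item from the opposite end.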
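The sliding-window behavior that generate_batch relies on can be seen in isolation. This small sketch uses made-up tokens and a window of three.

```python
from collections import deque

# A bounded deque acts as a sliding window over a stream of tokens.
buffer = deque(maxlen=3)
snapshots = []
for token in ["anarchism", "originated", "as", "a", "term"]:
    buffer.append(token)            # once full, the oldest token is dropped
    snapshots.append(list(buffer))

print(snapshots[-1])
```

After the fifth append the window holds the last three tokens, with the middle one playing the role of the target word.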

4. Define the training parameters and the network model

tf.nn.embedding_lookup looks up the embedding vectors embed that correspond to the input train_inputs.
NCE loss is used as the training objective.

    # Step 4: build and train a skip-gram model.
    batch_size = 128
    embedding_size = 128  # Dimension of the dense embedding vector (typically 50 to 1000).
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label.

    # We pick a random validation set to sample nearest neighbors. Here we limit the
    # validation samples to the words that have a low numeric ID, which by
    # construction are also the most frequent.
    valid_size = 16     # Number of validation words.
    valid_window = 100  # Validation words are drawn from the 100 most frequent words.
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)
    num_sampled = 64    # Number of noise words sampled as negative examples.

    graph = tf.Graph()
    with graph.as_default():
        # Input data.
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Ops and variables pinned to the CPU because of missing GPU implementation.
        with tf.device('/cpu:0'):
            # Look up embeddings for inputs.
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed = tf.nn.embedding_lookup(embeddings, train_inputs)

            # Construct the variables for the NCE loss.
            nce_weights = tf.Variable(
                tf.truncated_normal([vocabulary_size, embedding_size],
                                    stddev=1.0 / math.sqrt(embedding_size)))
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Compute the average NCE loss for the batch.
        # tf.nn.nce_loss automatically draws a new sample of the negative labels each
        # time we evaluate the loss.
        loss = tf.reduce_mean(
            tf.nn.nce_loss(weights=nce_weights,
                           biases=nce_biases,
                           labels=train_labels,
                           inputs=embed,
                           num_sampled=num_sampled,
                           num_classes=vocabulary_size))

        # Construct the SGD optimizer using a learning rate of 1.0.
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

        # Compute the cosine similarity between minibatch examples and all embeddings.
        # L2 norm of each embedding vector.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        # Dividing by the norm gives unit-length embeddings.
        normalized_embeddings = embeddings / norm
        # Embeddings of the validation words, and their similarity to every word.
        valid_embeddings = tf.nn.embedding_lookup(
            normalized_embeddings, valid_dataset)
        similarity = tf.matmul(
            valid_embeddings, normalized_embeddings, transpose_b=True)

        # Add variable initializer.
        init = tf.global_variables_initializer()

Function reference

numpy.random.choice(a, size=None, replace=True, p=None): draws a random sample from a and returns it as a 1-D array.
>>> np.random.choice(5, 3, replace=False)
array([3, 1, 0])
>>> # equivalent to np.random.permutation(np.arange(5))[:3]

a: 1-D array-like or int. If an ndarray, the sample is drawn from its elements. If an int, the sample is drawn from np.arange(a).
size: int or tuple of ints, optional. The output shape. If the given shape is (m, n, k), then m * n * k samples are drawn. If omitted, a single value is returned.

5. Train the network

    # Step 5: begin training.
    num_steps = 100001

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print("Initialized")

        average_loss = 0
        for step in range(num_steps):
            batch_inputs, batch_labels = generate_batch(
                batch_size, num_skips, skip_window)
            feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

            # We perform one update step by evaluating the optimizer op (including it
            # in the list of returned values for session.run()).
            _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += loss_val

            # Every 2000 steps, print the average loss.
            if step % 2000 == 0:
                if step > 0:
                    average_loss /= 2000
                # The average loss is an estimate of the loss over the last 2000 batches.
                print("Average loss at step ", step, ": ", average_loss)
                average_loss = 0

            # Every 10000 steps, compute the similarity between each validation word
            # and all words, and print the 8 nearest neighbors.
            # Note that this is expensive (~20% slowdown if computed every 500 steps).
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log_str = "Nearest to %s:" % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log_str = "%s %s," % (log_str, close_word)
                    print(log_str)
        final_embeddings = normalized_embeddings.eval()

Output:

The output below shows the average loss towards the end of 100,000 training steps and the words most similar to each validation word. For many words (numerals, pronouns, modal verbs, prepositions) the nearest neighbors are sensible, although for others they are still noisy.

    ('Average loss at step ', 92000, ': ', 4.7085344190597533)
    ('Average loss at step ', 94000, ': ', 4.6158797936439511)
    ('Average loss at step ', 96000, ': ', 4.7306651622056961)
    ('Average loss at step ', 98000, ': ', 4.6274294868111614)
    ('Average loss at step ', 100000, ': ', 4.6817108399868008)
    Nearest to history: cegep, lillian, list, extraction, akita, felis, tsar, imran,
    Nearest to of: microcebus, akita, callithrix, wct, including, yum, ssbn, dasyprocta,
    Nearest to up: out, thaler, them, him, daley, chlorophyll, back, hler,
    Nearest to his: their, her, its, the, s, my, ssbn, microcebus,
    Nearest to use: thaler, thibetanus, callithrix, akita, unassigned, victoriae, abitibi, shops,
    Nearest to seven: eight, six, five, nine, four, three, zero, callithrix,
    Nearest to d: b, r, p, layer, circ, six, bront, thaler,
    Nearest to he: it, she, they, who, there, never, microcebus, tamarin,
    Nearest to and: or, but, dasyprocta, while, agouti, akita, microcebus, when,
    Nearest to four: five, six, seven, eight, three, two, nine, zero,
    Nearest to not: they, usually, you, callithrix, now, it, still, often,
    Nearest to new: toole, alembert, trinomial, antennae, somers, aldiss, edward, cubism,
    Nearest to at: in, during, on, within, microcebus, dasyprocta, with, after,
    Nearest to called: UNK, enclosure, imran, and, used, microsite, specialises, webpages,
    Nearest to may: can, would, will, could, might, should, must, cannot,
    Nearest to people: rfcs, aalto, thaler, aorta, reservation, regulators, forces, access,

6. Visualize Word2Vec

The code above maps a vocabulary of 50,000 words to 128-dimensional vectors (embedding_size). To make them easier to inspect, sklearn.manifold.TSNE is used to reduce the 128 dimensions directly to 2, so that the words can be drawn on a 2-D plot (for readability, only the 50 most frequent words are plotted).

    # Step 6: visualize the embeddings.
    def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
        '''
        Plot the embeddings to show the effect of Word2Vec.
        :param low_dim_embs: word vectors reduced to 2 dimensions
        :param labels: the words to annotate the points with
        :param filename: output image file
        :return:
        '''
        assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
        plt.figure(figsize=(18, 18))  # in inches
        for i, label in enumerate(labels):
            x, y = low_dim_embs[i, :]
            plt.scatter(x, y)  # draw the point
            plt.annotate(label,  # draw the word itself
                         xy=(x, y),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom')
        plt.savefig(filename)

    try:
        from sklearn.manifold import TSNE
        import matplotlib.pyplot as plt

        # Reduce the 128-dimensional embeddings to 2 dimensions with t-SNE
        # and plot the 50 most frequent words.
        tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
        plot_only = 50
        low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
        labels = [reverse_dictionary[i] for i in range(plot_only)]
        plot_with_labels(low_dim_embs, labels)
    except ImportError:
        print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")

Output:

Words that lie close together in the plot are highly similar in meaning.

Function reference

class sklearn.manifold.TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5): t-SNE is a tool for visualizing high-dimensional data.
More details: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

TSNE.fit_transform(X[, y]): fits X into an embedded space and returns the transformed output.

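An LSTM-based language model in TensorFlow

Dataset

Penn Tree Bank (PTB) is a dataset commonly used for training language models. Its quality is fairly high, so it can be used to evaluate the accuracy of a language model, and it is small enough that training is fast.

Download and unpack the PTB dataset, making sure that the unpacked path matches the path used by the Python project below. The dataset has already been preprocessed: it contains 10,000 distinct words, has end-of-sentence markers, and maps rare words to a single special token.

On Linux, download and unpack it directly:

    wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz   # download
    tar xvf simple-examples.tgz                                        # unpack

Or download it from the website and then unpack it:

http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

To make the PTB dataset more convenient to use, the TensorFlow Models package includes a PTB reader that makes it easy to read the data.

If the Models module is already installed alongside TensorFlow, import it directly:

    from tensorflow.models.tutorials.rnn.ptb import reader

If the Models package is not available, fetch it with git:

    git clone https://github.com/tensorflow/models.git
    cd models/tutorials/rnn/ptb
    # keep the Python project, the unpacked data and the ptb files in the same directory

The PTB reader provides the ptb_raw_data function, which reads the raw PTB data and converts the words into word IDs.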

    # coding:utf8
    import reader

    # Location of the PTB dataset
    DATA_PATH = '/root/PycharmProjects/RNN/LSTM/simple-examples/data/'
    # Read the PTB data
    train_data, valid_data, test_data, _ = reader.ptb_raw_data(DATA_PATH)
    print(len(train_data))
    print(train_data[:100])

    '''
    Program output:
    929589
    [9970, 9971, 9972, 9974, 9975, 9976, 9980, 9981, 9982, 9983, 9984, 9986, 9987, 9988, 9989, 9991, 9992, 9993, 9994, 9995, 9996, 9997, 9998, 9999, 2, 9256, 1, 3, 72, 393, 33, 2133, 0, 146, 19, 6, 9207, 276, 407, 3, 2, 23, 1, 13, 141, 4, 1, 5465, 0, 3081, 1596, 96, 2, 7682, 1, 3, 72, 393, 8, 337, 141, 4, 2477, 657, 2170, 955, 24, 521, 6, 9207, 276, 4, 39, 303, 438, 3684, 2, 6, 942, 4, 3150, 496, 263, 5, 138, 6092, 4241, 6036, 30, 988, 6, 241, 760, 4, 1015, 2786, 211, 6, 96, 4]
    '''

The training data contains 929,589 words in total, arranged as one very long sequence. A special token marks the end of each sentence; in this dataset the ID of the end-of-sentence token is 2.

For training, the sequence has to be cut into segments of a fixed length. To perform this truncation and organize the data into batches, the PTB reader provides the ptb_producer function.

    # coding:utf8
    import reader

    # Location of the PTB dataset
    DATA_PATH = '/root/PycharmProjects/RNN/LSTM/simple-examples/data/'
    # Read the PTB data
    train_data, valid_data, test_data, _ = reader.ptb_raw_data(DATA_PATH)
    x, y = reader.ptb_producer(train_data, 4, 5)
    print(x)
    print(y)

    '''
    Output:
    Tensor("PTBProducer/StridedSlice:0", shape=(4, 5), dtype=int32)
    Tensor("PTBProducer/StridedSlice_1:0", shape=(4, 5), dtype=int32)
    '''
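Since the producer only returns tensors, it can help to see the batching on plain lists. The sketch below is an assumption about the general scheme (split the long ID sequence into batch_size rows, then slide a window of num_steps along the rows, with targets shifted by one position); the function `make_batches` is hypothetical and is not part of the PTB reader.

```python
def make_batches(ids, batch_size, num_steps):
    """Yield (x, y) batches of shape [batch_size][num_steps]; y is x shifted by one word."""
    batch_len = len(ids) // batch_size
    rows = [ids[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps      # number of batches per epoch
    for i in range(epoch_size):
        start = i * num_steps
        x = [row[start:start + num_steps] for row in rows]
        y = [row[start + 1:start + num_steps + 1] for row in rows]
        yield x, y

# Word IDs 0..19 stand in for a corpus.
batches = list(make_batches(list(range(20)), batch_size=2, num_steps=3))
first_x, first_y = batches[0]
print(first_x, first_y)
```

Each target row is the input row moved forward by one word, which is exactly the next-word prediction task that the language model is trained on.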

工程代码

#coding:utf8
#%%
# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import time

import numpy as np
import tensorflow as tf

import reader

#flags = tf.flags
#logging = tf.logging
#flags.DEFINE_string("save_path", None,
#                    "Model output directory.")
#flags.DEFINE_bool("use_fp16", False,
#                  "Train using 16-bit floats instead of 32bit floats")
#FLAGS = flags.FLAGS

#def data_type():
#    return tf.float16 if FLAGS.use_fp16 else tf.float32
#


class PTBInput(object):
    '''
    The input data.
    config provides batch_size and num_steps.
    num_steps:  number of unrolled steps of the LSTM
    epoch_size: number of training iterations needed in each epoch
    '''
    def __init__(self, config, data, name=None):
        self.batch_size = batch_size = config.batch_size
        self.num_steps = num_steps = config.num_steps
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
        self.input_data, self.targets = reader.ptb_producer(
            data, batch_size, num_steps, name=name)


# A PTBModel class describes the model, which makes it easier to maintain
# the state of the recurrent network
class PTBModel(object):
    '''
    The PTB model (a language model).
    input_: provides batch_size and num_steps
    config: provides hidden_size (number of LSTM units) and vocab_size (vocabulary size)
    '''
    def __init__(self, is_training, config, input_):
        self._input = input_

        batch_size = input_.batch_size
        num_steps = input_.num_steps
        size = config.hidden_size        # number of LSTM units
        vocab_size = config.vocab_size   # vocabulary size

        # Slightly better results can be obtained with forget gate biases
        # initialized to 1 but the hyperparameters of the model would need to be
        # different than reported in the paper.
        # Use tf.contrib.rnn.BasicLSTMCell as the default LSTM cell
        def lstm_cell():
            return tf.contrib.rnn.BasicLSTMCell(
                size, forget_bias=0.0, state_is_tuple=True)
        attn_cell = lstm_cell

        # When training with a dropout keep_prob below 1, wrap each lstm_cell
        # in a dropout layer via tf.contrib.rnn.DropoutWrapper
        if is_training and config.keep_prob < 1:
            def attn_cell():
                return tf.contrib.rnn.DropoutWrapper(
                    lstm_cell(), output_keep_prob=config.keep_prob)

        # Stack the cells built above into a multi-layer cell with
        # tf.contrib.rnn.MultiRNNCell
        cell = tf.contrib.rnn.MultiRNNCell(
            [attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)

        self._initial_state = cell.zero_state(batch_size, tf.float32)

        # Pin the embedding to the CPU.
        # embedding_lookup maps word IDs to word vectors. The embedding matrix
        # is vocab_size x size (one row per vocabulary entry, hidden_size columns).
        # Look up the input word vectors and apply dropout during training.
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [vocab_size, size], dtype=tf.float32)
            inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

        if is_training and config.keep_prob < 1:
            inputs = tf.nn.dropout(inputs, config.keep_prob)

        # Collect the LSTM output of every time step in a list, then pass it
        # through a fully connected layer to obtain the final output
        outputs = []
        state = self._initial_state
        with tf.variable_scope("RNN"):
            # To keep training tractable, backpropagation is truncated to a
            # fixed number of unrolled steps, num_steps
            for time_step in range(num_steps):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()  # reuse the variables
                # Feed inputs and state to the cell.
                # inputs has three dimensions: the sample within the batch,
                # the word position within the sample, and the word-vector dimension.
                # inputs[:, time_step, :] is the time_step-th word of every sample.
                (cell_output, state) = cell(inputs[:, time_step, :], state)
                # Append the current output to the outputs list
                outputs.append(cell_output)

        # Concatenate the outputs into shape [batch, size * num_steps],
        # then reshape to a 2-D tensor of shape [batch * num_steps, size]
        output = tf.reshape(tf.concat(outputs, 1), [-1, size])

        # Softmax layer: maps size-dimensional vectors to vocab_size logits over word IDs
        softmax_w = tf.get_variable(
            "softmax_w", [size, vocab_size], dtype=tf.float32)
        softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=tf.float32)
        # Network output
        logits = tf.matmul(output, softmax_w) + softmax_b

        # Cross entropy between logits and targets, computed with
        # tf.contrib.legacy_seq2seq.sequence_loss_by_example
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits],                               # predictions
            [tf.reshape(input_.targets, [-1])],     # targets: [batch_size, num_steps] flattened to 1-D
            [tf.ones([batch_size * num_steps], dtype=tf.float32)])
        # Total loss divided by the batch size
        self._cost = cost = tf.reduce_sum(loss) / batch_size
        self._final_state = state

        # Backpropagation is only defined for training; otherwise return here
        if not is_training:
            return

        # Learning rate _lr
        self._lr = tf.Variable(0.0, trainable=False)
        # Collect all trainable variables (tvars) and compute the gradients of cost.
        # tf.clip_by_global_norm caps the global norm of the gradients, which also
        # acts as a mild regularizer. This is gradient clipping: bounding the
        # gradient norm prevents exploding gradients.
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                          config.max_grad_norm)
        optimizer = tf.train.GradientDescentOptimizer(self._lr)
        # Apply the clipped gradients to all trainable variables with
        # optimizer.apply_gradients, and use
        # tf.contrib.framework.get_or_create_global_step() for the global step counter
        self._train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())

        # _new_lr is a placeholder for the new learning rate;
        # _lr_update assigns it to the current learning rate _lr via tf.assign
        self._new_lr = tf.placeholder(
            tf.float32, shape=[], name="new_learning_rate")
        self._lr_update = tf.assign(self._lr, self._new_lr)

    # assign_lr lets outside code control the model's learning rate
    def assign_lr(self, session, lr_value):
        session.run(self._lr_update, feed_dict={self._new_lr: lr_value})

    # Properties of PTBModel. The @property decorator makes the returned
    # values read-only, which prevents accidental modification.
    @property
    def input(self):
        return self._input

    @property
    def initial_state(self):
        return self._initial_state

    @property
    def cost(self):
        return self._cost

    @property
    def final_state(self):
        return self._final_state

    @property
    def lr(self):
        return self._lr

    @property
    def train_op(self):
        return self._train_op


'''
Parameters for models of different sizes
init_scale      initial scale of the network weights
learning_rate   initial learning rate
max_grad_norm   maximum norm of the gradients
num_layers      number of stacked LSTM layers
num_steps       number of unrolled steps for LSTM backpropagation
hidden_size     number of hidden units in the LSTM
max_epoch       number of epochs trained with the initial learning rate
max_max_epoch   total number of training epochs
keep_prob       dropout keep probability
lr_decay        learning rate decay factor
batch_size      number of samples per batch
vocab_size      vocabulary size
'''
class SmallConfig(object):
    """Small config."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 20
    hidden_size = 200
    max_epoch = 4
    max_max_epoch = 13
    keep_prob = 1.0   # 1.0 means no dropout
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000


class MediumConfig(object):
    """
    Medium config.
    init_scale is reduced so that the initial weights are not too large, which
    makes training gentler. hidden_size grows to 650, the number of epochs
    grows, and keep_prob is set to 0.5, i.e. dropout is now used.
    """
    init_scale = 0.05
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 35
    hidden_size = 650
    max_epoch = 6
    max_max_epoch = 39
    keep_prob = 0.5
    lr_decay = 0.8
    batch_size = 20
    vocab_size = 10000


class LargeConfig(object):
    """
    Large config.
    init_scale is reduced further and the maximum gradient norm is raised to 10.
    hidden_size grows to 1500, the number of epochs grows, and keep_prob is
    lowered to 0.35.
    """
    init_scale = 0.04
    learning_rate = 1.0
    max_grad_norm = 10
    num_layers = 2
    num_steps = 35
    hidden_size = 1500
    max_epoch = 14
    max_max_epoch = 55
    keep_prob = 0.35
    lr_decay = 1 / 1.15
    batch_size = 20
    vocab_size = 10000


class TestConfig(object):
    """Tiny config, for testing."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 1
    num_layers = 1
    num_steps = 2
    hidden_size = 2
    max_epoch = 1
    max_max_epoch = 1
    keep_prob = 1.0
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000


def run_epoch(session, model, eval_op=None, verbose=False):
    """Runs the model on the given data."""
    start_time = time.time()
    costs = 0.0
    iters = 0
    state = session.run(model.initial_state)

    fetches = {
        "cost": model.cost,
        "final_state": model.final_state,
    }
    if eval_op is not None:
        fetches["eval_op"] = eval_op

    for step in range(model.input.epoch_size):
        feed_dict = {}
        # Train or evaluate the model on the current batch
        for i, (c, h) in enumerate(model.initial_state):
            # Feed the state of every layer
            feed_dict[c] = state[i].c
            feed_dict[h] = state[i].h

        vals = session.run(fetches, feed_dict)  # run one step
        cost = vals["cost"]
        state = vals["final_state"]

        # Accumulate the cost over time steps and batches; the exponential of
        # the average cost per step is the perplexity
        costs += cost
        iters += model.input.num_steps

        if verbose and step % (model.input.epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps" %
                  (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
                   iters * model.input.batch_size / (time.time() - start_time)))

    return np.exp(costs / iters)  # perplexity


raw_data = reader.ptb_raw_data('simple-examples/data/')
train_data, valid_data, test_data, _ = raw_data

config = SmallConfig()
eval_config = SmallConfig()
eval_config.batch_size = 1
eval_config.num_steps = 1

with tf.Graph().as_default():
    initializer = tf.random_uniform_initializer(-config.init_scale,
                                                config.init_scale)

    with tf.name_scope("Train"):
        train_input = PTBInput(config=config, data=train_data, name="TrainInput")
        with tf.variable_scope("Model", reuse=None, initializer=initializer):
            m = PTBModel(is_training=True, config=config, input_=train_input)
        #tf.scalar_summary("Training Loss", m.cost)
        #tf.scalar_summary("Learning Rate", m.lr)

    with tf.name_scope("Valid"):
        valid_input = PTBInput(config=config, data=valid_data, name="ValidInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
            mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
        #tf.scalar_summary("Validation Loss", mvalid.cost)

    with tf.name_scope("Test"):
        test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
            mtest = PTBModel(is_training=False, config=eval_config,
                             input_=test_input)

    # Create the training supervisor, which provides a managed session
    sv = tf.train.Supervisor()
    with sv.managed_session() as session:
        for i in range(config.max_max_epoch):
            # Cumulative decay factor
            lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
            m.assign_lr(session, config.learning_rate * lr_decay)

            print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
            train_perplexity = run_epoch(session, m, eval_op=m.train_op,
                                         verbose=True)
            print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
            valid_perplexity = run_epoch(session, mvalid)
            print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

        test_perplexity = run_epoch(session, mtest)
        print("Test Perplexity: %.3f" % test_perplexity)

        # if FLAGS.save_path:
        #     print("Saving model to %s." % FLAGS.save_path)
        #     sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)

#if __name__ == "__main__":
#    tf.app.run()
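The gradient-clipping step used in `PTBModel` can be illustrated without TensorFlow. The sketch below follows the rule documented for `tf.clip_by_global_norm`: one L2 norm is computed over all gradients together, and if it exceeds the threshold every gradient is scaled by `clip_norm / global_norm`. The function here is a plain-Python stand-in with toy gradient values, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale a list of gradient vectors so that their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale is 1.0 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in vec] for vec in grads], global_norm


grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, clipped_norm)

small, _ = clip_by_global_norm([[0.3, 0.4]], clip_norm=5.0)
print(small)  # gradients below the threshold pass through unchanged
```

Because all gradients share one scale factor, the direction of the overall update is preserved; only its length is capped.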

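Output:
With the small network, training runs at roughly 30,000 words per second. After epoch 13, perplexity reaches about 41 on the training set, about 119 on the validation set and about 114 on the test set. Roughly speaking, a training perplexity of 41 means that on the training data the model has narrowed the choice of the next word down to about 41 candidates.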
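The "41 candidates" reading follows from how perplexity is computed in `run_epoch`: it is the exponential of the average per-word cross-entropy. The following sketch, with made-up probabilities, shows that a model which assigns probability 1/41 to every true word has a perplexity of exactly 41.

```python
import math


def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability given to the true words."""
    total_cost = sum(-math.log(p) for p in word_probs)
    return math.exp(total_cost / len(word_probs))


uniform_41 = perplexity([1.0 / 41] * 100)  # every true word gets probability 1/41
mixed = perplexity([0.5, 0.25, 0.125])     # geometric mean of 1/p, i.e. 64 ** (1/3)
print(round(uniform_41, 6), round(mixed, 6))
```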

Epoch: 11 Train Perplexity: 41.423
Epoch: 11 Valid Perplexity: 119.596
Epoch: 12 Learning rate: 0.004
0.004 perplexity: 62.332 speed: 31880 wps
0.104 perplexity: 45.575 speed: 30970 wps
0.204 perplexity: 50.202 speed: 31076 wps
0.304 perplexity: 48.064 speed: 31048 wps
0.404 perplexity: 47.202 speed: 31007 wps
0.504 perplexity: 46.485 speed: 30982 wps
0.604 perplexity: 44.948 speed: 30964 wps
0.703 perplexity: 44.271 speed: 30917 wps
0.803 perplexity: 43.513 speed: 30895 wps
0.903 perplexity: 42.061 speed: 30912 wps
Epoch: 12 Train Perplexity: 41.149
Epoch: 12 Valid Perplexity: 119.290
Epoch: 13 Learning rate: 0.002
0.004 perplexity: 62.055 speed: 31557 wps
0.104 perplexity: 45.379 speed: 30842 wps
0.204 perplexity: 50.000 speed: 30992 wps
0.304 perplexity: 47.882 speed: 31186 wps
0.404 perplexity: 47.030 speed: 31151 wps
0.504 perplexity: 46.318 speed: 31090 wps
0.604 perplexity: 44.788 speed: 30990 wps
0.703 perplexity: 44.114 speed: 30884 wps
0.803 perplexity: 43.359 speed: 30846 wps
0.903 perplexity: 41.911 speed: 30833 wps
Epoch: 13 Train Perplexity: 41.002
Epoch: 13 Valid Perplexity: 119.096
Test Perplexity: 114.660
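The learning rates printed in the log (0.004 for epoch 12, 0.002 for epoch 13) follow directly from the decay rule in the training loop. A small sketch using the `SmallConfig` values as defaults; the helper name `learning_rate` is only for illustration:

```python
def learning_rate(epoch_index, base_lr=1.0, lr_decay=0.5, max_epoch=4):
    """Learning rate for a 0-based epoch index, mirroring the rule in the training loop."""
    return base_lr * lr_decay ** max(epoch_index + 1 - max_epoch, 0)


schedule = [learning_rate(i) for i in range(13)]
# Epochs 12 and 13 correspond to indices 11 and 12.
print("%.3f %.3f" % (schedule[11], schedule[12]))
```

The rate stays at 1.0 for the first `max_epoch` epochs and is halved every epoch after that.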

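RNN APIs in TensorFlow

The previous sections covered the basic structure of RNNs and the LSTM network. This section takes a closer look at the RNN-related APIs that TensorFlow provides.

In TensorFlow 1.x, the RNN APIs live mainly in two modules, tf.nn and tf.contrib.rnn (tf.contrib was removed in TensorFlow 2.x).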

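RNN-related APIs under tf.nn

Embedding

Embedding-related APIs provided by TensorFlow (commonly used in NLP):

tf.nn.embedding_lookup
Looks up the entries of the embedding matrix (params) that correspond to the given ids (see the detailed write-up on the embedding_lookup function).
tf.nn.embedding_lookup_sparse
Similar in function to embedding_lookup.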
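Conceptually, an embedding lookup is just row selection from a matrix. The sketch below reproduces that idea with nested Python lists; it is a stand-in for illustration, not the `tf.nn.embedding_lookup` implementation, and the embedding values are made up.

```python
def embedding_lookup(params, ids):
    """Return the rows of params selected by ids, preserving the nesting of ids."""
    if isinstance(ids, int):
        return params[ids]
    return [embedding_lookup(params, i) for i in ids]


# 4-word vocabulary, embedding size 2 (toy values)
embedding = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
input_ids = [[0, 2], [3, 3]]  # shape [batch_size=2, num_steps=2]
vectors = embedding_lookup(embedding, input_ids)
print(vectors)  # nested shape [2, 2, 2]
```

An ID tensor of shape `[batch_size, num_steps]` therefore becomes a tensor of shape `[batch_size, num_steps, embedding_size]`, which is the input shape the RNN functions expect.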

Recurrent Neural Networks

Methods that TensorFlow provides for building RNNs. Most of them take an instance of an RNNCell subclass.

tf.nn.dynamic_rnn
tf.nn.bidirectional_dynamic_rnn
tf.nn.raw_rnn

tf.nn.dynamic_rnn

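Creates an RNN. The inputs argument has shape [batch_size, max_time, embedding_size]. Within one batch all sequences must have the same length; shorter sequences are zero-padded up to the common length.

For example, suppose inputs has shape [3, 10, 128]: the batch contains 3 sequences, the first of length 10, the second of length 5 and the third of length 6. Sequences 2 and 3 must be padded to length 10. That is only a single batch; if lengths across the whole dataset range from 10 to 40, padding everything to the global maximum wastes a lot of computation. dynamic_rnn addresses this: the padded length max_time may differ from one batch to the next, whereas the statically unrolled rnn requires the same fixed length for every batch.

dynamic_rnn also takes a sequence_length argument, a vector holding the true length of each sequence in the batch. The network uses it to ignore the padding: past a sequence's true length, the outputs are set to zero and the last valid state is copied through unchanged.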
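The effect of `sequence_length` can be sketched in plain Python. The loop below assumes the behaviour described in the TensorFlow documentation for `dynamic_rnn` (outputs zeroed and state copied through beyond each sequence's length); `dynamic_rnn_sketch` and the toy `running_sum_cell` are hypothetical helpers, not TensorFlow functions.

```python
def dynamic_rnn_sketch(cell, inputs, sequence_length, initial_state):
    """Run cell over batch-major inputs [batch][time], honouring per-example lengths."""
    outputs, final_states = [], []
    for seq, length in zip(inputs, sequence_length):
        state, seq_out = initial_state, []
        for t, x in enumerate(seq):
            if t < length:
                out, state = cell(x, state)
            else:
                out = 0.0  # padding step: zero output, state is left untouched
            seq_out.append(out)
        outputs.append(seq_out)
        final_states.append(state)
    return outputs, final_states


def running_sum_cell(x, state):
    """Toy stand-in for an RNN cell: the state accumulates the inputs."""
    new_state = state + x
    return new_state, new_state


padded = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 0.0, 0.0]]  # second sequence is padded from length 2 to 4
outputs, states = dynamic_rnn_sketch(running_sum_cell, padded, [4, 2], 0.0)
print(outputs)
print(states)
```

The final state of the second sequence is the state after its second (last real) step, so padding never contaminates it.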

dynamic_rnn overview

'''
Details:
Creates a recurrent neural network specified by RNNCell cell.
Performs fully dynamic unrolling of inputs.
'''
dynamic_rnn(
    cell,
    inputs,
    sequence_length=None,
    initial_state=None,
    dtype=None,
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None
)

Args:

cell: An instance of RNNCell.
inputs: The RNN inputs.

If time_major == False (default), inputs must have shape [batch_size, max_time, …], or be a nested tuple of such elements.
If time_major == True, inputs must have shape [max_time, batch_size, …], or be a nested tuple of such elements.

sequence_length: (optional) An int32/int64 vector sized [batch_size]. Sequences within a batch may differ in true length; sequence_length gives the length of each batch element in inputs.

initial_state: (optional) An initial state for the RNN.

If cell.state_size is an integer, this must be a Tensor of appropriate type and shape [batch_size, cell.state_size].
If cell.state_size is a tuple, this must be a tuple of tensors with shapes [batch_size, s] for s in cell.state_size.

time_major: The shape format of the inputs and outputs.

If true, inputs and outputs have shape [max_time, batch_size, depth].
If false, inputs and outputs have shape [batch_size, max_time, depth].
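The two layouts selected by `time_major` contain the same data with the first two axes swapped. A minimal sketch with nested lists (the depth axis is left out for brevity, and `to_time_major` is an illustrative helper):

```python
def to_time_major(batch_major):
    """Transpose [batch_size][max_time] nested lists to [max_time][batch_size]."""
    return [list(step) for step in zip(*batch_major)]


batch_major = [[11, 12, 13],   # sequence 0
               [21, 22, 23]]   # sequence 1
time_major = to_time_major(batch_major)
print(time_major)
```

In the time-major layout each outer element holds one time step for the whole batch, which is the order in which an RNN consumes the data; applying the same transpose again restores the batch-major layout.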

Connectionist Temporal Classification (CTC)

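CTC extends RNN sequence labelling. In an ordinary RNN model, the input sequence and the label sequence correspond one to one. In some sequence modelling problems the two are not aligned (the input sequence is much longer than the output sequence). CTC handles this by adding a blank symbol to the label set: the RNN labels every input frame, and the blank symbols and repeated predictions are removed at the end.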

tf.nn.ctc_loss
tf.nn.ctc_greedy_decoder
tf.nn.ctc_beam_search_decoder
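The final "remove repeats and blanks" step described above can be written in a few lines. This is a sketch of the collapsing rule only, not of `tf.nn.ctc_greedy_decoder`; treating index 0 as the blank is an assumption made for the example.

```python
def ctc_collapse(path, blank=0):
    """Merge repeated labels, then drop blanks (the CTC many-to-one mapping)."""
    decoded, previous = [], None
    for label in path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Frame-wise best labels; 0 is assumed to be the blank index in this sketch.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 7]
print(ctc_collapse(frames))
```

Note that the blank between the two runs of 3 keeps them separate, which is how CTC can still emit the same label twice in a row.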

RNN-related APIs under tf.contrib.rnn

Base interface for all RNN Cells

tf.contrib.rnn.RNNCell
RNNCell is the abstract object representing an RNN cell (the base class in the class hierarchy, which defines the common methods and properties).

Core RNN Cells for use with TensorFlow's core RNN methods

tf.contrib.rnn.BasicRNNCell
The most basic RNN cell.
tf.contrib.rnn.BasicLSTMCell
The basic LSTM cell; described in detail later in this article.
tf.contrib.rnn.GRUCell
Gated Recurrent Unit cell.
tf.contrib.rnn.LSTMCell
LSTM cell with more parameters and more features than BasicLSTMCell.
tf.contrib.rnn.LayerNormBasicLSTMCell
LSTM unit with layer normalization and recurrent dropout.

Classes storing split RNNCell state

tf.contrib.rnn.LSTMStateTuple
Used when an RNNCell is created with state_is_tuple=True.
It is the tuple used by LSTM cells for state_size, zero_state, and output state. Stores two elements: (c, h), in that order.

Core RNN Cell wrappers (RNNCells that wrap other RNNCells)

tf.contrib.rnn.MultiRNNCell
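Stacks several RNNCells into one multi-layer cell.
tf.contrib.rnn.LSTMBlockWrapper
A helper class that provides housekeeping for LSTM cells.
tf.contrib.rnn.DropoutWrapper
Adds dropout to the given cell.
tf.contrib.rnn.EmbeddingWrapper
Adds an input embedding to the given cell (using embedding_lookup directly is usually preferable).
tf.contrib.rnn.InputProjectionWrapper
Adds an input projection to the given cell (rarely used).
tf.contrib.rnn.OutputProjectionWrapper
Adds an output projection to the given cell (rarely used).
tf.contrib.rnn.DeviceWrapper
Ensures that an RNNCell runs on a particular device.
tf.contrib.rnn.ResidualWrapper
Ensures that the cell's inputs are added to its outputs (a residual connection).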
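All of these wrappers follow the same pattern: they hold an inner cell, expose the same call interface, and modify the inputs or outputs on the way through. The sketch below shows that pattern for a residual connection. `ScaleCell` and `ResidualWrapperSketch` are hypothetical toy classes written for this example; they are not the `tf.contrib.rnn` implementations.

```python
class ScaleCell:
    """Toy cell: output = 2 * input; the state counts the steps taken."""
    def __call__(self, x, state):
        return [2.0 * v for v in x], state + 1


class ResidualWrapperSketch:
    """Adds the cell's input to its output, as a residual connection does."""
    def __init__(self, cell):
        self._cell = cell

    def __call__(self, x, state):
        out, new_state = self._cell(x, state)
        return [o + v for o, v in zip(out, x)], new_state


cell = ResidualWrapperSketch(ScaleCell())
output, state = cell([1.0, -2.0], 0)
print(output, state)
```

Because the wrapper itself behaves like a cell, wrappers can be nested and the result can still be passed to anything that expects a cell.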

Block RNNCells

tf.contrib.rnn.LSTMBlockCell
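Basic LSTM cell in block form. forget_bias (default: 1) is added to the biases of the forget gate in order to reduce the scale of forgetting at the beginning of training (with a bias of 1, most forget gates start out open, which makes training easier).
tf.contrib.rnn.GRUBlockCell
Block GRU cell implementation (for details, see http://arxiv.org/abs/1406.1078)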

Fused RNNCells

APIs for fused RNN cells.

tf.contrib.rnn.FusedRNNCell
tf.contrib.rnn.FusedRNNCellAdaptor
tf.contrib.rnn.TimeReversedFusedRNN
tf.contrib.rnn.LSTMBlockFusedCell

LSTM-like cells

Extended variants of the LSTM cell.

tf.contrib.rnn.CoupledInputForgetGateLSTMCell
tf.contrib.rnn.TimeFreqLSTMCell
tf.contrib.rnn.GridLSTMCell

RNNCell wrappers

Additional wrappers for RNNCell.

tf.contrib.rnn.AttentionCellWrapper
tf.contrib.rnn.CompiledWrapper

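Building an LSTM recurrent network

TensorFlow provides the tf.contrib.rnn.BasicLSTMCell class for quickly building an LSTM module.

tf.contrib.rnn.BasicLSTMCell

Class BasicLSTMCell: creates a basic LSTM recurrent network cell.

forget_bias (default: 1) is added to the biases of the forget gate in order to reduce the scale of forgetting at the beginning of training.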
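Why a forget bias of 1 reduces forgetting early in training can be seen from the sigmoid that drives the forget gate. Right after initialisation the gate's pre-activation is close to zero, so the sketch below (with an assumed pre-activation of exactly 0) compares the gate value with and without the bias:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


pre_activation = 0.0  # assumed typical value right after initialisation
gate_without_bias = sigmoid(pre_activation + 0.0)
gate_with_bias = sigmoid(pre_activation + 1.0)
print(round(gate_without_bias, 3), round(gate_with_bias, 3))
```

With the bias, about 73% of the previous cell state is kept at each step instead of 50%, so information and gradients survive over more time steps before the weights have been trained.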

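BasicLSTMCell does not support cell clipping (clipping the values of the cell state, which is distinct from gradient clipping), a projection layer, or peephole connections.
For the advanced variants of the LSTM, use tf.nn.rnn_cell.LSTMCell.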

Properties

variables/weights: Returns the list of all layer variables/weights.

Returns: A list of variables/weights.

Methods

__init__

Initializes the basic LSTM cell.

When restoring from CudnnLSTM-trained checkpoints, CudnnCompatibleLSTMCell must be used instead.

__init__( num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None )

Parameter	Description
num_units	int. The number of units in the LSTM cell.
forget_bias	float. The bias added to the forget gates. Must be set to 0 when restoring from CudnnLSTM-trained checkpoints.
state_is_tuple	If True, accepted and returned states are 2-tuples of the c_state and m_state. If False, they are concatenated along the column axis. The latter behavior will soon be deprecated.
activation	Activation function of the inner states. Default: tanh.
reuse	(optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

add_loss/add_update

Adds loss/update tensor(s), potentially dependent on layer inputs.

Some losses/updates may depend on the inputs passed when calling a layer.
Hence, when the same layer is reused on different inputs a and b, some entries in layer.losses/layer.updates may depend on a and some on b. This method automatically keeps track of dependencies.

add_loss/update( losses/updates, inputs=None )

Parameter	Description
losses/updates	Loss/update tensor, or list/tuple of tensors.
inputs	Optional input tensor(s) that the loss(es)/update(s) depend on. Must match the inputs argument passed to the layer at the time the losses were created. If None is passed, the losses/updates are assumed to be unconditional, and will apply across all dataflows of the layer (e.g. weight regularization losses).

call

Long short-term memory cell (LSTM).

call( inputs, state )

Parameter	Description
inputs	2-D tensor with shape [batch_size x input_size].
state	An LSTMStateTuple of state tensors, each shaped [batch_size x self.state_size], if state_is_tuple has been set to True. Otherwise, a Tensor shaped [batch_size x 2 * self.state_size].

Returns:
A pair containing the new hidden state and the new state (either an LSTMStateTuple or a concatenated state, depending on state_is_tuple).
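To show what one `call` computes, here is a pure-Python sketch of a single LSTM step for one example (no batch dimension). It assumes the standard LSTM equations and, to my understanding, the gate ordering used by BasicLSTMCell (input gate i, candidate j, forget gate f, output gate o). `lstm_step`, `LSTMState` and the all-zero weights are illustrative assumptions, not TensorFlow code.

```python
import math
from collections import namedtuple

LSTMState = namedtuple("LSTMState", ["c", "h"])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, state, kernel, bias, forget_bias=1.0):
    """One LSTM step for a single example.

    kernel has len(x) + num_units rows and 4 * num_units columns, with the
    columns ordered as input gate i, candidate j, forget gate f, output gate o.
    """
    n = len(state.h)
    xh = list(x) + list(state.h)  # concatenate input and previous hidden state
    z = [sum(v * kernel[r][col] for r, v in enumerate(xh)) + bias[col]
         for col in range(4 * n)]
    i, j, f, o = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
    new_c = [state.c[k] * sigmoid(f[k] + forget_bias)
             + sigmoid(i[k]) * math.tanh(j[k]) for k in range(n)]
    new_h = [math.tanh(new_c[k]) * sigmoid(o[k]) for k in range(n)]
    return new_h, LSTMState(new_c, new_h)


# One input feature, one hidden unit, all-zero weights (toy values).
kernel = [[0.0] * 4, [0.0] * 4]
bias = [0.0] * 4
output, state = lstm_step([1.0], LSTMState(c=[2.0], h=[0.0]), kernel, bias)
print(output, state)
```

The return value has the same structure as described above: the output equals the new hidden state h, and the state tuple carries both c and h to the next step.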

zero_state

Return zero-filled state tensor(s).

zero_state( batch_size, dtype )

Parameter	Description
batch_size	int, float, or unit Tensor representing the batch size.
dtype	the data type to use for the state.

Returns:

If state_size is an int or TensorShape, then the return value is a N-D tensor of shape [batch_size x state_size] filled with zeros.
If state_size is a nested list or tuple, then the return value is a nested list or tuple (of the same structure) of 2-D tensors with the shapes [batch_size x s] for each s in state_size.

tf.contrib.rnn.MultiRNNCell

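Class MultiRNNCell: stacks multiple RNN cells (the stacked cells can be plain RNN cells, LSTM cells, GRU cells, and so on).

__init__( cells, state_is_tuple=True )

Parameter	Description
cells	list. The RNN cells that will be composed, in order, into one multi-layer RNN.
state_is_tuple	If True, accepted and returned states are n-tuples, where n = len(cells). If False, all states are concatenated along the column axis.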
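The stacking performed by MultiRNNCell amounts to feeding each layer's output into the next layer while every layer keeps its own state, which is why the state is an n-tuple. A minimal sketch with toy cells (`stacked_step` and `make_cell` are hypothetical helpers for illustration):

```python
def stacked_step(cells, x, states):
    """Feed x through the cells in order; each layer keeps its own state."""
    new_states = []
    layer_input = x
    for cell, state in zip(cells, states):
        layer_input, new_state = cell(layer_input, state)
        new_states.append(new_state)
    return layer_input, tuple(new_states)


def make_cell(gain):
    """Toy cell: output = gain * input + state; the new state is the output."""
    def cell(x, state):
        out = gain * x + state
        return out, out
    return cell


cells = [make_cell(2.0), make_cell(3.0)]
output, states = stacked_step(cells, 1.0, (0.0, 0.0))
print(output, states)

output2, states2 = stacked_step(cells, 1.0, states)
print(output2, states2)
```

The output of the stack is the output of the last layer, and the returned tuple has one state per layer, matching n = len(cells).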

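An LSTM demo in TensorFlow

#coding=utf-8
# Forward pass of an RNN built from a simple LSTM cell
import tensorflow as tf

lstm_hidden_size = 1
batch_size = 20
num_steps = 20

# Define an LSTM cell
lstm = tf.nn.rnn_cell.BasicLSTMCell(lstm_hidden_size)
# BasicLSTMCell provides zero_state to create an all-zero initial state
state = lstm.zero_state(batch_size, tf.float32)

# Accumulated loss
loss = 0.0
# In theory a recurrent network can handle sequences of any length, but to avoid
# vanishing gradients during training a maximum sequence length is fixed;
# num_steps denotes that length
for i in range(num_steps):
    # Variables are created at the first step and reused afterwards
    if i > 0:
        tf.get_variable_scope().reuse_variables()
    # Each iteration handles one time step of the sequence. The current input
    # (current_input) and the previous state (state) are passed to the LSTM cell,
    # which returns its output lstm_output and the updated state.
    # NOTE: current_input, fully_connected, calc_loss and expected_output are
    # placeholders that the reader has to supply.
    lstm_output, state = lstm(current_input, state)
    # Pass the LSTM output of this time step through a fully connected layer
    final_output = fully_connected(lstm_output)
    # Add the loss of this time step
    loss += calc_loss(final_output, expected_output)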

References

Deep Learning (《深度学习》), Ian Goodfellow
《TensorFlow实战Google深度学习框架》 (TensorFlow in Practice: Google's Deep Learning Framework), 郑泽宇 (Zheng Zeyu)


