TensorFlow学习（十三）：构造LSTM超长简明教程

本文链接：https://blog.csdn.net/xierhacker/article/details/78772560

本文深入介绍了TensorFlow中LSTM的使用，包括tf.nn.rnn_cell.LSTMCell、tf.nn.dynamic_rnn()等关键函数和类，并通过实例演示了如何构建和训练LSTM模型，从预测sin函数到MNIST图像分类，逐步解析LSTM在深度学习中的应用。同时，文章对比了TensorFlow 1.x和2.0中LSTM API的变化，强调理解底层原理的重要性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

参考:
Module: tf.nn.rnn_cell

更新:

2017.12.25
增加了tf.nn.embedding_lookup 来进行embedding的内容

2018.1.14
增加tf.sequence_mask和tf.boolean_mask来对于序列的padding进行去除的内容.

2018.3.13
增加了手动调用`call` 函数实现的LSTM的网络.

2018.11.1
把全部的API从contrib形式改为了正式形式。（因为tensorflow2.0之后会弃用contrib模块）

2019.5.27
全部迁移到了tensorflow 2.0，同时暂时保留着1.0相关的内容

LSTM的理论就不多讲了,对于理论不是很熟悉的童鞋转到:
深度学习笔记七：循环神经网络RNN(基本理论)
深度学习笔记八：长短时记忆网络LSTM(基本理论)
来复习一下基本的理论.这里要知道的是,在深度学习的实践里面,必须先把理论给弄懂了才方便写代码的,LSTM更加是,所以务必把基础打好.不然在代码中很多地方为什么那么写都不知道.理论搞定之后,很重要的一点就是实践上面怎么使用LSTM了,估计很多人在使用tensorflow写LSTM的时候走了弯路.花了很多时间才弄清楚一点.不是因为LSTM有多难(在时序和多层次上考虑,其实也还是有点抽象的),而是不知道常见的结构可以怎么定义怎么写出来.对于初学者是很不友好的.
所以,本文先直接给出API文档里面常用的几个类和函数,然后写一些玩具案例,虽然案例是玩具,但是全部消化的话,不说精通,入门绝对是够了.

tensorflow 1.x

一.重要函数和类

这节主要就是说一下tensorflow里面在LSTM中比较常用的API了,毕竟是砖头,弄清楚肯定是有益处的.
这里先列一下

tf.nn.rnn_cell.LSTMCell
tf.nn.rnn_cell.MultiRNNCell
tf.nn.rnn_cell.LSTMStateTuple
tf.nn.rnn_cell.ResidualWrapper()
tf.nn.rnn_cell.DropoutWrapper
tf.nn.dynamic_rnn()
tf.nn.bidirectional_dynamic_rnn()
tf.sequence_mask()
tf.boolean_mask()

既然说到这里,那这里还说一个与词向量有关的常见函数,后面一并讲解.

tf.nn.embedding_lookup()

这里只说最基本的够用的. 当然还有几个这里没有列出来的,可以在最开始列出来的文档参考,等到以后升到高阶,也许会用得到.

Ⅰ.tf.nn.rnn_cell.LSTMCell

文档:tf.nn.rnn_cell.LSTMCell

BasicLSTMCell是比较基本的创建LSTM cell的一个类,首先来看一下使用的时候怎么创建一个对象吧,
构造函数为:

__init__(
	num_units,
    use_peepholes=False,
    cell_clip=None,
    initializer=None,
    num_proj=None,
    proj_clip=None,
    num_unit_shards=None,
    num_proj_shards=None,
    forget_bias=1.0,
    state_is_tuple=True,
    activation=None,
    reuse=None,
    name=None,
    dtype=None,
    **kwargs
)

参数：

num_units:LSTM cell中的units数量
use_peepholes: 要是为True 的话，表示使用diagonal/peephole连接。
cell_clip:可选，浮点值， (optional) A float value, if provided the cell state is clipped by this value prior to the cell output activation.
initializer: 可选，权重和后面投射层（projection）的矩阵权重初始化方式。
num_proj:可选，可以简单理解为一个全连接，表示投射（projection）操作之后输出的维度，要是为None的话，表示不进行投射操作。
proj_clip: (optional) A float value. If num_proj > 0 and proj_clip is provided, then the projected values are clipped elementwise to within [-proj_clip, proj_clip].
num_unit_shards:已弃用
num_proj_shards: 已弃用
forget_bias: Biases of the forget gate are initialized by default to 1 in order to reduce the scale of forgetting at the beginning of the training. Must set it manually to 0.0 when restoring from CudnnLSTM trained checkpoints.
state_is_tuple:state状态作为一个元组，今后都默认为True
activation: 内部状态的激活函数，默认是hanh
reuse: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.
name: String, the name of the layer. Layers with the same name will share weights, but to avoid mistakes we require reuse=True in such cases.
dtype: Default dtype of the layer (default of None means use the type of the first input). Required when build is called before call.

举个例子,比如你想定义一个内部节点数为128的一个Cell,就可以用下面的语句,

import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers.python.layers import initializers

lstm_cell=tf.nn.rnn_cell.LSTMCell(
    num_units=128,
    use_peepholes=True,
    initializer=initializers.xavier_initializer(),
    num_proj=64,
    name="LSTM_CELL"
)

print("output_size:",lstm_cell.output_size)
print("state_size:",lstm_cell.state_size)
print(lstm_cell.state_size.h)
print(lstm_cell.state_size.c)

结果：

output_size: 64
state_size: LSTMStateTuple(c=128, h=64)
64
128

你会发现这个构造函数里面居然没有基本的输入信息! 但是不用担心,关于输入的一些细节底层都做好了,只要在后面的环节里面给进去输入就行了.后面会继续讲到.这里还要了解一点,相对于别人把这128叫做隐藏层的节点,其实我更倾向于理解为在Cell中的128个节点,每个节点接受同样的输入向量,然后得到一个值,128个节点合起来,输出的话就是一个128维的向量.而上面的代码还经过了一次全连接操作，因此最后输出的是64维的输出。

说到这里可能就有点晕了，为什么一下128维，一下64维，因此这里提一下这个类比较重要的两个属性(当然,这个类不止这两个属性).分别是:output_size和state_size.看名字就能够猜到,output_size和state_size 分别表示的LSTM的输出尺寸和状态尺寸的. 有一个博客对这里的解释很好，可以去看一下，我就不啰嗦了。
tf.nn.dynamic_rnn的输出outputs和state含义

128个units决定了state_size 中的c就是128维的,这个很简单.这里的重点是state的格式,这里发现他是一个LSTMStateTuple的类型,别管那么多,简单当做一个tuple看待就行（但是和传统的tuple还是有区别的）.之所以是一个tuple,是因为state包含了h和c两方面的内容(这里需要知道一些LSTM的原理)。

然后这个类还有一个很重要的函数,如下
zero_state(batch_size,dtype)

作用:
这个函数主要是用来进行填零的初始化.注意,这里是初始化一个state,不是初始化整个LSTM.
参数:
batch_size: 批大小
dtype: state使用的类型.

返回一个填零的状态state

要是state_size is an int or TensorShape, then the return value is a N-D tensor of shape [batch_size x state_size] filled with zeros.
要是state_size 是一个tuple,那么返回的值是同样结构的tuple,其中每个元素都是一个2-D的tensor,他们的形状为[batch_size x s] 其中的s是state_size中各自的s .

__call__(inputs,state,scope=None)
作用:在给定的状态(state)和输入上运行RNN cell,这个方法是在一个"时间点"上面运行一次RNN的方法,是比较偏底层的一个函数,对于理解RNN的运行过程非常有帮助。
后面将会讲到的tf.nn.dynamic_rnn() 等接口就是更加高层的接口,直接把所有的运行过程都得到了.
参数:

inputs: 2-D tensor,形状为[batch_size，input_size].在实际使用的时候,你会先把数据整理成为[batch_size,time_steps_size,input_size] 的形状,所以假如当前时刻是i,那么使用的时候,直接使用[:,i,:] 作为数据传入就行了.
state: 要是self.state_size 是一个整形 ,那么这个参数应该当是一个形状为 [batch_size，self.state_size] 的tensor,否则,要是self.state_size 是一个整数的元组,那么这个应当是一个形状为[batch_size，s] 的元组,其中s在self.state_size 中.
scope: 这个创建的子图的变量域(VariableScope),默认是类名.

返回值:
A pair containing:
Output: 一个形状为[batch_size，self.output_size] 的2-D tensor
New state:新的state,和之前的state结构一样.

然后还有一些其他的方法和属性,这里不讲了,在真的有需求的时候可以参照官方文档.

Ⅱ.tf.nn.rnn_cell.MultiRNNCell

前面的类可以定义一个一层的LSTM,那么怎么定义多层的LSTM类呢? 这个类主要的作用是把单层LSTM结合为多层的LSTM.
首先来看他的构造函数是怎样的.
__init__( cells,state_is_tuple=True)

参数:
cells:一个列表,里面是你想叠起来的RNNCells,
state_is_tuple:要是是True 的话, 以后都是默认是True了，因此这个参数不用管。接受和返回的state都是n-tuple,其中n = len(cells).

然后还有一些其他的函数和属性都和前面的BasicLSTMCell差不多.但是这里还是要说一下在这里,他的两个属性output_size和state_size 会变成怎么样的形式.下面举一个例子:

import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers.python.layers import initializers

lstm_cell_1=tf.nn.rnn_cell.LSTMCell(
    num_units=128,
    use_peepholes=True,
    initializer=initializers.xavier_initializer(),
    num_proj=64,
    name="LSTM_CELL_1"
)

lstm_cell_2=tf.nn.rnn_cell.LSTMCell(
    num_units=128,
    use_peepholes=True,
    initializer=initializers.xavier_initializer(),
    num_proj=64,
    name="LSTM_CELL_2"
)


lstm_cell_3=tf.nn.rnn_cell.LSTMCell(
    num_units=128,
    use_peepholes=True,
    initializer=initializers.xavier_initializer(),
    num_proj=64,
    name="LSTM_CELL_3"
)

multi_cell=tf.nn.rnn_cell.MultiRNNCell(cells=[lstm_cell_1,lstm_cell_2,lstm_cell_3])


print("output_size:",multi_cell.output_size)
print("state_size:",type(multi_cell.state_size))
print("state_size:",multi_cell.state_size)

#需要先索引到具体的那层cell，然后取出具体的state状态
print(multi_cell.state_size[0].h)
print(multi_cell.state_size[0].c)

结果:

output_size: 64
state_size: <class 'tuple'>
state_size: (LSTMStateTuple(c=128, h=64), LSTMStateTuple(c=128, h=64), LSTMStateTuple(c=128, h=64))
64
128

这里首先建立了3层LSTM,然后使用MultiRNNCell 的构造函数把他们堆叠在一起,所以结果中的属性output_size为64,就是最后那层的projection_num数量了.这个比较简单,重要的是state_size的样式,可以看到是一个tuple里面,然后又有3个LSTMStateTuple对象,其实这里也可以看出来了,就是每层的LSTMStateTuple属性放到了一个大的tuple里面.
这里还是非常重要的.之后各种需要state 的地方可能涉及到state的转换.要是这里不清楚,到时候就不好转换了.

Ⅲ.tf.nn.dynamic_rnn()

这个函数的作用就是通过指定的RNN Cell来展开计算神经网络.
他的构造函数如下:
dynamic_rnn(cell,inputs,sequence_length=None,initial_state=None,dtype=None,parallel_iterations=None,swap_memory=False,time_major=False,scope=None)

对于dynamic_rnn来说每个batch的序列长度都是一样的（不足的话自己要去padding），这个函数会根据 sequence_length 中止计算.同时dynamic_rnn是动态生成graph的

参数:

cell: RNNCell的对象.
inputs: RNN的输入,当time_major == False (default) 的时候,必须是形状为 [batch_size, max_time, ...] 的tensor, 要是 time_major == True 的话, 必须是形状为 [max_time, batch_size, ...] 的tensor. 前面两个维度应该在所有的输入里面都应该匹配.
sequence_length: 可选,一个int32/int64类型的vector,他的尺寸是[batch_size]. 对于最后结果的正确性,这个还是非常有用的.因为给他具体每一个序列的长度,能够精确的得到结果,排除了之前为了把所有的序列弄成一样的长度padding造成的不准确.
initial_state: 可选,RNN的初始状态. 要是cell.state_size 是一个整形,那么这个参数必须是一个形状为 [batch_size, cell.state_size] 的tensor. 要是cell.state_size 是一个tuple, 那么这个参数必须是一个tuple,其中元素为形状为[batch_size, s] 的tensor,s为cell.state_size 中的各个相应size.
dtype: 可选,表示输入的数据类型和期望输出的数据类型.当初始状态没有被提供或者RNN的状态由多种形式构成的时候需要显示指定.
parallel_iterations: 默认是32,表示的是并行运行的迭代数量(Default: 32). 有一些没有任何时间依赖的操作能够并行计算,实际上就是空间换时间和时间换空间的折中,当value远大于1的时候,会使用的更多的内存但是能够减少时间,当这个value值很小的时候,会使用小一点的内存,但是会花更多的时间.
swap_memory: Transparently swap the tensors produced in forward inference but needed for back prop from GPU to CPU. This allows training RNNs which would typically not fit on a single GPU, with very minimal (or no) performance penalty.
time_major: 规定了输入和输出tensor的数据组织格式,如果 true, tensor的形状需要是[max_time, batch_size, depth]. 若是false, 那么tensor的形状为[batch_size, max_time, depth]. 要是使用time_major = True 的话,会更加高效率一点,因为避免了在RNN计算的开始和结束的时候对于矩阵的转置 ,然而,大多数的tensorflow数据格式都是采用的以batch为主的格式,所以这里也默认采用以batch为主的格式.
scope: 子图的scope名称,默认是"rnn"

返回（非常重要）:
返回(outputs, state)形式的结果对,其中：

outputs: 表示RNN的输出隐状态h，就是所有时间步的h，要是time_major == False (default),那么这个tensor的形状为[batch_size, max_time, cell.output_size],要是time_major == True, 这个Tensor的形状为[max_time, batch_size, cell.output_size]. 这里需要注意一点，要是是双向LSTM，那么outputs就会是一个tuple，其中两个元素分别表示前向的outputs和反向的outputs，后面讲到双向LSTM会详细说这个内容。
state: 最终时间步的states,要是单向网络，假如有K层，states就是一个元组，里面包含K（层数）个LSTMStateTuple，分别代表这些层最终的状态信息。要是是双向网络，那么还是元组，元组里面又是两个小元组分别表示前向的states和后向的states。相应的小元组里面就是每一层的最终时刻的states信息。

例1:单层lstm

import tensorflow as tf
import numpy as np

inputs = tf.placeholder(np.float32, shape=(32,40,5)) # 32 是 batch_size
lstm_cell_1 = tf.contrib.rnn.BasicLSTMCell(num_units=128)
#lstm_cell_2 = tf.contrib.rnn.BasicLSTMCell(num_units=256)
#lstm_cell_3 = tf.contrib.rnn.BasicLSTMCell(num_units=512)
#多层lstm_cell
#lstm_cell=tf.contrib.rnn.MultiRNNCell(cells=[lstm_cell_1,lstm_cell_2,lstm_cell_3])

print("output_size:",lstm_cell_1.output_size)
print("state_size:",lstm_cell_1.state_size)
#print(lstm_cell.state_size.h)
#print(lstm_cell.state_size.c)
output,state=tf.nn.dynamic_rnn(
    cell=lstm_cell_1,
    inputs=inputs,
    dtype=tf.float32
)

print("output.shape:",output.shape)
print("len of state tuple",len(state))
print("state.h.shape:",state.h.shape)
print("state.c.shape:",state.c.shape)

结果:
这里写图片描述

例二:多层LSTM

import tensorflow as tf
import numpy as np

inputs = tf.placeholder(np.float32, shape=(32,40,5)) # 32 是 batch_size
lstm_cell_1 = tf.contrib.rnn.BasicLSTMCell(num_units=128)
lstm_cell_2 = tf.contrib.rnn.BasicLSTMCell(num_units=256)
lstm_cell_3 = tf.contrib.rnn.BasicLSTMCell(num_units=512)
#多层lstm_cell
lstm_cell=tf.contrib.rnn.MultiRNNCell(cells=[lstm_cell_1,lstm_cell_2,lstm_cell_3])

print("output_size:",lstm_cell.output_size)
print("state_size:",lstm_cell.state_size)
#print(lstm_cell.state_size.h)
#print(lstm_cell.state_size.c)
output,state=tf.nn.dynamic_rnn(
    cell=lstm_cell,
    inputs