TensorFlow Source Code Analysis | A Walkthrough of the Official Recurrent Neural Networks Tutorial Code

Latest recommended article published on 2021-06-27 10:28:08

taoqick

From: https://zhuanlan.zhihu.com/p/33286899

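This post is a set of notes taken while working through the official TensorFlow tutorials. It analyzes the source code of the example used in the Recurrent Neural Networks tutorial. The code lives in the TensorFlow Models repository under models/tutorials/rnn/ptb/, uses the PTB dataset from Tomas Mikolov's webpage, and is an implementation of the paper "Recurrent Neural Network Regularization". After a bit more than a day of reading, my impression is that something fairly simple was written in a complicated way by Google.

For background, see the official tutorial or the Chinese translation by Geek Academy (极客学院). Below I go straight to the code: I introduce each piece step by step following my own line of thought, then tie the pieces together.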

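1. Model Parameters

Every model has parameters. Here they are stored in the class SmallConfig:

class SmallConfig(object):
  """Small config."""
  init_scale = 0.1       # parameters are initialized uniformly in [-init_scale, init_scale]
  learning_rate = 1.0    # initial learning rate; it changes during training in this example
  max_grad_norm = 5      # threshold for gradient clipping, which guards against exploding gradients
  num_layers = 2         # number of stacked LSTM layers
  num_steps = 20         # number of time steps the network is unrolled for (sequence length)
  hidden_size = 200      # hidden layer size, equal to the state size; here also the word-embedding size
  max_epoch = 4          # the learning rate stays constant for the first max_epoch epochs
  max_max_epoch = 13     # total number of training epochs
  keep_prob = 1.0        # dropout keep probability (1.0 means no dropout)
  lr_decay = 0.5         # decay factor applied to the learning rate for each epoch after max_epoch
  batch_size = 20        # number of sequences per batch
  vocab_size = 10000     # vocabulary size
  rnn_mode = BLOCK       # which LSTM cell implementation to use (BLOCK is a constant in the source); not covered here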

2. Data Processing

Data processing starts with one function, which reads the raw data and converts words to integer ids:

def ptb_raw_data(data_path=None):
  # Paths of the training, validation, and test data
  train_path = os.path.join(data_path, "ptb.train.txt")
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")

  word_to_id = _build_vocab(train_path) # build the {word: id} dict from the training data
  # The next three lines convert the training, validation, and test data into lists [id1, id2, ...]
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
  vocabulary = len(word_to_id)
  return train_data, valid_data, test_data, vocabulary

_file_to_word_ids and the other helpers are simple. They replace every newline with <eos> and every word with its id.

For example, if the raw data is:

there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes

it first becomes:

there is no asbestos in our products now <eos> neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes <eos>

and then:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 12 16 17 18 19 20 21 22 23 19 12 24 25 8

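The example above numbers words in order of appearance only for readability. The real code numbers them by frequency: the more frequent a word, the smaller its id.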
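The frequency-based numbering can be sketched in plain Python. This is only an illustration operating on an in-memory string, not the reader's actual code: the function names build_vocab and text_to_word_ids are made up here, and breaking ties alphabetically is an assumption chosen to make the numbering deterministic.

```python
import collections


def build_vocab(text):
    """Map each word to an id, most frequent word first (ties broken alphabetically)."""
    words = text.replace("\n", " <eos> ").split()
    counter = collections.Counter(words)
    # Sort by descending count, then alphabetically, so the numbering is deterministic.
    count_pairs = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
    return {word: idx for idx, (word, _) in enumerate(count_pairs)}


def text_to_word_ids(text, word_to_id):
    """Convert text to a flat list of ids, skipping words missing from the vocabulary."""
    words = text.replace("\n", " <eos> ").split()
    return [word_to_id[w] for w in words if w in word_to_id]


sample = "the cat sat\nthe dog sat\nthe end\n"
word_to_id = build_vocab(sample)
ids = text_to_word_ids(sample, word_to_id)
print(word_to_id)
print(ids)
```

In this toy corpus "<eos>" and "the" both occur three times, so they receive the two smallest ids, and the rare words come last.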

3. The Model

The model is represented by the class PTBModel:

class PTBModel(object):
  """The PTB model."""

A model has inputs and outputs; the inputs are passed in through the arguments of __init__:

  def __init__(self, is_training, config, input_):
    self._is_training = is_training
    self._input = input_
    self._cell = None
    self.batch_size = input_.batch_size
    self.num_steps = input_.num_steps
    size = config.hidden_size
    vocab_size = config.vocab_size

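The third argument, input_, is an instance of a class (PTBInput, shown later) with two important attributes, input_.input_data and input_.targets, which are the model's x and y.

Next, embedding holds all word vectors, so its shape must be [vocabulary size, vector size]. Because it is large, it is placed on the CPU. tf.nn.embedding_lookup() takes the input data (word ids) and returns the corresponding word vectors.

    # Keep the embedding on the CPU: it is large and does not need to live on the GPU
    with tf.device("/cpu:0"):
      embedding = tf.get_variable("embedding", [vocab_size, size], dtype=data_type())
      # input_data.shape = [batch_size, num_steps]; inputs has shape [batch_size, num_steps, hidden_size]
      inputs = tf.nn.embedding_lookup(embedding, input_.input_data)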
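To make the shape change concrete, here is a minimal pure-Python stand-in for the lookup. It is not TensorFlow's implementation; the toy sizes and the local embedding_lookup helper are assumptions for illustration only.

```python
import random


def embedding_lookup(embedding, input_ids):
    """Replace every id in a [batch_size, num_steps] grid with its embedding row."""
    return [[embedding[word_id] for word_id in row] for row in input_ids]


vocab_size, hidden_size = 6, 4   # toy sizes; the tutorial's small config uses 10000 and 200
rng = random.Random(0)
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
             for _ in range(vocab_size)]

input_ids = [[1, 3, 2], [0, 1, 4]]   # batch_size=2, num_steps=3
inputs = embedding_lookup(embedding, input_ids)
shape = (len(inputs), len(inputs[0]), len(inputs[0][0]))
print(shape)
```

Each id is simply used as a row index into the embedding table, which is why a [batch_size, num_steps] grid of ids gains a trailing hidden_size dimension.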

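With the input data ready, feeding it to the model yields the output:

    # shape of output: [batch_size * num_steps, hidden_size]
    output, state = self._build_rnn_graph(inputs, config, is_training)

The shape of output is given in the comment. What type is state? That is explained below together with _build_rnn_graph(); for now it is enough to know that this function is the model itself: it takes the input, processes it, and returns the output.

The output is not yet usable, because its dimensions are wrong: supervised training needs a score for every word in the vocabulary at each position, and no dimension of output equals vocab_size. One more affine transformation is needed: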

    softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
    softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
    # Project output onto the vocabulary; logits.shape: [batch_size * num_steps, vocab_size]
    logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
    # Reshape logits to be a 3-D tensor for sequence loss
    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])

The final reshape of logits is there to match what tf.contrib.seq2seq.sequence_loss() expects; this function computes the loss:

    loss = tf.contrib.seq2seq.sequence_loss(
        logits=logits,
        targets=input_.targets,
        weights=tf.ones([self.batch_size, self.num_steps], dtype=data_type()),
        average_across_timesteps=False,
        average_across_batch=True)

    # Update the cost
    self._cost = tf.reduce_sum(loss)
    self._final_state = state

    if not is_training:
      return

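The third argument of tf.contrib.seq2seq.sequence_loss(), weights, is a mask. The input is a sequence rather than a scalar, and sequences are sometimes padded with zeros to a common length; the mask marks each position as valid (1) or padding (0). Here every position is valid, so the weights are all ones. The last two arguments control how the loss is averaged: with these settings it is averaged over the batch but not over time steps, so loss has one entry per time step and self._cost is their sum.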
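The averaging can be sketched without TensorFlow. The following is my own simplified reading of what the call computes with average_across_timesteps=False and average_across_batch=True, not the library's code; the helper names and the small 1e-12 term guarding against division by zero are assumptions.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def sequence_loss_per_step(logits, targets, weights):
    """Weighted loss averaged over the batch axis only; one value per time step."""
    batch_size, num_steps = len(targets), len(targets[0])
    losses = []
    for t in range(num_steps):
        weighted = sum(
            weights[b][t] * softmax_cross_entropy(logits[b][t], targets[b][t])
            for b in range(batch_size))
        total_weight = sum(weights[b][t] for b in range(batch_size))
        losses.append(weighted / (total_weight + 1e-12))
    return losses


batch_size, num_steps, vocab_size = 2, 3, 5
logits = [[[0.0] * vocab_size for _ in range(num_steps)] for _ in range(batch_size)]
targets = [[1, 2, 3], [0, 4, 2]]
weights = [[1.0] * num_steps for _ in range(batch_size)]

per_step = sequence_loss_per_step(logits, targets, weights)
cost = sum(per_step)   # plays the role of tf.reduce_sum(loss)
print(per_step, cost)
```

With all-zero logits the model is uniform over the vocabulary, so every per-step loss is about log(vocab_size) and the cost is about num_steps times that value.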

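When evaluating rather than training, construction stops here. For training, the gradients still have to be computed. tf.clip_by_global_norm() performs gradient clipping: its first argument is the list of gradients and its second is the clipping threshold. The first return value is the list of clipped gradients and the second is the global norm, i.e. the square root of the sum of squares of all gradient entries. The method comes from a paper. Each gradient is rescaled as follows: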

t_list[i] * clip_norm / max(global_norm, clip_norm)

where:

global_norm = sqrt(sum([l2norm(t)**2 for t in t_list]))

    self._lr = tf.Variable(0.0, trainable=False)
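    tvars = tf.trainable_variables()    # all variables with trainable=True

    # Clip the gradients to guard against exploding gradients. The first return value is
    # the clipped gradients; the second is the global norm (square root of the sum of
    # squares of all gradient entries).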
    grads, _ = tf.clip_by_global_norm(tf.gradients(self._cost, tvars), config.max_grad_norm)
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    self._train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=tf.train.get_or_create_global_step())

    # new_lr is a placeholder: the new learning rate is computed outside the graph and fed in through the session
    self._new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
    self._lr_update = tf.assign(self._lr, self._new_lr)
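The clipping rule t_list[i] * clip_norm / max(global_norm, clip_norm) is easy to check by hand. Below is a plain-Python sketch that represents each gradient as a flat list of floats; it mirrors the formula only and is not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]   # global norm is sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=2.5)
unchanged, _ = clip_by_global_norm(grads, clip_norm=10.0)
print(norm, clipped, unchanged)
```

When the global norm exceeds the threshold, all gradients shrink by the same factor, so their direction is preserved; when it is below the threshold, the scale is 1 and nothing changes.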

That completes the model. Next, _build_rnn_graph():

  def _build_rnn_graph(self, inputs, config, is_training):
    def make_cell():
      cell = tf.contrib.rnn.LSTMBlockCell(config.hidden_size, forget_bias=0.0)
      if is_training and config.keep_prob < 1:
        cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=config.keep_prob)
      return cell

    # With state_is_tuple=True, the accepted and returned states are n-tuples,
    # n = len(cells), whose elements are LSTMStateTuple
    cell = tf.contrib.rnn.MultiRNNCell(
        [make_cell() for _ in range(config.num_layers)], state_is_tuple=True)

    # self._initial_state is an n-tuple
    self._initial_state = cell.zero_state(config.batch_size, data_type())
    state = self._initial_state
    outputs = []
    with tf.variable_scope("RNN"):
      for time_step in range(self.num_steps):
        if time_step > 0:
          tf.get_variable_scope().reuse_variables()
        # shape of cell_output: [batch_size, self.output_size]
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
    # outputs: a list of num_steps tensors, each [batch_size, hidden_size]
    # Concatenate into [batch_size, hidden_size * num_steps], then reshape to
    # [batch_size * num_steps, hidden_size]
    output = tf.reshape(tf.concat(outputs, 1), [-1, config.hidden_size])
    return output, state

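As noted above, inputs has shape [batch_size, num_steps, hidden_size], so feeding the data in time order means iterating over the middle dimension, which is exactly num_steps; that is what inputs[:, time_step, :] does. On the same line, cell is a tf.contrib.rnn.MultiRNNCell, whose __call__ method takes the input and the state and returns: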

Output: A 2-D tensor with shape [batch_size x self.output_size].
New state: Either a single 2-D tensor, or a tuple of tensors matching the arity and shapes of state.

So cell_output has shape [batch_size, self.output_size], and output is built by collecting the cell_output tensors and reshaping them.
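The concat-then-reshape step determines which row of output belongs to which (batch row, time step) pair. The nested-list sketch below imitates tf.reshape(tf.concat(outputs, 1), [-1, hidden_size]) under the assumption of row-major layout; concat_and_reshape is a hypothetical helper written for this note.

```python
def concat_and_reshape(outputs, hidden_size):
    """Mimic tf.reshape(tf.concat(outputs, 1), [-1, hidden_size]) with nested lists."""
    batch_size = len(outputs[0])
    # concat along axis 1: each batch row becomes [step 0 vector, step 1 vector, ...]
    concatenated = [[v for step in outputs for v in step[b]]
                    for b in range(batch_size)]
    # reshape in row-major order: cut the flattened data into hidden_size chunks
    flat = [v for row in concatenated for v in row]
    return [flat[i:i + hidden_size] for i in range(0, len(flat), hidden_size)]


num_steps, batch_size, hidden_size = 3, 2, 2
# outputs[t][b] is the cell output of batch row b at time step t; the values encode (b, t)
outputs = [[[10 * b + t, 10 * b + t + 0.5] for b in range(batch_size)]
           for t in range(num_steps)]
output = concat_and_reshape(outputs, hidden_size)
for row in output:
    print(row)
```

Row b * num_steps + t of output holds the cell output of batch row b at time step t, which is the same ordering obtained by flattening a [batch_size, num_steps] targets tensor.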

4. Training

With the model parameters, data processing, and model covered, the training code is straightforward.

def main(_):

  # Load the data
  train_data, valid_data, test_data, _ = reader.ptb_raw_data(FLAGS.data_path)

  # get_config() returns the parameters described in Section 1 (Model Parameters)
  config = get_config()
  eval_config = get_config()
  eval_config.batch_size = 1
  eval_config.num_steps = 1

  with tf.Graph().as_default():
    # Parameter initializer
    initializer = tf.random_uniform_initializer(-config.init_scale, config.init_scale)

    with tf.name_scope("Train"):
      train_input = PTBInput(config=config, data=train_data, name="TrainInput")
      with tf.variable_scope("Model", reuse=None, initializer=initializer):
        m = PTBModel(is_training=True, config=config, input_=train_input)
      tf.summary.scalar("Training Loss", m.cost)
      tf.summary.scalar("Learning Rate", m.lr)

    with tf.name_scope("Valid"):
      valid_input = PTBInput(config=config, data=valid_data, name="ValidInput")
      with tf.variable_scope("Model", reuse=True, initializer=initializer):
        mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
      tf.summary.scalar("Validation Loss", mvalid.cost)

    with tf.name_scope("Test"):
      test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
      with tf.variable_scope("Model", reuse=True, initializer=initializer):
        mtest = PTBModel(is_training=False, config=eval_config, input_=test_input)

    sv = tf.train.Supervisor(logdir=FLAGS.save_path)
    
    # allow_soft_placement: whether TF may fall back to another device when the requested one does not exist
    config_proto = tf.ConfigProto(allow_soft_placement=False)
    with sv.managed_session(config=config_proto) as session:
      for i in range(config.max_max_epoch):
        # Compute the new learning rate from the epoch index i
        lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
        m.assign_lr(session, config.learning_rate * lr_decay)

        print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
        train_perplexity = run_epoch(session, m, eval_op=m.train_op, verbose=True)
        print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
        valid_perplexity = run_epoch(session, mvalid)
        print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

      test_perplexity = run_epoch(session, mtest)
      print("Test Perplexity: %.3f" % test_perplexity)

      if FLAGS.save_path:
        print("Saving model to %s." % FLAGS.save_path)
        sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)

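Here run_epoch() computes the perplexity, a standard language-model metric used to judge how good the model is (lower is better). The original post points to a blog post for the details.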
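Since run_epoch() itself is not reproduced in this post, here is a hedged sketch of the quantity it reports: the exponential of the average per-word loss. It assumes each batch contributes a cost that is already summed over num_steps, as self._cost is above; the perplexity helper is a name invented for this example.

```python
import math


def perplexity(batch_costs, num_steps):
    """exp of the average per-word loss, given one summed-over-steps cost per batch."""
    total_cost = sum(batch_costs)
    total_steps = num_steps * len(batch_costs)
    return math.exp(total_cost / total_steps)


vocab_size, num_steps = 10000, 20
# Cost of a model that assigns the same probability to every word in the vocabulary
uniform_cost = num_steps * math.log(vocab_size)
ppl = perplexity([uniform_cost] * 5, num_steps)
print(ppl)
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, which gives a useful upper reference point when reading the training logs.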

5. Other Code

A few remaining pieces of code deserve explanation. First, the input data class:

class PTBInput(object):
  """The input data."""
  def __init__(self, config, data, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.input_data, self.targets = reader.ptb_producer(data, batch_size, num_steps, name=name)

This class is simple; only the reader.ptb_producer() call on the last line needs explanation:

def ptb_producer(raw_data, batch_size, num_steps, name=None):
 
  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)    # total number of words in the corpus
    batch_len = data_len // batch_size  # number of words in each of the batch_size rows
    data = tf.reshape(raw_data[0 : batch_size * batch_len],
                      [batch_size, batch_len])  # drop the trailing words that do not fill a row

    epoch_size = (batch_len - 1) // num_steps
    
    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()

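    # The second argument of strided_slice() is begin and the third is end; both are lists.
    # The first element of each list is the start or end along dimension 0, the second
    # element the start or end along dimension 1.
    # x and y are therefore always offset by one word. For example, with batch_size=2 and
    # num_steps=4, if x=[[7,1,2,1], [9,1,4,6]] then y=[[1,2,1,5], [1,4,6,0]]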
    x = tf.strided_slice(data, [0, i * num_steps],
                         [batch_size, (i + 1) * num_steps])
    x.set_shape([batch_size, num_steps])

    y = tf.strided_slice(data, [0, i * num_steps + 1],
                         [batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y

The purpose of this function is to read raw_data iteratively, one [batch_size, num_steps] window at a time.
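The slicing logic can be tried out on plain lists. The generator below is a sketch for a single pass over the data; it leaves out the queue that the real function uses to keep producing indices, and ptb_batches is a name made up for this illustration.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs where y is x shifted one word to the right."""
    batch_len = len(raw_data) // batch_size
    # Split the corpus into batch_size contiguous rows, dropping the leftover words.
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y


batches = list(ptb_batches(list(range(11)), batch_size=2, num_steps=2))
for x, y in batches:
    print(x, y)
```

Because each row is a contiguous stretch of the corpus, consecutive batches continue where the previous one stopped, which is what allows the final LSTM state of one batch to be used as the initial state of the next.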

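6. Notes

The code above differs from the original source in a few places. For example, in Section 4 (Training) the original builds two graphs with tf.Graph().as_default(). At the end of the first graph, export_ops() saves cost, lr, new_lr, lr_update, output, initial_state, and final_state into TensorFlow collections and exports the graph to the variable metagraph. The second graph then imports all of it again. This looks entirely redundant to me; perhaps Google wanted to provide more reference code.

In addition, the original source lets you choose the model size through flags: small, medium, or large. That is why the class in Section 1 (Model Parameters) is named SmallConfig: there are also MediumConfig and LargeConfig.

The remaining differences really do not matter.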

7. TensorFlow Tips

7.1 Getting the names of all GPUs:

  from tensorflow.python.client import device_lib

  gpus = [
      x.name for x in device_lib.list_local_devices() if x.device_type == "GPU"
  ]

7.2 Placing a TensorFlow variable on the CPU:

  with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocab_size, size], dtype=data_type())
    # input_data.shape = [batch_size, num_steps]; inputs has shape [batch_size, num_steps, hidden_size]
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

7.3 Getting all variables with trainable=True

tvars = tf.trainable_variables()    # all variables with trainable=True

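7.4 Gradient clipping

    # Clip the gradients to guard against exploding gradients. The first return value is the clipped gradients; the second is the global norm (square root of the sum of squares of all gradient entries)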
    grads, _ = tf.clip_by_global_norm(tf.gradients(self._cost, tvars), config.max_grad_norm)
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    self._train_op = optimizer.apply_gradients(
        zip(grads, tvars), global_step=tf.train.get_or_create_global_step())

7.5 Learning rate decay

    lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
    lr_new = config.learning_rate * lr_decay
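To see what this formula produces, the schedule can be printed for every epoch. The sketch assumes the SmallConfig values from Section 1 (learning_rate=1.0, lr_decay=0.5, max_epoch=4, 13 epochs); learning_rate_for_epoch is a helper name introduced only for this example.

```python
def learning_rate_for_epoch(i, learning_rate=1.0, lr_decay=0.5, max_epoch=4):
    """Learning rate for the 0-based epoch index i, using the SmallConfig defaults."""
    decay = lr_decay ** max(i + 1 - max_epoch, 0.0)
    return learning_rate * decay


schedule = [learning_rate_for_epoch(i) for i in range(13)]
for epoch, lr in enumerate(schedule, start=1):
    print("Epoch: %d Learning rate: %.3f" % (epoch, lr))
```

The rate stays at its initial value for the first max_epoch epochs and is then halved once per epoch, so the last of the 13 epochs runs at 0.5 to the power 9 of the initial rate.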

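7.6 Iterating over the input data

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()

    # The second argument of strided_slice() is begin and the third is end; both are lists.
    # The first element of each list is the start or end along dimension 0, the second
    # element the start or end along dimension 1.
    # x and y are therefore always offset by one word. For example, with batch_size=2 and
    # num_steps=4, if x=[[7,1,2,1], [9,1,4,6]] then y=[[1,2,1,5], [1,4,6,0]]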
    x = tf.strided_slice(data, [0, i * num_steps],
                         [batch_size, (i + 1) * num_steps])
    x.set_shape([batch_size, num_steps])

    y = tf.strided_slice(data, [0, i * num_steps + 1],
                         [batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y
