Notes on Reading the seq2seq Source Code (TensorFlow 1.0)

AprilNing

Article link: https://blog.csdn.net/u014025868/article/details/73350648

seq2seq homepage: https://google.github.io/seq2seq/
seq2seq source code: https://github.com/google/seq2seq

Source Code Walkthrough

Note: all code snippets below are excerpts.

How the data is read

train.py

def create_experiment(output_dir):
  """
  Creates a new Experiment instance.

  Args:
    output_dir: Output directory for model checkpoints and summaries.
  """
  ......
  # Training data input pipeline
  train_input_pipeline = input_pipeline.make_input_pipeline_from_def(
      def_dict=FLAGS.input_pipeline_train,
      mode=tf.contrib.learn.ModeKeys.TRAIN)
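  # input_pipeline defines one base class and three subclasses; each subclass implements a different way of reading data.
  # The call returns the pipeline named in the definition; here that is an instance of ParallelTextInputPipeline.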

  # Create training input function
  # create_input_fn:Creates an input function that can be used with tf.learn estimators.
  train_input_fn = training_utils.create_input_fn(
      pipeline=train_input_pipeline,
      batch_size=FLAGS.batch_size,
      bucket_boundaries=bucket_boundaries,
      scope="train_input_fn")

  # The code below reads the development data in the same way; it differs in the mode (and passes shuffle=False, num_epochs=1).
  # Development data input pipeline
  dev_input_pipeline = input_pipeline.make_input_pipeline_from_def(
      def_dict=FLAGS.input_pipeline_dev,
      mode=tf.contrib.learn.ModeKeys.EVAL,
      shuffle=False, num_epochs=1)

  # Create eval input function
  eval_input_fn = training_utils.create_input_fn(
      pipeline=dev_input_pipeline,
      batch_size=FLAGS.batch_size,
      allow_smaller_final_batch=True,
      scope="dev_input_fn")

  ......
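
The comment above says that one call turns a pipeline definition into an instance of one of several subclasses. A minimal sketch of that factory pattern follows, assuming the definition is a dict with a "class" name and a "params" dict. Every class, key, and file name here is a hypothetical stand-in for illustration, not the seq2seq implementation.

```python
class InputPipeline:
    """Hypothetical stand-in for the base pipeline class."""

    def __init__(self, params, mode, **kwargs):
        self.params = params
        self.mode = mode
        self.kwargs = kwargs  # e.g. shuffle=False, num_epochs=1 for dev data

    def make_data_provider(self):
        raise NotImplementedError


class ParallelTextInputPipeline(InputPipeline):
    """Reads aligned source/target text files (illustrative only)."""

    def make_data_provider(self):
        return "parallel text: {} / {}".format(
            self.params["source_files"], self.params["target_files"])


class TFRecordInputPipeline(InputPipeline):
    """Reads serialized records (illustrative only)."""

    def make_data_provider(self):
        return "tfrecords: {}".format(self.params["files"])


# Registry mapping a class name in the definition to the class itself.
PIPELINES = {cls.__name__: cls
             for cls in (ParallelTextInputPipeline, TFRecordInputPipeline)}


def make_pipeline_from_def(def_dict, mode, **kwargs):
    """Build the pipeline named by def_dict["class"] with def_dict["params"]."""
    if "class" not in def_dict:
        raise ValueError("pipeline definition must contain a 'class' key")
    pipeline_class = PIPELINES[def_dict["class"]]
    return pipeline_class(params=def_dict.get("params", {}), mode=mode, **kwargs)


train_def = {"class": "ParallelTextInputPipeline",
             "params": {"source_files": ["train.src"],
                        "target_files": ["train.tgt"]}}
train_pipeline = make_pipeline_from_def(train_def, mode="train")
dev_pipeline = make_pipeline_from_def(
    train_def, mode="eval", shuffle=False, num_epochs=1)
print(type(train_pipeline).__name__, dev_pipeline.kwargs)
```

Because the subclass is chosen by name at run time, the training script never has to know which reader it is using; it only calls make_data_provider on whatever it gets back.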

utils.py (the training_utils module used above)

def create_input_fn(pipeline,
                    batch_size,
                    bucket_boundaries=None,
                    allow_smaller_final_batch=False,
                    scope=None):
  """Creates an input function that can be used with tf.learn estimators.
    Note that you must pass "factory functions" for both the data provider and
    featurizer to ensure that everything will be created in the same graph.

  Args:
    pipeline: An instance of `seq2seq.data.InputPipeline`.
    batch_size: Create batches of this size. A queue to hold a
      reasonable number of batches in memory is created.
    bucket_boundaries: int list, increasing non-negative numbers.
      If None, no bucket is performed.

  Returns:
    An input function that returns `(feature_batch, labels_batch)`
    tuples when called.
  """

  def input_fn():
    """Creates features and labels.
    """

    with tf.variable_scope(scope or "input_fn"):
      data_provider = pipeline.make_data_provider()
      features_and_labels = pipeline.read_from_data_provider(data_provider)
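      '''
      As noted above, input_pipeline defines three classes, and each one implements make_data_provider.
      make_data_provider implements the actual data reading, including some token processing such as appending the SEQUENCE_END marker.
      The low-level reading itself still goes through the tensorflow.contrib.slim.python.slim.data interface.
      tf.RandomShuffleQueue: A queue implementation that dequeues elements in a random order.
      read_from_data_provider is implemented in the parent class InputPipeline.
      '''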

      if bucket_boundaries:
        _, batch = tf.contrib.training.bucket_by_sequence_length(
            input_length=features_and_labels["source_len"],
            bucket_boundaries=bucket_boundaries,
            tensors=features_and_labels,
            batch_size=batch_size,
            keep_input=features_and_labels["source_len"] >= 1,
            dynamic_pad=True,
            capacity=5000 + 16 * batch_size,
            allow_smaller_final_batch=allow_smaller_final_batch,
            name="bucket_queue")
      else: 
        batch = tf.train.batch(
            tensors=features_and_labels,
            enqueue_many=False,
            batch_size=batch_size,
            dynamic_pad=True,
            capacity=5000 + 16 * batch_size,
            allow_smaller_final_batch=allow_smaller_final_batch,
            name="batch_queue")
        # Lazy bucketing of input tensors according to `which_bucket`
      '''
      tf.contrib.training.bucket_by_sequence_length() and bucket() read the data in batches of batch_size.
      This method calls `tf.contrib.training.bucket` under the hood, after first subdividing the 
      bucket boundaries into separate buckets and identifying which bucket the given `input_length` 
      belongs to.  See the documentation for `which_bucket` for details of the other arguments.

      Interestingly, reading the data involves quite a few queues. Take len(bucket_boundaries) + 1 = 5 buckets as an example:
      1. A tf.RandomShuffleQueue takes train_source and train_target as input, reads them in parallel, and shuffles on dequeue.
      2. Five tf.FIFOQueue instances: each element dequeued from (1) is placed into the queue that matches its length.
      3. top_queue=tf.PaddingFIFOQueue, because if we use allow_smaller_final_batch, shapes will 
        contain Nones in their first entry; as a result, a regular FIFOQueue would die when being 
        passed shapes that are not fully defined.

      Each bucket has its own queue.  When a bucket contains `batch_size` elements, this minibatch is 
      pushed onto a top queue.  The tensors returned from this function are the result of 
      dequeueing the next minibatch from this top queue.
      '''
      # Separate features and labels
      features_batch = {k: batch[k] for k in pipeline.feature_keys}
      if set(batch.keys()).intersection(pipeline.label_keys):
        labels_batch = {k: batch[k] for k in pipeline.label_keys}
      else:
        labels_batch = None

      return features_batch, labels_batch

  return input_fn
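
The queue arrangement described in the comments can be hard to picture, so here is a minimal pure-Python sketch of the same idea: assign each example to a bucket by length, collect per-bucket lists, and emit a padded batch whenever a bucket reaches batch_size. It assumes a boundary value belongs to the upper bucket, and it drops leftover partial batches. The names which_bucket and bucket_batches and the "<pad>" token are illustrative, and none of this is TensorFlow's queue-based implementation.

```python
import bisect
from collections import defaultdict


def which_bucket(length, bucket_boundaries):
    """Return the bucket index for a sequence of the given length.

    With boundaries [10, 20, 30, 40] there are 5 buckets:
    [0, 10), [10, 20), [20, 30), [30, 40), [40, inf).
    """
    return bisect.bisect_right(bucket_boundaries, length)


def bucket_batches(examples, bucket_boundaries, batch_size):
    """Yield (bucket_id, padded_batch) each time a bucket fills up."""
    buckets = defaultdict(list)  # one list per bucket, playing the role of a queue
    for tokens in examples:
        if len(tokens) < 1:  # mirrors keep_input=source_len >= 1
            continue
        bucket_id = which_bucket(len(tokens), bucket_boundaries)
        buckets[bucket_id].append(tokens)
        if len(buckets[bucket_id]) == batch_size:
            batch = buckets.pop(bucket_id)
            max_len = max(len(t) for t in batch)
            # Pad only up to the longest sequence in this batch (dynamic padding).
            yield bucket_id, [t + ["<pad>"] * (max_len - len(t)) for t in batch]


examples = [["tok"] * n for n in (3, 12, 5, 15, 45, 8)]
batches = list(bucket_batches(examples, [10, 20, 30, 40], batch_size=2))
for bucket_id, batch in batches:
    print(bucket_id, [len(row) for row in batch])
```

Grouping by length keeps the padding per batch small: the length-3 and length-5 examples are padded to 5 rather than to the length of the longest example in the whole data set.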

Run modes

train.py

def main(_argv):
  """The entrypoint for the script"""
  ......
  learn_runner.run(
      experiment_fn=create_experiment,
      output_dir=FLAGS.output_dir,
      schedule=FLAGS.schedule)
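  # create_experiment calls the graph-building functions, covering the model state, inputs, outputs, and so on.
  # schedule determines how the graph is run, mainly train and evaluate. In a distributed setup, the ps, master, and worker tasks each get a different schedule; this is handled in tensorflow/contrib/learn/python/learn/learn_runner.py.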

learn_runner.py

def run(experiment_fn, output_dir, schedule=None):
  """Make and run an experiment."""
  ......
  # Call the builder
  experiment = experiment_fn(output_dir=output_dir)
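  '''
  This function does several things:
  1. Initializes model_fn. For example, if the model class is AttentionSeq2Seq, it returns a function that constructs and calls AttentionSeq2Seq:
    return model(features, labels, params)
  2. Defines and initializes the estimator instance. The Estimator class implements TensorFlow's most basic trainer/evaluator.
    estimator = tf.contrib.learn.Estimator(model_fn=model_fn, model_dir=output_dir,......)
  3. Defines and initializes the experiment instance. The Experiment class holds all the information needed to train a model; once an instance has been created, it knows how to train.
    experiment = PatchedExperiment(estimator=estimator, train_input_fn=train_input_fn,......)
  Summary: experiment only calls the estimator's interfaces to train and evaluate the model, so the basic trainer/evaluator lives in the
  Estimator class, while the model itself (encoder, decoder, loss, and so on) is defined in the seq2seq project; see the project code for details.
  '''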
  # Get the schedule
  config = experiment.estimator.config
  schedule = schedule or _get_default_schedule(config)
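  # Note this line: if schedule is passed in as an argument, the expression after `or` is never evaluated, so the distributed default schedule is never selected.
  # _get_default_schedule returns a different schedule for ps, master, and worker tasks; see the function body.
  # MASTER: returns 'train_and_evaluate'; only the master runs evaluation.
  # PS: returns 'run_std_server'. Starts a TensorFlow server and joins the serving thread.
  # WORKER: returns 'train'.
  # Once the schedule has been chosen, the code below executes it.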

  # Execute the schedule
  task = getattr(experiment, schedule)
  return task()
......
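
The two lines that matter in run() are the `schedule or ...` fallback and the getattr dispatch. The sketch below reproduces just that control flow with a toy class, to show why an explicitly passed schedule bypasses the per-task default. ToyExperiment and default_schedule are hypothetical stand-ins, and the task-type mapping is simplified from the comments above.

```python
class ToyExperiment:
    """Hypothetical stand-in for an Experiment; each schedule is a method."""

    def train(self):
        return "train"

    def train_and_evaluate(self):
        return "train_and_evaluate"

    def run_std_server(self):
        return "run_std_server"

    def continuous_train_and_eval(self):
        return "continuous_train_and_eval"


def default_schedule(task_type):
    """Simplified version of the per-task mapping described in the notes."""
    return {"master": "train_and_evaluate",
            "ps": "run_std_server",
            "worker": "train"}[task_type]


def run(experiment, task_type, schedule=None):
    # A non-empty schedule short-circuits `or`, so default_schedule is not called.
    schedule = schedule or default_schedule(task_type)
    # Look the schedule up by name and call it as a bound method.
    task = getattr(experiment, schedule)
    return task()


experiment = ToyExperiment()
print(run(experiment, "ps"))
print(run(experiment, "ps", schedule="continuous_train_and_eval"))
```

In the second call the parameter-server task runs the training schedule instead of starting a server, which is the behavior the note warns about when a schedule is always passed on the command line.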

experiment.py

...
    config = self._estimator.config
    if (config.environment != run_config.Environment.LOCAL and
        config.environment != run_config.Environment.GOOGLE and
        config.cluster_spec and config.master):
      self._start_server()
    # config.environment must be set to 'CLOUD' (neither LOCAL nor GOOGLE) for the server to be started
...
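
The condition above can be read as a small predicate. The sketch below restates it with plain strings so each case can be checked by hand; the constants are assumed stand-ins for the run_config.Environment values, and should_start_server is a hypothetical helper, not part of TensorFlow.

```python
# Assumed stand-ins for run_config.Environment.LOCAL / CLOUD / GOOGLE.
LOCAL, CLOUD, GOOGLE = "local", "cloud", "google"


def should_start_server(environment, cluster_spec, master):
    """True only for a non-LOCAL, non-GOOGLE environment with a cluster and a master."""
    return bool(environment != LOCAL and
                environment != GOOGLE and
                cluster_spec and master)


cluster = {"ps": ["host0:2222"], "worker": ["host1:2222"]}  # illustrative addresses
print(should_start_server(CLOUD, cluster, "grpc://host1:2222"))
print(should_start_server(LOCAL, cluster, "grpc://host1:2222"))
print(should_start_server(CLOUD, {}, "grpc://host1:2222"))
```

Only the first call satisfies every clause, so defining a cluster alone is not enough: the environment has to be set as well.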

Model definition

infer

basic_seq2seq.py

  def _decode_infer(self, decoder, bridge, _encoder_output, features, labels):
    """Runs decoding in inference mode"""
    batch_size = self.batch_size(features, labels)
    if self.use_beam_search:
      batch_size = self.params["inference.beam_search.beam_width"]
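    '''
    features has shape batch_size * source_len * vocab_size: 32 * 50 * 50003
    With beam search, batch_size is set to beam_width; otherwise it stays batch_size.
    My guess at the main reason: these beam_width vectors hold all the candidate results, and the best one is picked at the end as the final result.
    '''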
    target_start_id = self.target_vocab_info.special_vocab.SEQUENCE_START
    helper_infer = tf_decode_helper.GreedyEmbeddingHelper(
        embedding=self.target_embedding,
        start_tokens=tf.fill([batch_size], target_start_id),
        end_token=self.target_vocab_info.special_vocab.SEQUENCE_END)
    decoder_initial_state = bridge()
    return decoder(decoder_initial_state, helper_infer)
    # Here the decoder is the beam_search_decoder
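
To illustrate the guess in the comment above (beam_width slots hold the candidate results, and the best one is chosen at the end), here is a toy beam search in plain Python. The transition table LOGP is invented for the example, and the function is a generic sketch of the technique, not the seq2seq beam_search_decoder.

```python
import math

# Hypothetical next-token log-probabilities: LOGP[previous_token][next_token].
LOGP = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"a": math.log(0.1), "b": math.log(0.4), "</s>": math.log(0.5)},
    "b": {"a": math.log(0.05), "b": math.log(0.05), "</s>": math.log(0.9)},
}


def beam_search(beam_width, max_len=5):
    """Return (log_prob, tokens) of the best hypothesis found."""
    beams = [(0.0, ["<s>"])]  # live hypotheses as (score, tokens)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, logp in LOGP[tokens[-1]].items():
                candidates.append((score + logp, tokens + [token]))
        # Keep only the beam_width best candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            if cand[1][-1] == "</s>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached the end token.
    return max(finished or beams, key=lambda c: c[0])


greedy = beam_search(beam_width=1)
wider = beam_search(beam_width=2)
print(greedy[1], round(math.exp(greedy[0]), 2))
print(wider[1], round(math.exp(wider[0]), 2))
```

With a width of 1 the search commits to "a" at the first step and ends with probability 0.30; with a width of 2 it also keeps "b" alive and finds the sequence with probability 0.36. That is the benefit of carrying beam_width hypotheses instead of one.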

Notes

tf.contrib.learn.ModeKeys —> model_fn()
TRAIN: training mode —> fit()
EVAL: evaluation mode —> evaluate()
INFER: inference mode —> predict()
Runtime parallelism. REF: https://www.tensorflow.org/programmers_guide/faq
Individual ops are parallelized using multiple cores on a CPU and multiple threads on a GPU;
Independent nodes in TensorFlow can run in parallel on multiple devices/GPUs. REF: https://www.tensorflow.org/tutorials/using_gpu
The Session API allows multiple concurrent steps (i.e. calls to tf.Session.run in parallel). This enables the runtime to get higher throughput, if a single step does not use all of the resources in your computer.

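config.gpu_options.allow_growth = True  # Start with a small GPU memory allocation and grow it as needed; allocated memory is not released.
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # Fraction of each GPU's memory that the process may use.
allow_soft_placement=True  # If the specified device is not found, True lets the op run on another device.
log_device_placement=True  # Whether to print the device each op is assigned to at run time.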
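How seq2seq runs in distributed mode
seq2seq offers no single-machine multi-GPU mode; to use multiple GPUs, the only option is TensorFlow's own distributed runtime.
seq2seq defines a subclass of tf.contrib.learn.Experiment that implements a new schedule: continuous_train_and_eval().
The first problem is line 86 of tf.contrib.learn's learn_runner.py: schedule = schedule or _get_default_schedule(config). If schedule is already set, the function after `or` is never called. So even when a cluster is defined, the cluster is not used and training is not distributed. To run the seq2seq project in distributed mode, the schedule definition and the cluster definition have to be changed together.
So why define a new schedule at all? The answer is given in question 4.
Two differences between continuous_train_and_eval() and train_and_evaluate()
Difference 1: continuous_train_and_eval alternates training and evaluation, for example evaluating after every 1000 training steps. train_and_evaluate does not; it trains for train_steps in one go, with no smaller training iterations.
Difference 2: continuous_train_and_eval releases the resources held by training (memory, for example) before each evaluation, whereas train_and_evaluate does not and holds two sets of resources. continuous_train_and_eval also saves a checkpoint after every training iteration, so it keeps more checkpoints.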
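
The first difference is mostly about loop structure, which a short sketch makes concrete. The code below follows the description in these notes, not the TensorFlow implementation: one function trains for all steps and evaluates once, the other trains in chunks and evaluates after each chunk. ToyTrainer and the parameter name train_steps_per_iteration are hypothetical.

```python
class ToyTrainer:
    """Hypothetical trainer that only counts steps and records calls."""

    def __init__(self):
        self.global_step = 0
        self.log = []

    def train(self, steps):
        self.global_step += steps
        self.log.append(("train", self.global_step))

    def evaluate(self):
        self.log.append(("eval", self.global_step))
        return self.global_step


def train_and_evaluate(trainer, train_steps):
    """Train for all steps at once, then evaluate a single time."""
    trainer.train(train_steps)
    return [trainer.evaluate()]


def continuous_train_and_eval(trainer, train_steps, train_steps_per_iteration):
    """Alternate: train a chunk of steps, evaluate, and repeat until done."""
    results = []
    while trainer.global_step < train_steps:
        remaining = train_steps - trainer.global_step
        trainer.train(min(train_steps_per_iteration, remaining))
        results.append(trainer.evaluate())
    return results


one_shot = train_and_evaluate(ToyTrainer(), train_steps=3000)
alternating = continuous_train_and_eval(
    ToyTrainer(), train_steps=3000, train_steps_per_iteration=1000)
print(one_shot)
print(alternating)
```

The alternating version produces an evaluation at steps 1000, 2000, and 3000, which is also where the extra checkpoints mentioned in the second difference would come from.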

References

Getting GPU information
The script below prints basic information about the GPUs that are currently accessible.
————————————————————————————————————————————
from tensorflow.python.client import device_lib

def get_available_gpus():
    # List all local devices and keep only the names of the GPUs.
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

print(get_available_gpus())
————————————————————————————————————————————

Questions
