A sentence2vec neural network implementation based on skip-thought vectors

1. Paper abstract

We describe an approach for the unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the sentences surrounding an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We then introduce a simple vocabulary expansion method to encode words not seen in the training corpus, allowing the vocabulary to grow to one million words. After training the model, we extract our vectors and evaluate them with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that produces highly generic sentence representations and performs well in practice.
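The vocabulary expansion mentioned above is, in the paper, a linear mapping learned from a pre-trained word2vec space into the encoder's word embedding space, fit on the words the two vocabularies share. A minimal NumPy sketch of that idea (all shapes and variable names here are illustrative, not taken from the repo):

import numpy as np

# X: word2vec vectors for words present in both vocabularies, shape [n, d_w2v]
# Y: the trained encoder's word embeddings for the same words, shape [n, d_rnn]
X = np.random.randn(5000, 300).astype(np.float32)   # placeholder data
Y = np.random.randn(5000, 620).astype(np.float32)   # placeholder data

# Fit W in  Y ≈ X @ W  by ordinary least squares.
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Any word with a word2vec vector, even one never seen during training,
# can now be projected into the encoder's embedding space.
unseen_word_vec = np.random.randn(300).astype(np.float32)
expanded_embedding = unseen_word_vec @ W             # shape [620]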

2. Model walkthrough

The code comes from https://github.com/tensorflow/models/tree/master/research/skip_thoughts
The model breaks down into roughly three steps: 1. build the context triples; 2. build the encoder; 3. build the decoders.
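Before diving into the code, it helps to keep the training objective from the skip-thoughts paper in mind: for a sentence triple (s_{i-1}, s_i, s_{i+1}), the encoder maps s_i to a thought vector h_i, and two decoders are trained to maximize the log-likelihood of the previous and next sentences conditioned on h_i:

\sum_{t} \log P\big(w_{i+1}^{t} \mid w_{i+1}^{<t}, h_i\big) + \sum_{t} \log P\big(w_{i-1}^{t} \mid w_{i-1}^{<t}, h_i\big)

The three steps below build exactly these pieces: the triples, the encoder that produces h_i, and the two decoders.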

2.1 Building the context triples (skip_thoughts/data/preprocess_dataset.py)

def _process_input_file(filename, vocab, stats):
  """Processes the sentences in an input file.
  Args:
    filename: Path to a pre-tokenized input .txt file.
    vocab: A dictionary of word to id.
    stats: A Counter object for statistics.
  Returns:
    processed: A list of serialized Example protos
  """
  tf.logging.info("Processing input file: %s", filename)
  processed = []

  predecessor = None  # Predecessor sentence (list of words).
  current = None  # Current sentence (list of words).
  successor = None  # Successor sentence (list of words).

  for successor_str in tf.gfile.FastGFile(filename):
    stats.update(["sentences_seen"])
    successor = successor_str.split()

    # The first 2 sentences per file will be skipped.
    if predecessor and current and successor:
      stats.update(["sentences_considered"])

      # Note that we are going to insert <EOS> later, so we only allow
      # sentences with strictly less than max_sentence_length to pass.
      if FLAGS.max_sentence_length and (
          len(predecessor) >= FLAGS.max_sentence_length or len(current) >=
          FLAGS.max_sentence_length or len(successor) >=
          FLAGS.max_sentence_length):
        stats.update(["sentences_too_long"])
      else:
        serialized = _create_serialized_example(predecessor, current, successor,
                                                vocab)
        processed.append(serialized)
        stats.update(["sentences_output"])

    predecessor = current
    current = successor

    sentences_seen = stats["sentences_seen"]
    sentences_output = stats["sentences_output"]
    if sentences_seen and sentences_seen % 100000 == 0:
      tf.logging.info("Processed %d sentences (%d output)", sentences_seen,
                      sentences_output)
    if FLAGS.max_sentences and sentences_output >= FLAGS.max_sentences:
      break

  tf.logging.info("Completed processing file %s", filename)
  return processed

Purpose: each document is split into sentences ahead of time. For a document [S1, S2, S3, S4], the sliding window yields

predecessor = [S1, S2], current = [S2, S3], successor = [S3, S4]

and every sentence is then converted into word ids using the vocabulary.
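Below is a minimal standalone sketch (not repo code; the function name is made up) of the sliding window that _process_input_file applies to produce these triples:

def make_triples(sentences):
  """Returns (predecessor, current, successor) triples from a list of sentences."""
  triples = []
  predecessor, current = None, None
  for successor in sentences:
    # The first two sentences only fill the window; from the third sentence on,
    # every new sentence completes one triple.
    if predecessor is not None and current is not None:
      triples.append((predecessor, current, successor))
    predecessor, current = current, successor
  return triples

# For a document [S1, S2, S3, S4] this returns [(S1, S2, S3), (S2, S3, S4)],
# i.e. predecessor = [S1, S2], current = [S2, S3], successor = [S3, S4].
print(make_triples(["S1", "S2", "S3", "S4"]))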

2.2 Building the encoder (skip_thoughts/skip_thoughts_model.py)

 def build_encoder(self):
    """Builds the sentence encoder.
    Inputs:
      self.encode_ids   ## word ids of the current sentence from 2.1
      self.encode_emb   ## word-embedding lookup of the current sentence from 2.1
      self.encode_mask  ## 0/1 mask over the current sentence (1 = real word, 0 = padding)
    Outputs:
      self.thought_vectors
    Raises:
      ValueError: if config.bidirectional_encoder is True and config.encoder_dim
        is odd.
    """
    with tf.variable_scope("encoder") as scope:
      length = tf.to_int32(tf.reduce_sum(self.encode_mask, 1), name="length")

      if self.config.bidirectional_encoder:
        if self.config.encoder_dim % 2:
          raise ValueError(
              "encoder_dim must be even when using a bidirectional encoder.")
        num_units = self.config.encoder_dim // 2
        cell_fw = self._initialize_gru_cell(num_units)  # Forward encoder
        cell_bw = self._initialize_gru_cell(num_units)  # Backward encoder
        _, states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw=cell_fw,
            cell_bw=cell_bw,
            inputs=self.encode_emb,
            sequence_length=length,
            dtype=tf.float32,
            scope=scope)
        thought_vectors = tf.concat(states, 1, name="thought_vectors")
      else:
        cell = self._initialize_gru_cell(self.config.encoder_dim)
        _, state = tf.nn.dynamic_rnn(
            cell=cell,
            inputs=self.encode_emb,
            sequence_length=length,
            dtype=tf.float32,
            scope=scope)
        # Use an identity operation to name the Tensor in the Graph.
        thought_vectors = tf.identity(state, name="thought_vectors")

    self.thought_vectors = thought_vectors

Purpose: encode the current sentence (with a bidirectional or unidirectional GRU) and output its sentence encoding, the thought vector.
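As a toy illustration of the unidirectional branch above (not repo code; sizes are arbitrary), the mask is summed to recover true sentence lengths and the final GRU state over the embedded words becomes the thought vector:

import tensorflow as tf  # TF 1.x API, matching the repo

batch_size, padded_length, emb_dim, encoder_dim = 2, 5, 8, 16
encode_emb = tf.placeholder(tf.float32, [batch_size, padded_length, emb_dim])
encode_mask = tf.placeholder(tf.int32, [batch_size, padded_length])

length = tf.to_int32(tf.reduce_sum(encode_mask, 1))        # true lengths, ignoring padding
cell = tf.contrib.rnn.GRUCell(encoder_dim)
_, state = tf.nn.dynamic_rnn(cell, encode_emb, sequence_length=length, dtype=tf.float32)
thought_vectors = tf.identity(state, name="thought_vectors")  # shape [batch_size, encoder_dim]

# In the bidirectional case each direction uses encoder_dim // 2 units and the
# forward/backward final states are concatenated back to encoder_dim.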

2.3 Building the decoders (skip_thoughts/skip_thoughts_model.py)

  def _build_decoder(self, name, embeddings, targets, mask, initial_state,
                     reuse_logits):
    """Builds a sentence decoder.
    Args:
      name: Decoder name.
      embeddings: Batch of sentences to decode; a float32 Tensor with shape
        [batch_size, padded_length, emb_dim].
      targets: Batch of target word ids; an int64 Tensor with shape
        [batch_size, padded_length].
      mask: A 0/1 Tensor with shape [batch_size, padded_length].
      initial_state: Initial state of the GRU. A float32 Tensor with shape
        [batch_size, num_gru_cells].
      reuse_logits: Whether to reuse the logits weights.
    """
    # Decoder RNN.
    cell = self._initialize_gru_cell(self.config.encoder_dim)
    with tf.variable_scope(name) as scope:
      # Add a padding word at the start of each sentence (to correspond to the
      # prediction of the first word) and remove the last word.
      decoder_input = tf.pad(
          embeddings[:, :-1, :], [[0, 0], [1, 0], [0, 0]], name="input")
      length = tf.reduce_sum(mask, 1, name="length")
      decoder_output, _ = tf.nn.dynamic_rnn(
          cell=cell,
          inputs=decoder_input,
          sequence_length=length,
          initial_state=initial_state,
          scope=scope)

    # Stack batch vertically.
    decoder_output = tf.reshape(decoder_output, [-1, self.config.encoder_dim])
    targets = tf.reshape(targets, [-1])
    weights = tf.to_float(tf.reshape(mask, [-1]))

    # Logits.
    with tf.variable_scope("logits", reuse=reuse_logits) as scope:
      logits = tf.contrib.layers.fully_connected(
          inputs=decoder_output,
          num_outputs=self.config.vocab_size,
          activation_fn=None,
          weights_initializer=self.uniform_initializer,
          scope=scope)

    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    batch_loss = tf.reduce_sum(losses * weights)
    tf.losses.add_loss(batch_loss)

    tf.summary.scalar("losses/" + name, batch_loss)

    self.target_cross_entropy_losses.append(losses)
    self.target_cross_entropy_loss_weights.append(weights)

  def build_decoders(self):
    """Builds the sentence decoders.
    Inputs:  ## analogous to the encoder inputs in 2.2; decode_pre_* are built from the predecessor in 2.1 and decode_post_* from the successor in 2.1
      self.decode_pre_emb
      self.decode_post_emb
      self.decode_pre_ids
      self.decode_post_ids
      self.decode_pre_mask
      self.decode_post_mask
      self.thought_vectors
    Outputs:
      self.target_cross_entropy_losses
      self.target_cross_entropy_loss_weights
    """
    if self.mode != "encode":
      # Pre-sentence decoder.
      self._build_decoder("decoder_pre", self.decode_pre_emb,
                          self.decode_pre_ids, self.decode_pre_mask,
                          self.thought_vectors, False)

      # Post-sentence decoder. Logits weights are reused.
      self._build_decoder("decoder_post", self.decode_post_emb,
                          self.decode_post_ids, self.decode_post_mask,
                          self.thought_vectors, True)

Purpose: using the thought vector produced in 2.2 as the initial state, decode the previous and next sentences separately. The previous-sentence decoder's outputs are scored against decode_pre_ids with a cross-entropy loss, and that loss is multiplied by decode_pre_mask so padded positions contribute nothing (and likewise for the next sentence with decode_post_ids and decode_post_mask).
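A toy sketch (not repo code) of the two key tricks in _build_decoder, the right-shift of the decoder inputs and the mask-weighted loss:

import tensorflow as tf  # TF 1.x API, matching the repo

vocab_size = 1000
emb = tf.placeholder(tf.float32, [None, 6, 8])   # stands in for decode_pre_emb / decode_post_emb
mask = tf.placeholder(tf.int64, [None, 6])       # 1 = real word, 0 = padding
targets = tf.placeholder(tf.int64, [None, 6])    # stands in for decode_pre_ids / decode_post_ids

# 1) Shift right: prepend a zero "start" step and drop the last word, so the GRU
#    at step t sees only words < t and is trained to predict word t.
decoder_input = tf.pad(emb[:, :-1, :], [[0, 0], [1, 0], [0, 0]])

# (here the repo runs tf.nn.dynamic_rnn with initial_state=thought_vectors and a
#  fully connected layer producing [batch * padded_length, vocab_size] logits)
logits = tf.placeholder(tf.float32, [None, vocab_size])   # stand-in for that output

# 2) Mask-weighted loss: per-word cross entropy times the flattened 0/1 mask,
#    so padded positions add nothing to the gradient.
flat_targets = tf.reshape(targets, [-1])
weights = tf.to_float(tf.reshape(mask, [-1]))
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=flat_targets, logits=logits)
batch_loss = tf.reduce_sum(losses * weights)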

The idea behind the algorithm is fairly simple. On top of it I added two more models, a BiLSTM and a self-attention variant; the code is at https://github.com/jinjiajia/skip_thoughts
