[Software Engineering Application and Practice] lingvo Study Notes
2021SC@SDUSC
Reading tf.nn.seq2seq as used by lingvo
basic_rnn_seq2seq
- input: embedding
- output: embedding
The encoder's final state vector is used as the decoder's initial state; the encoder and decoder use the same kind of RNN cell but do not share weights.
tied_rnn_seq2seq
- input: embedding
- output: embedding
The encoder and decoder share weights.
embedding_rnn_seq2seq
- input: id
- output: id
The embedding matrices for the encoder and decoder are created internally.
embedding_tied_rnn_seq2seq
- input: id
- output: id
The embedding matrix is created internally and shared between the encoder and decoder.
embedding_attention_seq2seq
- input: id
- output: id
Adds an attention mechanism.
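All five interfaces take length-T Python lists of per-timestep tensors as inputs. As a rough usage sketch of embedding_attention_seq2seq (assuming a recent TF 1.x environment, where these functions live under tf.contrib.legacy_seq2seq; all sizes below are made up):
import tensorflow as tf

T, batch_size = 10, 32
vocab_src, vocab_tgt, emb_size = 8000, 6000, 128  # hypothetical vocabulary / embedding sizes

# id inputs: a length-T list of [batch_size] int32 tensors
encoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(T)]
decoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(T)]

cell = tf.nn.rnn_cell.GRUCell(256)

# embedding matrices are created internally; attention is added on top
outputs, state = tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=vocab_src,
    num_decoder_symbols=vocab_tgt,
    embedding_size=emb_size,
    feed_previous=True)  # greedy decoding from the previous output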
tf.nn.seq2seq.embedding_attention_seq2seq
Code snippet
# T denotes time_steps, the sequence length
def embedding_attention_seq2seq(encoder_inputs,  # [T, batch_size]
                                decoder_inputs,  # [T, batch_size]
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,             # number of attention heads
                                output_projection=None,  # decoder output projection (W, B)
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
params
input
- encoder_inputs: list of int32 id tensors
- decoder_inputs: list of int32 id tensors
- cell: an instance of RNNCell
- num_encoder_symbols, num_decoder_symbols: the vocabulary sizes for encoding and decoding, respectively
- embedding_size: the word-embedding dimension
- num_heads: number of attention heads
- output_projection: the projection matrix and bias (W, B) used to project the decoder output vectors into vocabulary space; W has shape [output_size, num_decoder_symbols] and B has shape [num_decoder_symbols]. If this argument is given and feed_previous=True, the previous decoder output is multiplied by W and added to B before being fed in as the next decoder input.
- feed_previous: if True, only the first decoder input (the "GO" symbol) is used, and every later decoder input depends on the previous step's output. This is typically used at test time (as the source notes, it can also be used during training to simulate the test-time setting, e.g. Scheduled Sampling).
- initial_state_attention: defaults to False, meaning the initial attention is zero; if True, attention is initialized from the initial state and the attention states.
output
- (outputs, state): outputs is a list of 2-D tensors, each with shape [batch_size, output_size] (equal to num_decoder_symbols when output_projection is None); state is the decoder cell's state at the last time step, with shape [batch_size, cell.state_size].
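To make output_projection and feed_previous concrete, a minimal sketch of what the (W, B) pair is used for (hypothetical sizes and variable names):
import tensorflow as tf

batch_size, output_size, num_decoder_symbols = 32, 256, 6000
w = tf.get_variable("proj_w", [output_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])
output_projection = (w, b)

# stand-in for one time step's decoder output of size output_size
decoder_output = tf.zeros([batch_size, output_size])

logits = tf.nn.xw_plus_b(decoder_output, w, b)  # project into vocabulary space: output * W + B
next_id = tf.argmax(logits, 1)                  # with feed_previous=True, this id is embedded
                                                # and fed in as the next decoder input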
encoder
- create the embedding matrix
- compute the encoder outputs and state
encoder_cell = rnn_cell.EmbeddingWrapper(
    cell, embedding_classes=num_encoder_symbols,
    embedding_size=embedding_size)
encoder_outputs, encoder_state = rnn.rnn(
    encoder_cell, encoder_inputs, dtype=dtype)  # T * [batch_size, size]
- build the attention states that attention is computed over
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
              for e in encoder_outputs]  # T * [batch_size, 1, size]
attention_states = array_ops.concat(1, top_states)  # [batch_size, T, size]
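A quick shape walk-through of the encoder part (a NumPy sketch of the shapes only, not the library code):
import numpy as np

T, batch_size, size = 4, 3, 6  # hypothetical sizes
encoder_outputs = [np.random.randn(batch_size, size) for _ in range(T)]

top_states = [e.reshape(batch_size, 1, size) for e in encoder_outputs]  # T * [batch_size, 1, size]
attention_states = np.concatenate(top_states, axis=1)                   # [batch_size, T, size]
assert attention_states.shape == (batch_size, T, size)

Note that the old array_ops.concat API used above takes the concatenation axis as its first argument; newer TF versions take the list of tensors first.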
EmbeddingWrapper
puts an embedding layer in front of the RNN cell to form encoder_cell, so its input can be raw word ids.
EmbeddingWrapper
- creates an embedding matrix of shape [embedding_classes, embedding_size]
- inputs: [batch_size, 1]
- return: (output, state)
class EmbeddingWrapper(RNNCell):
  def __init__(self, cell, embedding_classes, embedding_size, initializer=None): ...
  def __call__(self, inputs, state, scope=None): ...
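A rough re-implementation sketch of what EmbeddingWrapper.__call__ does (my own simplified version, not the library code): look the id up in an internally created embedding matrix and feed the embedding to the wrapped cell.
import tensorflow as tf

class SimpleEmbeddingWrapper(tf.nn.rnn_cell.RNNCell):  # hypothetical re-implementation
  def __init__(self, cell, embedding_classes, embedding_size):
    super(SimpleEmbeddingWrapper, self).__init__()
    self._cell = cell
    self._embedding_classes = embedding_classes
    self._embedding_size = embedding_size

  @property
  def state_size(self):
    return self._cell.state_size

  @property
  def output_size(self):
    return self._cell.output_size

  def __call__(self, inputs, state, scope=None):
    with tf.variable_scope(scope or "simple_embedding_wrapper"):
      # the embedding matrix: [embedding_classes, embedding_size]
      embedding = tf.get_variable(
          "embedding", [self._embedding_classes, self._embedding_size])
      # inputs: [batch_size, 1] int ids -> [batch_size, embedding_size]
      embedded = tf.nn.embedding_lookup(embedding, tf.reshape(inputs, [-1]))
    return self._cell(embedded, state)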
Decoder
The decoder cell is produced by wrapping the cell instance passed in with the OutputProjectionWrapper class.
# Decoder.
output_size = None
if output_projection is None:
  cell = rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
  output_size = num_decoder_symbols
if isinstance(feed_previous, bool):
  return embedding_attention_decoder(
      ...
      )
OutputProjectionWrapper
class OutputProjectionWrapper(RNNCell):
  def __init__(self, cell, output_size): ...  # output_size: the size after the projection
  # __call__ runs the wrapped cell, then projects its output to output_size
  def __call__(self, inputs, state, scope=None): ...
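Similarly, a simplified sketch of what OutputProjectionWrapper.__call__ does (again an approximation, not the library code): run the wrapped cell, then linearly project its output to output_size.
import tensorflow as tf

class SimpleOutputProjectionWrapper(tf.nn.rnn_cell.RNNCell):  # hypothetical re-implementation
  def __init__(self, cell, output_size):
    super(SimpleOutputProjectionWrapper, self).__init__()
    self._cell = cell
    self._output_size = output_size

  @property
  def state_size(self):
    return self._cell.state_size

  @property
  def output_size(self):
    return self._output_size

  def __call__(self, inputs, state, scope=None):
    output, new_state = self._cell(inputs, state)
    with tf.variable_scope(scope or "simple_output_projection_wrapper"):
      w = tf.get_variable("w", [output.get_shape()[-1].value, self._output_size])
      b = tf.get_variable("b", [self._output_size])
      projected = tf.nn.xw_plus_b(output, w, b)  # [batch_size, output_size]
    return projected, new_state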
embedding_attention_decoder
- create the embedding matrix for the decoder
- create a loop_function that projects the previous output into the output_projection space and embeds the resulting word as the next-step input (see the sketch after the core code below)
def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
# core code
embedding = variable_scope.get_variable("embedding",
                                        [num_symbols, embedding_size])
loop_function = _extract_argmax_and_embed(
    embedding, output_projection,
    update_embedding_for_previous) if feed_previous else None
emb_inp = [
    embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
# T * [batch_size, embedding_size]
return attention_decoder(
    emb_inp,
    initial_state,
    attention_states,
    cell,
    output_size=output_size,
    num_heads=num_heads,
    loop_function=loop_function,
    initial_state_attention=initial_state_attention)
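For reference, roughly what the loop_function built by _extract_argmax_and_embed does when feed_previous=True (a sketch based on the call above, not a verbatim copy of the source; it closes over embedding, output_projection and update_embedding_for_previous from embedding_attention_decoder):
import tensorflow as tf

def loop_function(prev, _):
  # prev: the decoder output of the previous time step
  if output_projection is not None:
    # project back into vocabulary space with (W, B)
    prev = tf.nn.xw_plus_b(prev, output_projection[0], output_projection[1])
  prev_symbol = tf.argmax(prev, 1)                           # greedy: most likely word id
  emb_prev = tf.nn.embedding_lookup(embedding, prev_symbol)  # embed it
  if not update_embedding_for_previous:
    emb_prev = tf.stop_gradient(emb_prev)                    # do not backprop through this path
  return emb_prev                                            # becomes the next decoder input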
tf.nn.attention_decoder
This part is not that easy to follow, honestly.
The paper involves three formulas:
u_t^i = v^T * tanh(W1 * h_i + W2 * d_t)
a_t^i = softmax(u_t^i)
d_t' = sum_i a_t^i * h_i
where h_i is the encoder's hidden state at position i, d_t is the decoder's hidden state at step t, and v, W1, W2 are parameters the model learns. "Attention" simply means that at every decoding time step the encoder hidden states are summed with weights, so that different pieces of information receive different amounts of attention.
Common steps of an attention mechanism
- compute a weight for each attended hidden state from the current hidden state and that attended hidden state
- normalize the weights into probabilities with softmax
- use them as coefficients in a weighted sum over the hidden states, producing a single context vector.
Attention is just a weighted sum over information; one attention head corresponds to one way of weighting, and the num_heads parameter defines how many attention heads are used. So formula 3 becomes, for each head k with its own v_k, W1_k, W2_k: d_t^k = sum_i a_t^{i,k} * h_i, and the vectors produced by all heads are concatenated.
- W1 * h_t is implemented as a 1-by-1 convolution, which returns a tensor of shape [batch_size, attn_length, 1, attention_vec_size]
# To calculate W1 * h_t we use a 1-by-1 convolution
hidden = array_ops.reshape(
    attention_states, [-1, attn_length, 1, attn_size])
hidden_features = []
v = []
attention_vec_size = attn_size  # Size of query vectors for attention.
for a in xrange(num_heads):
  k = variable_scope.get_variable("AttnW_%d" % a,
                                  [1, 1, attn_size, attention_vec_size])
  hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
  v.append(
      variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
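Why a 1-by-1 convolution computes W1 * h_t: with a kernel of shape [1, 1, attn_size, attention_vec_size], conv2d applies the same [attn_size, attention_vec_size] matrix at every time position, which is exactly a per-timestep matmul. A small NumPy check with hypothetical shapes:
import numpy as np

batch_size, attn_length, attn_size, vec_size = 2, 5, 8, 8
hidden = np.random.randn(batch_size, attn_length, 1, attn_size)  # reshaped encoder states
k = np.random.randn(attn_size, vec_size)                         # the 1x1 kernel, squeezed

conv_like = np.einsum('btxa,av->btxv', hidden, k)  # what the 1x1 conv2d computes
matmul = hidden.reshape(-1, attn_size) @ k         # plain per-timestep W1 * h_t
assert np.allclose(conv_like.reshape(-1, vec_size), matmul)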
- W2 * d_t is implemented with linear
for a in xrange(num_heads):
  with variable_scope.variable_scope("Attention_%d" % a):
    # query corresponds to the current hidden state d_t
    y = linear(query, attention_vec_size, True)
    y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
    # compute u_t
    s = math_ops.reduce_sum(
        v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
    a = nn_ops.softmax(s)
    # compute the attention-weighted vector d.
    d = math_ops.reduce_sum(
        array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden,
        [1, 2])
    ds.append(array_ops.reshape(d, [-1, attn_size]))
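Putting the three formulas and the code together, a tiny NumPy sketch of one attention head (shapes only; W1, W2 and v are random stand-ins for the learned parameters):
import numpy as np

batch_size, attn_length, attn_size = 2, 5, 8
hidden = np.random.randn(batch_size, attn_length, attn_size)  # encoder states h_i
query = np.random.randn(batch_size, attn_size)                # decoder state d_t

W1 = np.random.randn(attn_size, attn_size)
W2 = np.random.randn(attn_size, attn_size)
v = np.random.randn(attn_size)

u = np.tanh(hidden @ W1 + (query @ W2)[:, None, :]) @ v  # u_t^i: [batch_size, attn_length]
a = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)     # softmax over the T positions
d = (a[:, :, None] * hidden).sum(axis=1)                 # weighted sum: [batch_size, attn_size]

With num_heads > 1, each head has its own W1, W2 and v, and the resulting d vectors (the ds list in the code) are concatenated before being fed back into the decoder.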