[Software Engineering Application and Practice] lingvo Study Notes
2021SC@SDUSC
Reading tf.nn.seq2seq as used by lingvo
basic_rnn_seq2seq
- input: embedding
- output: embedding
The encoder's final state vector is used as the decoder's initial state; the encoder and decoder use the same kind of RNN cell but do not share weights.
tied_rnn_seq2seq
- input: embedding
- output: embedding
The encoder and decoder share weights.
embedding_rnn_seq2seq
- input: id
- output: id
The embedding matrices for the encoder and decoder are created internally.
embedding_tied_rnn_seq2seq
- input: id
- output: id
The embedding matrix is created internally and shared between the encoder and decoder.
embedding_attention_seq2seq
- input: id
- output: id
Adds an attention mechanism.
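All five interfaces take length-T Python lists of per-timestep tensors as inputs. As a rough usage sketch of embedding_attention_seq2seq (assuming a recent TF 1.x environment, where these functions live under tf.contrib.legacy_seq2seq; all sizes below are made up):
import tensorflow as tf

T, batch_size = 10, 32
vocab_src, vocab_tgt, emb_size = 8000, 6000, 128  # hypothetical vocabulary / embedding sizes

# id inputs: a length-T list of [batch_size] int32 tensors
encoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(T)]
decoder_inputs = [tf.placeholder(tf.int32, [batch_size]) for _ in range(T)]

cell = tf.nn.rnn_cell.GRUCell(256)

# embedding matrices are created internally; attention is added on top
outputs, state = tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
    encoder_inputs, decoder_inputs, cell,
    num_encoder_symbols=vocab_src,
    num_decoder_symbols=vocab_tgt,
    embedding_size=emb_size,
    feed_previous=True)  # greedy decoding from the previous output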
tf.nn.seq2seq.embedding_attention_seq2seq
Code snippet
# T denotes time_steps, the sequence length
def embedding_attention_seq2seq(encoder_inputs,  # [T, batch_size]
                                decoder_inputs,  # [T, batch_size]
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,             # number of attention heads
                                output_projection=None,  # decoder output projection (W, B)
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
params
input
- encoder_inputs: list of int32 id tensors
- decoder_inputs: list of int32 id tensors
- cell: an instance of RNNCell
- num_encoder_symbols, num_decoder_symbols: the vocabulary sizes for encoding and decoding, respectively
- embedding_size: the word-embedding dimension
- num_heads: number of attention heads
- output_projection: the projection matrix and bias (W, B) used to project the decoder output vectors into vocabulary space; W has shape [output_size, num_decoder_symbols] and B has shape [num_decoder_symbols]. If this argument is given and feed_previous=True, the previous decoder output is multiplied by W and added to B before being fed in as the next decoder input.
- feed_previous: if True, only the first decoder input (the "GO" symbol) is used, and every later decoder input depends on the previous step's output. This is typically used at test time (as the source notes, it can also be used during training to simulate the test-time setting, e.g. Scheduled Sampling).
- initial_state_attention: defaults to False, meaning the initial attention is zero; if True, attention is initialized from the initial state and the attention states.
output
- (outputs, state): outputs is a list of 2-D tensors, each with shape [batch_size, output_size] (equal to num_decoder_symbols when output_projection is None); state is the decoder cell's state at the last time step, with shape [batch_size, cell.state_size].
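To make output_projection and feed_previous concrete, a minimal sketch of what the (W, B) pair is used for (hypothetical sizes and variable names):
import tensorflow as tf

batch_size, output_size, num_decoder_symbols = 32, 256, 6000
w = tf.get_variable("proj_w", [output_size, num_decoder_symbols])
b = tf.get_variable("proj_b", [num_decoder_symbols])
output_projection = (w, b)

# stand-in for one time step's decoder output of size output_size
decoder_output = tf.zeros([batch_size, output_size])

logits = tf.nn.xw_plus_b(decoder_output, w, b)  # project into vocabulary space: output * W + B
next_id = tf.argmax(logits, 1)                  # with feed_previous=True, this id is embedded
                                                # and fed in as the next decoder input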
encoder
- create the embedding matrix
- compute the encoder outputs and state
encoder_cell = rnn_cell.EmbeddingWrapper(
    cell, embedding_classes=num_encoder_symbols,
    embedding_size=embedding_size)
encoder_outputs, encoder_state = rnn.rnn(
    encoder_cell, encoder_inputs, dtype=dtype)  # T * [batch_size, size]
- build the attention states that attention is computed over
top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
              for e in encoder_outputs]  # T * [batch_size, 1, size]
attention_states = array_ops.concat(1, top_states)  # [batch_size, T, size]
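A quick shape walk-through of the encoder part (a NumPy sketch of the shapes only, not the library code):
import numpy as np

T, batch_size, size = 4, 3, 6  # hypothetical sizes
encoder_outputs = [np.random.randn(batch_size, size) for _ in range(T)]

top_states = [e.reshape(batch_size, 1, size) for e in encoder_outputs]  # T * [batch_size, 1, size]
attention_states = np.concatenate(top_states, axis=1)                   # [batch_size, T, size]
assert attention_states.shape == (batch_size, T, size)

Note that the old array_ops.concat API used above takes the concatenation axis as its first argument; newer TF versions take the list of tensors first.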
EmbeddingWrapper
puts an embedding layer in front of the RNN cell to form encoder_cell, so its input can be raw word ids.
EmbeddingWrapper
- creates an embedding matrix of shape [embedding_classes, embedding_size]
- inputs: [batch_size, 1]
- return: (output, state)
class EmbeddingWrapper(RNNCell):
  def __init__(self, cell, embedding_classes, embedding_size, initializer=None): ...
  def __call__(self, inputs, state, scope=None): ...
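A rough re-implementation sketch of what EmbeddingWrapper.__call__ does (my own simplified version, not the library code): look the id up in an internally created embedding matrix and feed the embedding to the wrapped cell.
import tensorflow as tf

class SimpleEmbeddingWrapper(tf.nn.rnn_cell.RNNCell):  # hypothetical re-implementation
  def __init__(self, cell, embedding_classes, embedding_size):
    super(SimpleEmbeddingWrapper, self).__init__()
    self._cell = cell
    self._embedding_classes = embedding_classes
    self._embedding_size = embedding_size

  @property
  def state_size(self):
    return self._cell.state_size

  @property
  def output_size(self):
    return self._cell.output_size

  def __call__(self, inputs, state, scope=None):
    with tf.variable_scope(scope or "simple_embedding_wrapper"):
      # the embedding matrix: [embedding_classes, embedding_size]
      embedding = tf.get_variable(
          "embedding", [self._embedding_classes, self._embedding_size])
      # inputs: [batch_size, 1] int ids -> [batch_size, embedding_size]
      embedded = tf.nn.embedding_lookup(embedding, tf.reshape(inputs, [-1]))
    return self._cell(embedded, state)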
Decoder
The decoder cell is produced by wrapping the cell instance passed in with the OutputProjectionWrapper class.
# Decoder.
output_size = None
if output_projection is None:
  cell = rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
  output_size = num_decoder_symbols
if isinstance(feed_previous, bool):
  return embedding_attention_decoder(
      ...
      )
OutputProjectionWrapper
class OutputProjectionWrapper(RNNCell):
  def __init__(self, cell, output_size): ...  # output_size: the size after the projection
  # __call__ runs the wrapped cell, then projects its output to output_size
  def __call__(self, inputs, state, scope=None): ...
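Similarly, a simplified sketch of what OutputProjectionWrapper.__call__ does (again an approximation, not the library code): run the wrapped cell, then linearly project its output to output_size.
import tensorflow as tf

class SimpleOutputProjectionWrapper(tf.nn.rnn_cell.RNNCell):  # hypothetical re-implementation
  def __init__(self, cell, output_size):
    super(SimpleOutputProjectionWrapper, self).__init__()
    self._cell = cell
    self._output_size = output_size

  @property
  def state_size(self):
    return self._cell.state_size

  @property
  def output_size(self):
    return self._output_size

  def __call__(self, inputs, state, scope=None):
    output, new_state = self._cell(inputs, state)
    with tf.variable_scope(scope or "simple_output_projection_wrapper"):
      w = tf.get_variable("w", [output.get_shape()[-1].value, self._output_size])
      b = tf.get_variable("b", [self._output_size])
      projected = tf.nn.xw_plus_b(output, w, b)  # [batch_size, output_size]
    return projected, new_state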
embedding_attention_decoder
- create the embedding matrix for the decoder
- create a loop_function that projects the previous output into the output_projection space and embeds the resulting word as the next-step input (see the sketch after the core code below)
def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
# core code
embedding = variable_scope.get_variable("embedding",
                                        [num_symbols, embedding_size])
loop_function = _extract_argmax_and_embed(
    embedding, output_projection,
    update_embedding_for_previous) if feed_previous else None
emb_inp = [
    embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
# T * [batch_size, embedding_size]
return attention_decoder(
    emb_inp,
    initial_state,
    attention_states,
    cell,
    output_size=output_size,
    num_heads=num_heads,
    loop_function=loop_function,
    initial_state_attention=initial_state_attention)
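For reference, roughly what the loop_function built by _extract_argmax_and_embed does when feed_previous=True (a sketch based on the call above, not a verbatim copy of the source; it closes over embedding, output_projection and update_embedding_for_previous from embedding_attention_decoder):
import tensorflow as tf

def loop_function(prev, _):
  # prev: the decoder output of the previous time step
  if output_projection is not None:
    # project back into vocabulary space with (W, B)
    prev = tf.nn.xw_plus_b(prev, output_projection[0], output_projection[1])
  prev_symbol = tf.argmax(prev, 1)                           # greedy: most likely word id
  emb_prev = tf.nn.embedding_lookup(embedding, prev_symbol)  # embed it
  if not update_embedding_for_previous:
    emb_prev = tf.stop_gradient(emb_prev)                    # do not backprop through this path
  return emb_prev                                            # becomes the next decoder input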
tf.nn.attention_decoder
This part is not that easy to follow, honestly.
The paper involves three formulas:
u_t^i = v^T * tanh(W1 * h_i + W2 * d_t)
a_t^i = softmax(u_t^i)
d_t' = sum_i a_t^i * h_i
where h_i is the encoder's hidden state at position i, d_t is the decoder's hidden state at step t, and v, W1, W2 are parameters the model learns. "Attention" simply means that at every decoding time step the encoder hidden states are summed with weights, so that different pieces of information receive different amounts of attention.
Common steps of an attention mechanism
- compute a weight for each attended hidden state from the current hidden state and that attended hidden state
- normalize the weights into probabilities with softmax
- use them as coefficients in a weighted sum over the hidden states, producing a single context vector.
Attention is just a weighted sum over information; one attention head corresponds to one way of weighting, and the num_heads parameter defines how many attention heads are used. So formula 3 becomes, for each head k with its own v_k, W1_k, W2_k: d_t^k = sum_i a_t^{i,k} * h_i, and the vectors produced by all heads are concatenated.
- W1 * h_t is implemented as a 1-by-1 convolution, which returns a tensor of shape [batch_size, attn_length, 1, attention_vec_size]
# To calculate W1 * h_t we use a 1-by-1 convolution
hidden = array_ops.reshape(
    attention_states, [-1, attn_length, 1, attn_size])
hidden_features = []
v = []
attention_vec_size = attn_size  # Size of query vectors for attention.
for a in xrange(num_heads):
  k = variable_scope.get_variable("AttnW_%d" % a,
                                  [1, 1, attn_size, attention_vec_size])
  hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
  v.append(
      variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
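Why a 1-by-1 convolution computes W1 * h_t: with a kernel of shape [1, 1, attn_size, attention_vec_size], conv2d applies the same [attn_size, attention_vec_size] matrix at every time position, which is exactly a per-timestep matmul. A small NumPy check with hypothetical shapes:
import numpy as np

batch_size, attn_length, attn_size, vec_size = 2, 5, 8, 8
hidden = np.random.randn(batch_size, attn_length, 1, attn_size)  # reshaped encoder states
k = np.random.randn(attn_size, vec_size)                         # the 1x1 kernel, squeezed

conv_like = np.einsum('btxa,av->btxv', hidden, k)  # what the 1x1 conv2d computes
matmul = hidden.reshape(-1, attn_size) @ k         # plain per-timestep W1 * h_t
assert np.allclose(conv_like.reshape(-1, vec_size), matmul)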
- W2 * d_t is implemented with linear
for a in xrange(num_heads):
  with variable_scope.variable_scope("Attention_%d" % a):
    # query corresponds to the current hidden state d_t
    y = linear(query, attention_vec_size, True)
    y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
    # compute u_t
    s = math_ops.reduce_sum(
        v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
    a = nn_ops.softmax(s)
    # compute the attention-weighted vector d.
    d = math_ops.reduce_sum(
        array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden,
        [1, 2])
    ds.append(array_ops.reshape(d, [-1, attn_size]))
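Putting the three formulas and the code together, a tiny NumPy sketch of one attention head (shapes only; W1, W2 and v are random stand-ins for the learned parameters):
import numpy as np

batch_size, attn_length, attn_size = 2, 5, 8
hidden = np.random.randn(batch_size, attn_length, attn_size)  # encoder states h_i
query = np.random.randn(batch_size, attn_size)                # decoder state d_t

W1 = np.random.randn(attn_size, attn_size)
W2 = np.random.randn(attn_size, attn_size)
v = np.random.randn(attn_size)

u = np.tanh(hidden @ W1 + (query @ W2)[:, None, :]) @ v  # u_t^i: [batch_size, attn_length]
a = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)     # softmax over the T positions
d = (a[:, :, None] * hidden).sum(axis=1)                 # weighted sum: [batch_size, attn_size]

With num_heads > 1, each head has its own W1, W2 and v, and the resulting d vectors (the ds list in the code) are concatenated before being fed back into the decoder.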