1.Encoder-Decoder
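A basic seq2seq model = Encoder + Decoder + semantic encoding c (the intermediate state vector that connects the two). In the Encoder-Decoder structure, both the input and the output are sequences. The Encoder turns a variable-length input sequence into a fixed-length vector representation, and the Decoder turns that fixed-length vector into a variable-length target sequence. The Encoder and Decoder can each be a CNN, an RNN, or a Transformer, and the two may use the same structure or different ones.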
import tensorflow as tf


class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz, embedding_matrix):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        # A bidirectional GRU concatenates both directions, so each direction
        # gets enc_units // 2 units and the concatenated size stays enc_units.
        self.enc_units = enc_units // 2
        # Pretrained word embeddings, frozen during training.
        self.embedding = tf.keras.layers.Embedding(vocab_size,
                                                   embedding_dim,
                                                   weights=[embedding_matrix],
                                                   trainable=False)
        # self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # tf.keras.layers.GRU selects the CPU or GPU implementation automatically.
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.bigru = tf.keras.layers.Bidirectional(self.gru, merge_mode='concat')

    def call(self, x, hidden):
        """
        @desc   batch_size is the number of samples per training batch; each call
                receives all words of every sample in the batch. hidden_size is the
                concatenated size of both GRU directions (in this setup it equals
                embedding_dim).
        @param  x: encoder input word ids (batch_size, seq_len)
                hidden: initial hidden state (batch_size, hidden_size)
        @return output: output vectors (batch_size, seq_len, hidden_size)
                state: final hidden state (batch_size, hidden_size)
        """
        x = self.embedding(x)
        # Split the initial state into the forward and backward halves.
        hidden = tf.split(hidden, num_or_size_splits=2, axis=1)
        output, forward_state, backward_state = self.bigru(x, initial_state=hidden)
        state = tf.concat([forward_state, backward_state], axis=1)
        # output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, 2 * self.enc_units))

class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz, embedding_matrix):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                                   weights=[embedding_matrix],
                                                   trainable=False)
        # self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        # self.dropout = tf.keras.layers.Dropout(0.5)
        # With the softmax activation this layer outputs probabilities, not logits.
        self.fc = tf.keras.layers.Dense(vocab_size, activation=tf.keras.activations.softmax)
        # self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, hidden, enc_output, context_vector):
        """
        @desc   Called once per output word; each call predicts one word.
        @param  x: decoder input word id (batch_size, 1)
                hidden: decoder hidden state (batch_size, hidden_size)
                enc_output: encoder output (batch_size, seq_length, embedding_dim)
                context_vector: attention output (batch_size, hidden_size)
        @return x: GRU input vector (batch_size, 1, embedding_dim + hidden_size)
                out: output distribution (batch_size, vocab_size)
                state: hidden state (batch_size, hidden_size)
        """
        # enc_output shape == (batch_size, max_length, hidden_size)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        # Concatenate the attention context_vector with the input word embedding.
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # Pass the concatenated vector to the GRU. Note that `hidden` and
        # `enc_output` are not used here: the previous decoder state reaches the
        # GRU only through context_vector.
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output = self.dropout(output)
        # out shape == (batch_size, vocab_size)
        out = self.fc(output)
        return x, out, state

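2.Attention Mechanism
The only link between the Encoder and the Decoder is a fixed-length semantic vector C, so the Encoder has to compress the information of the whole sequence into one fixed-length vector. This vector cannot fully represent the whole sequence, and information carried by earlier inputs is diluted by later inputs; the longer the input sequence, the worse this becomes. As a result the Decoder does not receive enough information about the input sequence, and its accuracy drops. Attention-based encoder-decoder models were proposed to address this.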
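The dilution effect described above can be illustrated with a toy linear recurrence rather than a real GRU. This is only a sketch under stated assumptions: the update rule `h = decay * h + (1 - decay) * x` and the names `encode` and `first_input_share` are hypothetical, chosen because they make the share of the first input in the final state easy to compute.

```python
def encode(inputs, decay=0.5):
    """Fold a variable-length list of numbers into one fixed-size state.

    Toy stand-in for an RNN encoder (assumption: a linear update, not a GRU).
    """
    h = 0.0
    for x in inputs:
        h = decay * h + (1.0 - decay) * x
    return h


def first_input_share(seq_len, decay=0.5):
    """Weight of the first input in the final state of a length-seq_len sequence."""
    # Encode an "impulse": 1.0 at the first position, 0.0 everywhere else.
    impulse = [1.0] + [0.0] * (seq_len - 1)
    return encode(impulse, decay)


# The first input's share shrinks geometrically as the sequence gets longer.
for n in (1, 5, 10, 20):
    print(n, first_input_share(n))
```

Whatever the sequence length, the state stays a single number, while the contribution of the first input falls from 0.5 at length 1 to 0.03125 at length 5. Attention avoids this bottleneck by letting the decoder read every encoder state directly.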
Let the encoder hidden states be $\overline{h}_s$ and the decoder hidden state be $h_t$. During decoding, the output at step $t$ is based on a similarity score between $h_{t-1}$ and every encoder hidden state, which gives the attention weights $\alpha_{ts} = \mathrm{softmax}(\mathrm{score}(\overline{h}_s, h_{t-1}))$, with the softmax taken over the source positions $s$. The context vector at that step is then $c_t = \sum_{j=1}^{S} \alpha_{tj} \overline{h}_j$, where $S$ is the length of the source sequence, and it is concatenated with $h_t$. Here score is a scoring function. It can be the weighted inner product $h_{t-1}^{T} w \overline{h}_s$, which is LuongAttention, or $v_a^{T} \tanh(w_1 \overline{h}_s + w_2 h_{t-1})$, which is BahdanauAttention, where $v_a$, $w_1$, and $w_2$ are learned when the network is trained.
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, dec_hidden, enc_output):
        """
        @desc   Called once per predicted word. Computes the attention weights and
                the context vector. The first dec_hidden is the encoder's final
                hidden state.
        @param  dec_hidden: previous decoder hidden state, shape=(batch_size, hidden_size)
                enc_output: encoder output, shape=(batch_size, seq_len, embedding_size)
        @return context_vector: attention output vector, shape=(batch_size, embedding_size)
                attn_dist: softmax weight of each encoder position, shape=(batch_size, seq_len)
        """
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        # The time axis is added so the sum below broadcasts over seq_len.
        hidden_with_time_axis = tf.expand_dims(dec_hidden, 1)  # shape=(16, 1, 256)
        # att_features = self.W1(enc_output) + self.W2(hidden_with_time_axis)
        # Calculate v^T tanh(W_h h_i + W_s s_t + b_attn)
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))  # shape=(16, 200, 1)
        # Attention distribution: softmax over the seq_len axis.
        attn_dist = tf.nn.softmax(score, axis=1)  # shape=(16, 200, 1)
        # Weight each encoder output, then sum over seq_len.
        context_vector = attn_dist * enc_output  # shape=(16, 200, 256)
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = tf.reduce_sum(context_vector, axis=1)  # shape=(16, 256)
        return context_vector, tf.squeeze(attn_dist, -1)

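The weight and context-vector formulas can be checked by hand on a very small case. The sketch below is not the author's layer: it uses the plain dot product as the score (the Luong form with the weight matrix fixed to the identity, an assumption made to keep the arithmetic readable), and the helper names and input values are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(enc_state, dec_state):
    """Dot-product score between one encoder state and the decoder state."""
    return sum(e * d for e, d in zip(enc_state, dec_state))


def attend(enc_states, dec_state):
    """Return (context_vector, attention_weights) for one decoder step."""
    scores = [dot_score(h_s, dec_state) for h_s in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h_s[i] for w, h_s in zip(weights, enc_states))
               for i in range(dim)]
    return context, weights


# Three encoder states of size 2 and one previous decoder state (made-up values).
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_prev = [2.0, 0.0]

context, weights = attend(enc_states, dec_prev)
print(weights)  # first and third positions get equal, larger weights
print(context)
```

The scores here are 2, 0 and 2, so the second encoder state receives the smallest weight and the context vector is dominated by the first and third states. The additive Bahdanau score differs only in how each score is computed; the softmax and the weighted sum are the same.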
3.Seq2Seq
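3.1 Seq2Seq Model
A plain Encoder-Decoder can already handle Seq2Seq problems to a first approximation. To improve accuracy, a Seq2Seq model can add an Attention mechanism on top of the Encoder-Decoder framework.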
# rnn_encoder and rnn_decoder are the project modules that hold the Encoder,
# Decoder and BahdanauAttention classes; load_word2vec is a project helper
# that is not shown here.
class SequenceToSequence(tf.keras.Model):
    def __init__(self, params):
        super(SequenceToSequence, self).__init__()
        self.embedding_matrix = load_word2vec(params)
        self.params = params
        self.encoder = rnn_encoder.Encoder(params["vocab_size"],
                                           params["embed_size"],
                                           params["enc_units"],
                                           params["batch_size"],
                                           self.embedding_matrix)
        self.attention = rnn_decoder.BahdanauAttention(params["attn_units"])
        self.decoder = rnn_decoder.Decoder(params["vocab_size"],
                                           params["embed_size"],
                                           params["dec_units"],
                                           params["batch_size"],
                                           self.embedding_matrix)

    def call_encoder(self, enc_inp):
        """
        @desc   Run the encoder and return its outputs and the hidden state of the last word.
        @param  enc_inp: input word ids, shape=(batch_size, seq_len)
        @return enc_output: output vectors (batch_size, seq_len, embedding_dim)
                enc_hidden: final encoder hidden state (batch_size, hidden_size)
        """
        enc_hidden = self.encoder.initialize_hidden_state()
        # [batch_sz, max_train_x, enc_units], [batch_sz, enc_units]
        enc_output, enc_hidden = self.encoder(enc_inp, enc_hidden)
        return enc_output, enc_hidden

    def call(self, enc_output, dec_inp, dec_hidden, dec_tar):
        """
        @desc   Run the decoder over the target sequence with teacher forcing,
                recomputing the attention context after every step.
        @param  enc_output: encoder output, shape=(batch_size, seq_len, hidden_size)
                dec_inp: decoder input, shape=(batch_size, seq_len); seq_len includes <start>
                dec_hidden: decoder hidden state, shape=(batch_size, hidden_size)
                dec_tar: decoder target, shape=(batch_size, seq_len); seq_len may include <end>
        @return tf.stack(predictions, 1): probability distribution over words for each
                prediction step, shape=(batch_size, seq_len, vocab_size)
                dec_hidden: final decoder hidden state, shape=(batch_size, hidden_size)
        """
        predictions = []
        attentions = []  # collected per step but not returned
        # Use the encoder output to compute the first attention context for the decoder.
        context_vector, _ = self.attention(dec_hidden,  # shape=(16, 256)
                                           enc_output)  # shape=(16, 200, 256)
        for t in range(dec_tar.shape[1]):  # 50
            # Teacher forcing: feed the ground-truth token at step t.
            _, pred, dec_hidden = self.decoder(tf.expand_dims(dec_inp[:, t], 1),
                                               dec_hidden,
                                               enc_output,
                                               context_vector)
            context_vector, attn_dist = self.attention(dec_hidden, enc_output)
            predictions.append(pred)
            attentions.append(attn_dist)
        return tf.stack(predictions, 1), dec_hidden

3.2 Loss Function
The loss function is usually the cross-entropy loss, $J = -\frac{1}{N} \sum_{i=1}^{N} y_i \log(\hat{y}_i)$. For binary classification this becomes $J = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$.
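import tensorflow as tf

# Multi-class loss computed from one-hot labels.
y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
# Using 'auto'/'sum_over_batch_size' reduction type.
cce = tf.keras.losses.CategoricalCrossentropy()
cce(y_true, y_pred).numpy()

# Multi-class loss computed from integer labels. The arguments are converted
# with tf.convert_to_tensor; plain lists may raise an error in some versions.
y_true = tf.convert_to_tensor([1, 2])
y_pred = tf.convert_to_tensor([[0.05, 0.95, 0], [0.1, 0.8, 0.1]])
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred).numpy()

# Loss object used in training. from_logits: whether y_pred is expected to be a
# logits tensor. It must match the decoder output: from_logits=True is only
# correct when the final Dense layer has no softmax activation.
# reduction='none' keeps one loss value per position instead of averaging.
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')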
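To connect the formula with the API calls, the same two samples can be worked through by hand using only the standard library. The sketch below is an illustration, not the Keras implementation. The `masked_mean` helper reflects one common reason for choosing `reduction='none'`, namely averaging only over non-padding positions; whether the author's training loop masks padding is not shown, so the mask here is an assumption.

```python
import math


def sparse_cross_entropy(y_true, y_pred):
    """Per-example loss -log(p[true class]), with no reduction applied."""
    return [-math.log(probs[label]) for label, probs in zip(y_true, y_pred)]


def masked_mean(losses, mask):
    """Average the losses over positions whose mask entry is 1."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


y_true = [1, 2]
y_pred = [[0.05, 0.95, 0.0], [0.1, 0.8, 0.1]]

losses = sparse_cross_entropy(y_true, y_pred)
print(losses)                       # approximately [0.0513, 2.3026]
print(sum(losses) / len(losses))    # approximately 1.1769
print(masked_mean(losses, [1, 0]))  # approximately 0.0513
```

Only the probability assigned to the true class enters each term, which is why the zero in the first prediction row causes no problem. The unmasked mean should agree, up to rounding, with the value returned by the `cce` and `scce` calls for the same inputs.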
3.3 Teacher Forcing
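An RNN normally takes its own output from the previous step as the next input. With Teacher Forcing, training instead follows the teacher (the ground truth): the ground-truth token for the previous position is fed as the input to the next state. Teacher Forcing is commonly used in deep learning language models for machine translation, text summarization, and image captioning.
Because Teacher Forcing relies on labeled data, the model performs well during training. At test time no ground truth is available, so the current input is the model's own output for the previous word, and the model becomes less robust. This mismatch between training and inference in text generation causes exposure bias. To reduce it, beam search, scheduled sampling, constructing negative samples (masking), or adversarial training (for example, gradient penalty) can be used to improve the model's ability to generalize.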
# Teacher forcing: the decoder input dec_input is the ground truth for the
# previous position, not the previous prediction.
# dec_input = [start_id] + sequence[:], dec_target = sequence[:] + [stop_id]
for t in range(dec_target.shape[1]):  # dec_target.shape[1] is the number of output words, i.e. the loop count
    _, pred, dec_hidden = self.decoder(tf.expand_dims(dec_input[:, t], 1),
                                       dec_hidden,
                                       enc_output,
                                       context_vector)
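The gap between training with ground-truth inputs and inference with the model's own outputs can be made concrete with a toy lookup-table "model". Everything in this sketch is hypothetical (the token ids, `TOY_MODEL`, and the helper names); it only illustrates how a single wrong prediction stays isolated under teacher forcing but propagates when the model is fed its own outputs.

```python
START_ID, STOP_ID = 0, 9

# Hypothetical "model": maps the previous token to the predicted next token.
# It contains one deliberate mistake (1 -> 3 instead of 1 -> 2).
TOY_MODEL = {0: 1, 1: 3, 2: 3, 3: 4, 4: 9}


def make_decoder_pair(sequence):
    """Build dec_input = [start] + sequence and dec_target = sequence + [stop]."""
    return [START_ID] + sequence, sequence + [STOP_ID]


def decode_teacher_forcing(dec_input):
    """Feed the ground-truth previous token at every step."""
    return [TOY_MODEL[tok] for tok in dec_input]


def decode_free_running(num_steps):
    """Feed the model's own previous prediction at every step."""
    preds, tok = [], START_ID
    for _ in range(num_steps):
        tok = TOY_MODEL.get(tok, STOP_ID)
        preds.append(tok)
    return preds


def num_errors(preds, target):
    """Count positions where the prediction differs from the target."""
    return sum(p != t for p, t in zip(preds, target))


dec_input, dec_target = make_decoder_pair([1, 2, 3, 4])
forced_preds = decode_teacher_forcing(dec_input)
free_preds = decode_free_running(len(dec_target))
print(forced_preds, num_errors(forced_preds, dec_target))  # [1, 3, 3, 4, 9] 1
print(free_preds, num_errors(free_preds, dec_target))      # [1, 3, 4, 9, 9] 3
```

With teacher forcing the mistake costs one position, because the next input is reset to the ground truth. In free-running mode the wrong token becomes the next input and shifts the rest of the output, which is the training and inference mismatch that scheduled sampling tries to reduce.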