If you spot any mistakes, please point them out.
Source code of textsum, the automatic text summarization project: https://github.com/tensorflow/models/tree/master/research/textsum
Its core idea is a seq2seq model with attention.
Attention-based seq2seq
The main idea of seq2seq is:
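A Seq2Seq model handles the problem of predicting an unknown output sequence from an input sequence. It has two parts: an "Encoder" for the encoding stage and a "Decoder" for the decoding stage. As the simple structure in the accompanying figure shows, the encoder RNN reads one embedding vector per input symbol at each step, for example A, B, C and then the end marker, and encodes the input sequence into a fixed-length vector. The decoder RNN then decodes one symbol at a time, for example predicting X first. During training, the ground-truth symbol of the previous step is forced in as the input of the next step, so X is the input when predicting Y; at decoding time the model's own previous prediction is fed back instead.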
Define the input sequence $x = [x_1, x_2, \ldots, x_{T_x}]$, made up of $T_x$ vectors of fixed length $d$, and the output sequence $y = [y_1, y_2, \ldots, y_{T_y}]$, made up of $T_y$ vectors of fixed length $d$. Denote the hidden state of the encoder RNN by $h_j$ and the hidden state of the decoder RNN by $s_i$.
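The attention mechanism
Although an LSTM has memory, when the encoder input is very long the decoder LSTM still cannot decode well with respect to the earliest inputs. The attention mechanism was proposed to address exactly this problem. At every decoding step the decoder receives an extra input: a weighted sum over the hidden states $h_1, h_2, \ldots, h_{T_x}$ of the whole input sequence. Put differently, each time the model predicts the next word it looks over the hidden states of all input positions and decides which input words are most relevant to the word being predicted.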
With attention, every decoding step receives a context vector $c_i$ as input. The new hidden state $s_i$ is a nonlinear function of the previous state $s_{i-1}$, the previous output $y_{i-1}$, and $c_i$:
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
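The context vector is recomputed at every decoding step. An MLP computes, for output position $i$, the weight $a_{ij}$ of the hidden state at each input position, and the context vector is the weighted average of all encoder hidden states. The alignment model mentioned in the paper is exactly this model that relates input position $j$ to output position $i$. $a_{ij}$ can be read as the attention weight that decoding step $i$ assigns to input position $j$. $e_{ij}$ is the output of a simple MLP, and $a_{ij}$ is the result of normalizing $e_{ij}$ with a softmax over $j$.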
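The weighting described above can be sketched in NumPy. This is only a minimal illustration, assuming an additive (MLP) score of the form $e_{ij} = v^\top \tanh(W_1 h_j + W_2 s_{i-1})$; the function name `additive_attention` and the parameters `W1`, `W2`, `v` are made up for this sketch and are not taken from textsum.

```python
import numpy as np

def additive_attention(h, s_prev, W1, W2, v):
    """Return attention weights a_i and context vector c_i for one decoding step.

    h:      (T_x, n) encoder hidden states h_1..h_Tx
    s_prev: (m,)     previous decoder state s_{i-1}
    W1: (n, k), W2: (m, k), v: (k,)  hypothetical MLP parameters
    """
    # e_ij: one score per input position j, from a one-hidden-layer MLP.
    e = np.tanh(h @ W1 + s_prev @ W2) @ v      # shape (T_x,)
    # a_ij: softmax over input positions (shifted for numerical stability).
    a = np.exp(e - e.max())
    a /= a.sum()
    # c_i: weighted sum of all encoder hidden states.
    c = a @ h                                  # shape (n,)
    return a, c

rng = np.random.default_rng(0)
T_x, n, m, k = 5, 4, 3, 6
h = rng.normal(size=(T_x, n))
a, c = additive_attention(h, rng.normal(size=m),
                          rng.normal(size=(n, k)), rng.normal(size=(m, k)),
                          rng.normal(size=k))
```

The weights `a` sum to 1 over the input positions, so `c` is a convex combination of the encoder hidden states.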
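Overall structure
Overall, the program has four parts:
1) Input data handling / preprocessing
2) The seq2seq_attention model
3) The decoder
4) Beam search for generating the summary
The code consists of 8 source files: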
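File | Main purpose |
---|---|
seq2seq_attention.py | Main entry point of the whole program: drives the overall call logic and defines many of the input flags. Responsible for building the TensorFlow model and defining the ops. |
seq2seq_attention_decode.py | The seq2seq decoder |
seq2seq_attention_model.py | Implementation of the seq2seq_attention model; the core of the whole program |
seq2seq_lib.py | Helper functions for seq2seq operations |
beam_search.py | Beam search for generating the summary |
data_convert_example.py | An example script from the authors for converting data to and from the format the model reads |
data.py | Vocabulary handling (class Vocab) |
batch_reader.py | Splits the input data into batches (class Batcher) |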
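This walkthrough focuses on four files: seq2seq_attention.py, seq2seq_attention_model.py, seq2seq_attention_decode.py, and beam_search.py.
Code analysis
seq2seq_attention.py
We start with this file, seq2seq_attention.py.
Before building and training the model we need to set some parameters. In TensorFlow, tf.flags can be used to define global parameters.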
    FLAGS = tf.app.flags.FLAGS
    tf.app.flags.DEFINE_string('data_path',
                               '', 'Path expression to tf.Example.')
    tf.app.flags.DEFINE_string('vocab_path',
                               '', 'Path expression to text vocabulary file.')
    ......
    tf.app.flags.DEFINE_integer('num_gpus', 0, 'Number of gpus used.')
    tf.app.flags.DEFINE_integer('random_seed', 111, 'A seed value for randomness.')
Next comes the main function, which does the following:
1. Load the vocabulary that maps words to ids; see class Vocab in data.py for how it is built.
    vocab = data.Vocab(FLAGS.vocab_path, 1000000)
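As a rough picture of what such a word-to-id dictionary provides, here is a toy stand-in. `ToyVocab` and its methods are hypothetical names written for this sketch; it does not reproduce the file format or the API of `data.Vocab`, and it assumes that the second constructor argument caps the vocabulary size.

```python
class ToyVocab:
    """Hypothetical stand-in for a word <-> id vocabulary with a size cap."""

    def __init__(self, words, max_size, unk='<UNK>'):
        self._unk = unk
        self._word_to_id = {}
        self._id_to_word = {}
        # The unknown-word token always gets id 0.
        for word in [unk] + list(words):
            if len(self._word_to_id) >= max_size:
                break  # vocabulary is full, ignore the remaining words
            if word not in self._word_to_id:
                idx = len(self._word_to_id)
                self._word_to_id[word] = idx
                self._id_to_word[idx] = word

    def word_to_id(self, word):
        # Words outside the vocabulary map to the unknown-word id.
        return self._word_to_id.get(word, self._word_to_id[self._unk])

    def id_to_word(self, idx):
        return self._id_to_word[idx]

toy_vocab = ToyVocab(['the', 'cat', 'the', 'sat'], max_size=3)
```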
2. Define the hyperparameters of the attention seq2seq model
    hps = seq2seq_attention_model.HParams(
        mode=FLAGS.mode,  # train, eval, decode
        min_lr=0.01,  # min learning rate.
        lr=0.15,  # learning rate
        batch_size=batch_size,
        enc_layers=4,
        enc_timesteps=120,
        dec_timesteps=30,
        min_input_len=2,  # discard articles/summaries < than this
        num_hidden=256,  # for rnn cell
        emb_dim=128,  # If 0, don't use embedding
        max_grad_norm=2,
        num_softmax_samples=4096)  # If 0, no sampled softmax.
3. Get the input data split into batches; see class Batcher in batch_reader.py for the details.
    batcher = batch_reader.Batcher(
        FLAGS.data_path, vocab, hps, FLAGS.article_key,
        FLAGS.abstract_key, FLAGS.max_article_sentences,
        FLAGS.max_abstract_sentences, bucketing=FLAGS.use_bucketing,
        truncate_input=FLAGS.truncate_input)
4. Do different things depending on the mode
If the mode is 'train', start training.
Initialize the attention seq2seq model; see class Seq2SeqAttentionModel in seq2seq_attention_model.py for the details.
    model = seq2seq_attention_model.Seq2SeqAttentionModel(
        hps, vocab, num_gpus=FLAGS.num_gpus)
    _Train(model, batcher)  # run the training function
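Inside the _Train function
The main steps are:
(1) Build the graph
    model.build_graph()  # analyzed in detail later
(2) Set up the Supervisor
Training this model takes a long time, possibly several days, so the training process needs to:
- handle shutdowns and crashes cleanly
- be able to resume after a shutdown or crash
- be monitorable through TensorBoard
To resume after a shutdown, the training process must save checkpoints regularly. To be monitored, it must run the summary op regularly and append the returned values to the events file; TensorBoard watches the events file and plots graphs that report training progress over time. The authors therefore use a Supervisor to get a robust training process. For the advantages of Supervisor, see http://blog.csdn.net/mijiaoxiaosan/article/details/75021279
    sv = tf.train.Supervisor(logdir=FLAGS.log_root,
                             is_chief=True,
                             saver=saver,
                             summary_op=None,
                             save_summaries_secs=60,
                             save_model_secs=FLAGS.checkpoint_secs,
                             global_step=model.global_step)
The authors flush the events to disk every 100 steps. Calling summary_writer.flush() ensures that every event produced so far has been written to disk.
    if step % 100 == 0:
      summary_writer.flush()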
(3) Define what happens in each iteration
Each iteration does three things:
a) Fetch the current batch
b) Feed the current batch into the model for one training step (model.run_train_step is analyzed later)
c) Compute the running average of the loss
    (article_batch, abstract_batch, targets, article_lens, abstract_lens,
     loss_weights, _, _) = data_batcher.NextBatch()
    # Feed the current batch into the model and run one training step.
    (_, summaries, loss, train_step) = model.run_train_step(
        sess, article_batch, abstract_batch, targets, article_lens,
        abstract_lens, loss_weights)
    running_avg_loss = _RunningAvgLoss(
        running_avg_loss, loss, summary_writer, train_step)
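`_RunningAvgLoss` is not listed in this walkthrough. A common way to smooth a noisy per-batch loss is an exponential moving average; the sketch below assumes that approach. The function name and the `decay` value are illustrative and not copied from textsum.

```python
def running_avg_loss(prev_avg, loss, decay=0.99):
    """Exponential moving average of the loss (assumed smoothing scheme)."""
    if prev_avg == 0:
        # First step: there is nothing to average with yet.
        return loss
    return prev_avg * decay + (1.0 - decay) * loss

avg = 0.0
for batch_loss in [8.0, 6.0, 7.0, 5.0]:
    avg = running_avg_loss(avg, batch_loss)
```

With a decay close to 1 the average moves slowly, so a single unusually good or bad batch barely changes the reported value.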
If the mode is 'decode', run the decoder on the model.
    elif hps.mode == 'decode':
      decode_mdl_hps = hps
      # Note: because the state for each step's output is kept and fed back in,
      # only the first step has to be restored and reused, so dec_timesteps
      # is changed from 30 to 1.
      decode_mdl_hps = hps._replace(dec_timesteps=1)
      model = seq2seq_attention_model.Seq2SeqAttentionModel(
          decode_mdl_hps, vocab, num_gpus=FLAGS.num_gpus)  # build the Seq2SeqAttention model
      decoder = seq2seq_attention_decode.BSDecoder(model, batcher, hps, vocab)  # build the decoder
      decoder.DecodeLoop()  # run the decoding loop
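The `_replace` call above suggests that `HParams` is a `collections.namedtuple`. A minimal sketch of that pattern, using a shortened, hypothetical field list rather than the full set of textsum hyperparameters:

```python
from collections import namedtuple

# Hypothetical, shortened field list for illustration only.
HParams = namedtuple('HParams', 'mode batch_size enc_timesteps dec_timesteps')

hps = HParams(mode='decode', batch_size=4, enc_timesteps=120, dec_timesteps=30)

# namedtuples are immutable: _replace returns a modified copy and
# leaves the original untouched.
decode_hps = hps._replace(dec_timesteps=1)
```

Because the original tuple is unchanged, the decoder can still be handed `hps` with `dec_timesteps=30` while the model graph is built with a single decoder step.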
decoder.DecodeLoop() calls self._Decode(self._saver, sess), whose main job is to load the trained model and decode:
    def _Decode(self, saver, sess):
      # The checkpoint state file records what has been saved and locates the latest model.
      ckpt_state = tf.train.get_checkpoint_state(FLAGS.log_root)
      if not (ckpt_state and ckpt_state.model_checkpoint_path):
        tf.logging.info('No model to decode yet at %s', FLAGS.log_root)
        return False
      tf.logging.info('checkpoint path %s', ckpt_state.model_checkpoint_path)
      ckpt_path = os.path.join(
          FLAGS.log_root, os.path.basename(ckpt_state.model_checkpoint_path))
      saver.restore(sess, ckpt_path)  # load the model
      # See class DecodeIO in seq2seq_attention_decode.py: resets/clears the
      # output files before the decode results are written.
      self._decode_io.ResetFiles()
      for _ in xrange(FLAGS.decode_batches_per_ckpt):
        (article_batch, _, _, article_lens, _, _, origin_articles,
         origin_abstracts) = self._batch_reader.NextBatch()
        for i in xrange(self._hps.batch_size):
          bs = beam_search.BeamSearch(
              self._model, self._hps.batch_size,
              self._vocab.WordToId(data.SENTENCE_START),
              self._vocab.WordToId(data.SENTENCE_END),
              self._hps.dec_timesteps)
          article_batch_cp = article_batch.copy()
          article_batch_cp[:] = article_batch[i:i+1]
          article_lens_cp = article_lens.copy()
          article_lens_cp[:] = article_lens[i:i+1]
          # Best beam search result for this input; see the beam_search.py analysis below.
          best_beam = bs.BeamSearch(sess, article_batch_cp, article_lens_cp)[0]
          decode_output = [int(t) for t in best_beam.tokens[1:]]  # skip the first (start) token
          # Convert the word ids produced by the model into words and write out the result.
          self._DecodeBatch(
              origin_articles[i], origin_abstracts[i], decode_output)
      return True
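The two slice assignments in the inner loop rely on NumPy broadcasting: assigning a one-row slice to `article_batch_cp[:]` fills every row of the copy with article `i`. Since `batch_size` is also what gets passed to `BeamSearch` as its second argument, this appears to be how the batch dimension is reused as the beam for a single article. A small sketch with made-up array contents:

```python
import numpy as np

# Toy batch: 3 "articles" of 4 token ids each (values are arbitrary).
article_batch = np.array([[1, 2, 3, 4],
                          [5, 6, 7, 8],
                          [9, 10, 11, 12]])
i = 1

article_batch_cp = article_batch.copy()
# The (1, 4) slice is broadcast across all 3 rows of the copy.
article_batch_cp[:] = article_batch[i:i+1]
```

After the assignment every row of `article_batch_cp` holds article 1, while `article_batch` itself is unchanged.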
seq2seq_attention_model.py
Now let's look at the code of the attention-based seq2seq model.
Building the graph is the core of this module:
    def build_graph(self):
      self._add_placeholders()  # add the placeholders first, as the basis for defining the graph
      self._add_seq2seq()  # analyzed in detail below
      self.global_step = tf.Variable(0, name='global_step', trainable=False)
      if self._hps.mode == 'train':
        self._add_train_op()  # set up training: learning rate, optimizer, and so on
      self._summaries = tf.summary.merge_all()
The function _add_seq2seq(self) builds the seq2seq model. It does the following:
(1) Build the bidirectional LSTM encoder
    cell_fw = tf.contrib.rnn.LSTMCell(
        hps.num_hidden,
        initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=123),
        state_is_tuple=False)
    cell_bw = tf.contrib.rnn.LSTMCell(
        hps.num_hidden,
        initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
        state_is_tuple=False)
    (emb_encoder_inputs, fw_state, _) = tf.contrib.rnn.static_bidirectional_rnn(
        cell_fw, cell_bw, emb_encoder_inputs, dtype=tf.float32,
        sequence_length=article_lens)
(2) Build the decoder
    with tf.variable_scope('decoder'), tf.device(self._next_device()):
      # When decoding, use the output of the previous step as the input of the next step.
      loop_function = None
      if hps.mode == 'decode':
        # Loop function: takes the argmax of the previous step's output and embeds it.
        loop_function = _extract_argmax_and_embed(
            embedding, (w, v), update_embedding=False)
      cell = tf.contrib.rnn.LSTMCell(
          hps.num_hidden,
          initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=113),
          state_is_tuple=False)
      # Prepare the inputs of attention_decoder.
      encoder_outputs = [tf.reshape(x, [hps.batch_size, 1, 2*hps.num_hidden])
                         for x in encoder_outputs]
      self._enc_top_states = tf.concat(axis=1, values=encoder_outputs)
      self._dec_in_state = fw_state
      # During decoding, _dec_in_state is fed in by beam_search.
      # dec_out_state is stored by beam_search and fed in at the next step.
      initial_state_attention = (hps.mode == 'decode')
      # Call TensorFlow's attention_decoder function.
      decoder_outputs, self._dec_out_state = tf.contrib.legacy_seq2seq.attention_decoder(
          emb_decoder_inputs, self._dec_in_state, self._enc_top_states,
          cell, num_heads=1, loop_function=loop_function,
          initial_state_attention=initial_state_attention)
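What a loop function of this kind does can be sketched in NumPy. This is an illustration under assumptions, not the TensorFlow helper itself: `make_loop_function` is a hypothetical name, `w` and `b` stand for an output projection from decoder outputs to vocabulary logits, and gradient handling is ignored.

```python
import numpy as np

def make_loop_function(embedding, w, b):
    """Build a function that turns the previous decoder output into the next input."""
    def loop_function(prev, _step):
        logits = prev @ w + b                 # (batch, vocab_size)
        prev_symbol = logits.argmax(axis=1)   # greedy word choice per example
        return embedding[prev_symbol]         # (batch, emb_dim)
    return loop_function

# Toy sizes: vocabulary of 3 words, 2-d decoder output, 4-d embeddings.
embedding = np.arange(12, dtype=float).reshape(3, 4)
w = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])
b = np.zeros(3)
loop_fn = make_loop_function(embedding, w, b)
next_input = loop_fn(np.array([[0.2, 0.9], [0.7, 0.1]]), 1)
```

In this toy run the first example picks word 1 and the second picks word 0, so `next_input` contains rows 1 and 0 of the embedding matrix.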
Source of TensorFlow's attention_decoder function:
    def attention_decoder(decoder_inputs,
                          initial_state,
                          attention_states,
                          cell,
                          output_size=None,
                          num_heads=1,
                          loop_function=None,
                          dtype=None,
                          scope=None,
                          initial_state_attention=False):
      '''
      :param decoder_inputs: the embedded decoder inputs
      :param initial_state: the final encoder state, passed in from the encoder
      :param attention_states: the encoder outputs
      :param output_size: size of the cell output, not the vocabulary size
      :param num_heads: for each decoder hidden state, num_heads weighted sums of the encoder outputs are computed
      :param initial_state_attention: if True, initialize the attention from initial_state; if False, start from zero attention
      :return: (outputs, state)
      '''
      if not decoder_inputs:
        raise ValueError("Must provide at least 1 input to attention decoder.")
      if num_heads < 1:
        raise ValueError("With less than 1 heads, use a non-attention decoder.")
      if attention_states.get_shape()[2].value is None:
        raise ValueError("Shape[2] of attention_states must be known: %s" %
                         attention_states.get_shape())
      if output_size is None:
        output_size = cell.output_size
      with variable_scope.variable_scope(
          scope or "attention_decoder", dtype=dtype) as scope:
        dtype = scope.dtype
        batch_size = array_ops.shape(decoder_inputs[0])[0]  # Needed for reshaping.
        attn_length = attention_states.get_shape()[1].value
        if attn_length is None:
          attn_length = array_ops.shape(attention_states)[1]
        attn_size = attention_states.get_shape()[2].value
        # Attention score: v * tanh(W1*h_t + W2*d_i)
        # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
        hidden = array_ops.reshape(attention_states,
                                   [-1, attn_length, 1, attn_size])
        hidden_features = []  # holds the precomputed W1*h_t
        v = []
        attention_vec_size = attn_size  # Size of query vectors for attention.
        for a in xrange(num_heads):
          k = variable_scope.get_variable("AttnW_%d" % a,
                                          [1, 1, attn_size, attention_vec_size])
          # Compute W1*h_t with a 1x1 convolution.
          hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
          v.append(
              variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
        state = initial_state
        def attention(query):
          """Put attention masks on hidden using hidden_features and query."""
          # query is d_i, the decoder state at step i.
          # Given the decoder state, return the attention-weighted sums of the encoder outputs.
          ds = []  # Results of attention reads will be stored here.
          if nest.is_sequence(query):  # If the query is a tuple, flatten it.
            query_list = nest.flatten(query)
            for q in query_list:  # Check that ndims == 2 if specified.
              ndims = q.get_shape().ndims
              if ndims:
                assert ndims == 2
            query = array_ops.concat(query_list, 1)
          for a in xrange(num_heads):
            with variable_scope.variable_scope("Attention_%d" % a):
              # y = W2*d_i + b
              y = linear(query, attention_vec_size, True)
              y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
              # Compute s = v * tanh(W1*h_t + W2*d_i).
              # Attention mask is a softmax of v^T * tanh(...).
              s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                      [2, 3])
              # s holds the attention score of each encoder output; softmax turns the scores into weights a.
              a = nn_ops.softmax(s)
              # d = sum(h_t * a_t)
              # Now calculate the attention-weighted vector d.
              d = math_ops.reduce_sum(
                  array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
              # Each attention head yields one d; with num_heads > 1 we get a list of them.
              ds.append(array_ops.reshape(d, [-1, attn_size]))
          return ds
        outputs = []
        prev = None
        batch_attn_size = array_ops.stack([batch_size, attn_size])
        # attns: attention-weighted sums of the encoder outputs, computed from the previous decoder hidden state.
        attns = [
            array_ops.zeros(
                batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
        ]
        for a in attns:  # Ensure the second shape of attention vectors is set.
          a.set_shape([None, attn_size])
        # Initialize the attention of the first decoder step from the encoder state.
        if initial_state_attention:
          attns = attention(initial_state)
        for i, inp in enumerate(decoder_inputs):
          if i > 0:
            variable_scope.get_variable_scope().reuse_variables()
          # If loop_function is set, we use it instead of decoder_inputs.
          if loop_function is not None and prev is not None:
            with variable_scope.variable_scope("loop_function", reuse=True):
              inp = loop_function(prev, i)
          # Merge input and previous attentions into one vector of the right size.
          input_size = inp.get_shape().with_rank(2)[1]
          if input_size.value is None:
            raise ValueError("Could not infer input size from input: %s" % inp.name)
          # The cell input is a linear combination of the decoder input and the
          # attention vectors: x' = W * concat(x, attns)
          x = linear([inp] + attns, input_size, True)
          # Run the RNN.
          cell_output, state = cell(x, state)
          # Run the attention mechanism on the current state.
          if i == 0 and initial_state_attention:
            with variable_scope.variable_scope(
                variable_scope.get_variable_scope(), reuse=True):
              attns = attention(state)
          else:
            attns = attention(state)
          # The actual output is a linear combination of the cell output and the current attention.
          with variable_scope.variable_scope("AttnOutputProjection"):
            output = linear([cell_output] + attns, output_size, True)
          if loop_function is not None:
            prev = output
          outputs.append(output)
      return outputs, state
For a detailed analysis of the attention_decoder source, see http://blog.csdn.net/vincent_hbl/article/details/77097804
(3) Define the loss
    with tf.variable_scope('loss'), tf.device(self._next_device()):
      def sampled_loss_func(inputs, labels):
        with tf.device('/cpu:0'):  # Try gpu.
          labels = tf.reshape(labels, [-1, 1])
          # Candidate-sampling loss; see http://www.algorithmdog.com/tf-candidate-sampling
          return tf.nn.sampled_softmax_loss(
              weights=w_t, biases=v, labels=labels, inputs=inputs,
              num_sampled=hps.num_softmax_samples, num_classes=vsize)
      # Compute the weighted cross-entropy of the sequence.
      if hps.num_softmax_samples != 0 and hps.mode == 'train':
        self._loss = seq2seq_lib.sampled_sequence_loss(
            decoder_outputs, targets, loss_weights, sampled_loss_func)
      else:
        self._loss = tf.contrib.legacy_seq2seq.sequence_loss(
            model_outputs, targets, loss_weights)
      tf.summary.scalar('loss', tf.minimum(12.0, self._loss))
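As a rough illustration of what a weighted sequence cross-entropy computes, the sketch below averages per-token negative log-likelihoods using the loss weights (0 for padding positions). It is a plain-Python stand-in under simplifying assumptions (a single example, full softmax instead of sampled softmax), not the TensorFlow implementation, and `weighted_sequence_loss` is a hypothetical name.

```python
import math

def weighted_sequence_loss(logits, targets, weights):
    """Weighted average cross-entropy over the time steps of one sequence."""
    total, weight_sum = 0.0, 0.0
    for step_logits, target, w in zip(logits, targets, weights):
        # Negative log-softmax of the target class, computed stably.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += w * (log_z - step_logits[target])
        weight_sum += w
    # Guard against an all-padding sequence.
    return total / max(weight_sum, 1e-12)

# Two real tokens followed by one padding position (weight 0).
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 5.0, 5.0]]
loss = weighted_sequence_loss(logits, targets=[0, 1, 2], weights=[1.0, 1.0, 0.0])
```

Because the padding position has weight 0, its logits have no effect on the result.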
Last is beam_search.py, which implements the beam search algorithm and produces the summary words at decoding time.
The authors describe the module as follows:
"""
Beam search takes the top K results from the model, predicts the K results for
each of the previous K result, getting K*K results. Pick the top K results from
K*K results, and start over again until certain number of results are fully
decoded.
"""
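In other words, beam search keeps the top K candidate summaries produced by the model. At each step, each of the current K hypotheses is extended with its most likely next words, which gives K*K candidates; the best K of these are kept, and the process repeats until enough hypotheses are fully decoded.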
beam_search.py
The main flow of the BeamSearch function is shown below. For the underlying idea, see the paper A Neural Attention Model for Abstractive Sentence Summarization (Empirical Methods in Natural Language Processing, 2015), https://arxiv.org/abs/1509.00685
    def BeamSearch(self, sess, enc_inputs, enc_seqlen):
      """Performs beam search for decoding.
      Args:
        sess: tf.Session, session
        enc_inputs: ndarray of shape (enc_length, 1), the document ids to encode
        enc_seqlen: ndarray of shape (1), the length of the sequence
      Returns:
        hyps: list of Hypothesis, the best hypotheses found by beam search,
            ordered by score
      """
      # Run the encoder and extract the outputs and final state.
      enc_top_states, dec_in_state = self._model.encode_top_state(
          sess, enc_inputs, enc_seqlen)
      # Replicate the initial states K times for the first step.
      hyps = [Hypothesis([self._start_token], 0.0, dec_in_state)
             ] * self._beam_size
      results = []
      steps = 0
      while steps < self._max_steps and len(results) < self._beam_size:
        latest_tokens = [h.latest_token for h in hyps]
        states = [h.state for h in hyps]
        topk_ids, topk_log_probs, new_states = self._model.decode_topk(
            sess, latest_tokens, enc_top_states, states)
        # Extend each hypothesis.
        all_hyps = []
        # The first step takes the best K results from first hyps. Following
        # steps take the best K results from K*K hyps.
        num_beam_source = 1 if steps == 0 else len(hyps)
        for i in xrange(num_beam_source):
          h, ns = hyps[i], new_states[i]
          for j in xrange(self._beam_size*2):
            all_hyps.append(h.Extend(topk_ids[i, j], topk_log_probs[i, j], ns))
        # Filter and collect any hypotheses that have the end token.
        hyps = []
        for h in self._BestHyps(all_hyps):
          if h.latest_token == self._end_token:
            # Pull the hypothesis off the beam if the end token is reached.
            results.append(h)
          else:
            # Otherwise continue to extend the hypothesis.
            hyps.append(h)
          if len(hyps) == self._beam_size or len(results) == self._beam_size:
            break
        steps += 1
      if steps == self._max_steps:
        results.extend(hyps)
      return self._BestHyps(results)
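To see the same control flow without TensorFlow, here is a small self-contained sketch of beam search over a toy next-token table. The table `NEXT`, the token names, and the function `beam_search` are all invented for illustration. Unlike the code above, each hypothesis is extended with every possible next token rather than the model's top `2*K`, and hypotheses are ranked by their total log-probability.

```python
import math

# Toy "model": log-probabilities of the next token given the last token.
NEXT = {
    '<s>': {'a': math.log(0.6), 'b': math.log(0.4)},
    'a':   {'b': math.log(0.5), '</s>': math.log(0.5)},
    'b':   {'a': math.log(0.1), '</s>': math.log(0.9)},
}

def beam_search(beam_size, max_steps, start='<s>', end='</s>'):
    """Return hypotheses as (tokens, log_prob) pairs, best first."""
    hyps = [([start], 0.0)]
    results = []
    steps = 0
    while steps < max_steps and hyps and len(results) < beam_size:
        # Extend every live hypothesis with every candidate next token.
        candidates = [(tokens + [tok], score + lp)
                      for tokens, score in hyps
                      for tok, lp in NEXT[tokens[-1]].items()]
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        hyps = []
        for tokens, score in candidates:
            if tokens[-1] == end:
                results.append((tokens, score))   # finished: leave the beam
            else:
                hyps.append((tokens, score))      # keep extending
            if len(hyps) == beam_size or len(results) == beam_size:
                break
        steps += 1
    if steps == max_steps:
        # Step budget exhausted: fall back to the unfinished hypotheses.
        results.extend(hyps)
    return sorted(results, key=lambda hyp: hyp[1], reverse=True)

best_tokens, best_score = beam_search(beam_size=2, max_steps=5)[0]
```

Note that greedy decoding would pick `a` first (probability 0.6) and end with a total probability of 0.3, while the beam also keeps `b` alive and finds the more probable sequence `<s> b </s>`.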
References:
[1] http://blog.csdn.net/real_myth/article/details/69569169
[2] http://blog.csdn.net/mijiaoxiaosan/article/details/75021279
[3] http://blog.csdn.net/vincent_hbl/article/details/77097804
[4] https://arxiv.org/abs/1509.00685
[5] http://blog.csdn.net/u012436149/article/details/52976413