TA-Seq2Seq model
The overall model architecture is shown below:
Figure 2 shows the topic-aware seq2seq model, which builds on standard seq2seq and incorporates topic information through a joint attention mechanism and a biased generation probability.
Obtaining topic words
We use the Twitter LDA model. Each input message x is assigned a topic z, and the n words in x's vocabulary most related to that topic (we take n = 100) form its topic word set K. The generation model is then trained on triples of the input message x, the topic words K, and the response y.
The LDA model parameters are estimated with the collapsed Gibbs sampling algorithm (Zhao et al. 2011). With the trained model we obtain the topic of x and select the n words with the highest probability as its topic words.
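As a sketch of this step (with a hypothetical count matrix and toy vocabulary, not the actual Twitter LDA implementation), the topic words of a message can be read off the word-topic assignment counts produced by Gibbs sampling:

```python
import numpy as np

# Hypothetical word-topic assignment counts from collapsed Gibbs sampling:
# C[w, z] = number of times word w was assigned to topic z.
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(500)]          # toy vocabulary
C = rng.integers(0, 50, size=(len(vocab), 8))  # 500 words x 8 topics

def topic_words(z, n=100):
    """Top-n words for topic z, ranked by assignment count."""
    order = np.argsort(C[:, z])[::-1][:n]
    return [vocab[i] for i in order]

# Once the LDA model assigns an input message x its most probable topic z*,
# the topic word set K for x is the top-n word list of that topic.
K = topic_words(z=3, n=100)
print(len(K))  # 100
```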
During learning we need a vector representation for each topic word. We first compute a distribution over topics for topic word w:

p(z|w) = C_{wz} / Σ_{z'} C_{wz'}    (4)

where C_{wz} is the number of times w was assigned to topic z during training. This distribution over topics serves as the vector representation of the topic word.
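A minimal numpy illustration of this normalization, using toy counts rather than trained values:

```python
import numpy as np

# Toy assignment counts C[w, z]; rows are topic words, columns are topics.
C = np.array([[30.0, 5.0, 15.0],
              [2.0, 40.0, 8.0]])

# Normalizing each row over the topics turns the counts into the word's
# distribution over topics, which is used as its vector representation.
p = C / C.sum(axis=1, keepdims=True)
# p[0] == [0.6, 0.1, 0.3]; every row sums to 1
```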
In our experiments, we train the LDA model on a large-scale Sina Weibo corpus. The data provides both topic information and conversation pairs, allowing us to train the dialogue generation model.
Besides LDA, topic words could also be generated with tag recommendation or keyword extraction, or obtained from other resources such as Wikipedia and other web documents.
LDA model training:
We crawl 30 million posts from Sina Weibo to train the LDA model, setting the number of topics to T = 200 and the model parameters to α = 1/T and β = 0.01. For each topic we take the top 100 words as its topic words. To filter out overly common words, we compute word frequencies over the 30 million posts and remove the 2000 most frequent words, which yields the topic word dictionary; words outside this dictionary are treated as UNK.
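The frequency filtering can be sketched as follows (a toy corpus and a cutoff of k = 2 stand in for the 30 million posts and the top-2000 cutoff):

```python
from collections import Counter

# Toy corpus standing in for the 30M Weibo posts; the 2 most frequent
# words stand in for the post's top-2000 frequency cutoff.
corpus = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]
freq = Counter(w for line in corpus for w in line.split())
stop = {w for w, _ in freq.most_common(2)}   # most frequent words, filtered out
vocab = set(freq) - stop                     # remaining topic word dictionary

def lookup(w):
    """Map out-of-dictionary words to UNK, as described above."""
    return w if w in vocab else "UNK"
```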
seq2seq
(1) In the encoding stage, the input x is encoded with a bidirectional GRU, producing outputs {h_t}_{t=1}^{T}. The corresponding code is as follows:
class BidirectionalEncoder(Initializable):
    """Encoder of RNNsearch model."""
    def __init__(self, vocab_size, embedding_dim, state_dim, **kwargs):
        super(BidirectionalEncoder, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.state_dim = state_dim

        self.lookup = LookupTable(name='embeddings')
        self.bidir = BidirectionalWMT15(
            GatedRecurrent(activation=Tanh(), dim=state_dim))
        self.fwd_fork = Fork(
            [name for name in self.bidir.prototype.apply.sequences
             if name != 'mask'], prototype=Linear(), name='fwd_fork')
        self.back_fork = Fork(
            [name for name in self.bidir.prototype.apply.sequences
             if name != 'mask'], prototype=Linear(), name='back_fork')
        self.children = [self.lookup, self.bidir,
                         self.fwd_fork, self.back_fork]

    def _push_allocation_config(self):
        self.lookup.length = self.vocab_size
        self.lookup.dim = self.embedding_dim

        self.fwd_fork.input_dim = self.embedding_dim
        self.fwd_fork.output_dims = [self.bidir.children[0].get_dim(name)
                                     for name in self.fwd_fork.output_names]
        self.back_fork.input_dim = self.embedding_dim
        self.back_fork.output_dims = [self.bidir.children[1].get_dim(name)
                                      for name in self.back_fork.output_names]

    @application(inputs=['source_sentence', 'source_sentence_mask'],
                 outputs=['representation'])
    def apply(self, source_sentence, source_sentence_mask):
        # Time as first dimension
        source_sentence = source_sentence.T
        source_sentence_mask = source_sentence_mask.T

        embeddings = self.lookup.apply(source_sentence)
        representation = self.bidir.apply(
            merge(self.fwd_fork.apply(embeddings, as_dict=True),
                  {'mask': source_sentence_mask}),
            merge(self.back_fork.apply(embeddings, as_dict=True),
                  {'mask': source_sentence_mask})
        )
        return representation  # [seq_len, batch_size, 2*state_dim]
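The shape of the encoder output can be illustrated with a small numpy sketch; a plain tanh RNN cell stands in for the GRU, and all dimensions are made up for illustration:

```python
import numpy as np

# Toy dimensions; a plain tanh RNN cell stands in for the GRU.
rng = np.random.default_rng(1)
seq_len, batch, emb_dim, state_dim = 5, 2, 4, 3

x = rng.normal(size=(seq_len, batch, emb_dim))       # embedded input, time-major
W = rng.normal(size=(emb_dim, state_dim)) * 0.1
U = rng.normal(size=(state_dim, state_dim)) * 0.1

def run(seq):
    """One recurrent pass over the sequence, returning all hidden states."""
    h, out = np.zeros((batch, state_dim)), []
    for t in range(seq.shape[0]):
        h = np.tanh(seq[t] @ W + h @ U)
        out.append(h)
    return np.stack(out)

h_fwd = run(x)                  # left-to-right pass
h_bwd = run(x[::-1])[::-1]      # right-to-left pass, re-reversed to align in time
H = np.concatenate([h_fwd, h_bwd], axis=-1)
print(H.shape)  # (5, 2, 6) == (seq_len, batch_size, 2*state_dim)
```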
Meanwhile, the vector representations of the topic words k of the current message x are looked up in the table of topic word vectors computed with Equation (4); in the figure, hydrate, skin, face, facemask, and moisturize are the topic words, and k1, k2, k3, …, k10 are the vector representations obtained from the lookup table.
In the implementation, the topic words are fed into an MLP to obtain the topic representation; the code is as follows:
class topicalq_transformer(Initializable):
    def __init__(self, vocab_size, topical_embedding_dim, state_dim, word_num,
                 batch_size, **kwargs):
        super(topicalq_transformer, self).__init__(**kwargs)
        self.vocab_size = vocab_size
        self.word_embedding_dim = topical_embedding_dim
        self.state_dim = state_dim
        self.word_num = word_num
        self.batch_size = batch_size

        self.look_up = LookupTable(name='topical_embeddings')
        self.transformer = MLP(activations=[Tanh()],
                               dims=[self.word_embedding_dim * self.word_num,
                                     self.state_dim],
                               name='topical_transformer')
        self.children = [self.look_up, self.transformer]

    def _push_allocation_config(self):
        self.look_up.length = self.vocab_size
        self.look_up.dim = self.word_embedding_dim
        # do we have to push_config? remain unsure

    @application(inputs=['source_topical_word_sequence'],
                 outputs=['topical_embedding'])
    def apply(self, source_topical_word_sequence):
        # Time as first dimension
        source_topical_word_sequence = source_topical_word_sequence.T  # [word_num, batch_size]
        word_topical_embeddings = self.look_up.apply(
            source_topical_word_sequence)  # [word_num, batch_size, embedding_dim]
        word_topical_embeddings
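Outside of Blocks, this lookup-then-MLP computation can be sketched in plain numpy (all dimensions are hypothetical; a single tanh layer matches the MLP above):

```python
import numpy as np

# Toy dimensions for illustration.
rng = np.random.default_rng(2)
vocab_size, emb_dim, word_num, state_dim, batch = 50, 8, 10, 6, 4

E = rng.normal(size=(vocab_size, emb_dim))                  # topical embedding table
W = rng.normal(size=(emb_dim * word_num, state_dim)) * 0.1  # MLP weights
ids = rng.integers(0, vocab_size, size=(batch, word_num))   # topic word ids per example

emb = E[ids]                      # embedding lookup: [batch, word_num, emb_dim]
flat = emb.reshape(batch, -1)     # concatenate the word_num topic word vectors
topic_repr = np.tanh(flat @ W)    # single tanh layer producing the topic representation
print(topic_repr.shape)  # (4, 6) == (batch_size, state_dim)
```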