Paper Reading: Best of Both Worlds: Transferring Knowledge from D to G

First, pretrain D and G. Then fix D and let G repeatedly sample responses, updating G with the supervision signal from D. Gumbel-Softmax is used here to get around the non-differentiable sampling step.
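As a minimal sketch of that last point (PyTorch; the temperature and tensor shapes are assumed for illustration, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Unnormalized next-token scores from G for a batch of 4 partial responses
# over a 1000-word vocabulary (shapes are illustrative only).
logits = torch.randn(4, 1000)

# Straight-through Gumbel-Softmax: the forward pass yields a one-hot sample,
# while the backward pass uses the soft relaxation, so D's score on the
# sampled response can be back-propagated into G despite the discrete choice.
tau = 0.5  # temperature (assumed value); lower values approach hard sampling
sampled_tokens = F.gumbel_softmax(logits, tau=tau, hard=True)  # (4, 1000), one-hot rows
```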

The authors start from the generic/safe-response problem of MLE (or, equivalently, cross-entropy) training: an MLE-trained generative model tends to "game" MLE by averaging over the training corpus, and therefore gravitates toward frequent responses.

One reason for this emergent behavior is that the space of possible next utterances in a dialog is highly multi-modal (there are many possible paths a dialog may take in the future). In the face of such highly multi-modal output distributions, models 'game' MLE by latching on to the head of the distribution or the frequent responses, which by nature tend to be generic and widely applicable.

One class of solutions to this MLE problem is sequence-level training, specifically using reinforcement learning to optimize task-specific sequence metrics. Unfortunately, dialogue has no automatic metric that is highly correlated with human judgement (in other words, the authors argue there is no good objective function for dialogue).

My impression: because the source constrains the target so weakly, the target sequence lives in a very large space (semantic or otherwise), yet the MLE constraint is overly tight, and the fact that most generation datasets pair each source with only a single target makes this worse. This is where a learnable supervisor, denoted D, becomes important. The key question is how to learn a genuinely good D that can cope with the huge output space and assign a high score to a response that is not the ground truth but is still appropriate, and whether that can be learned on the same dataset.

This is also the authors' idea: such responses should lie close to one another in the hidden space that D maps them into.

D learns a task-dependent perceptual similarity and learns to recognize multiple correct responses in the feature space.  The interaction between responses is captured via the similarity between the learned embeddings.

Note that this is the Visual Dialog setting, which is more constrained than open-domain dialogue.

The core part, the discriminator loss:

In particular, it needs to encourage perceptually meaningful similarities. 

The N-pair loss objective encourages learning a space in which the ground truth answer is scored higher than other options, and at the same time, encourages options similar to ground truth answers to score better than dissimilar ones.

Unlike the multi-class logistic loss, options that are correct but different from the ground-truth option are not overly penalized, and thus can be useful in providing a reliable signal to the generator. The L2 norm of the embedding vectors is regularized to be small.
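A small sketch of what such an N-pair objective could look like for D (PyTorch; the tensor names, shapes, and regularization weight are my assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def n_pair_loss(context_emb, option_embs, gt_index, l2_weight=1e-3):
    """N-pair-style discriminator loss (sketch).

    context_emb: (B, D)    embedding of image + dialog history + question
    option_embs: (B, K, D) embeddings of the K candidate answers
    gt_index:    (B,)      index of the ground-truth answer among the K options
    """
    # Similarity between the context and every candidate answer.
    scores = torch.bmm(option_embs, context_emb.unsqueeze(2)).squeeze(2)  # (B, K)

    # Softmax cross-entropy over the options: the ground truth should score
    # highest, but options that lie close to it in embedding space keep
    # relatively high scores instead of being pushed down by a hard margin.
    loss = F.cross_entropy(scores, gt_index)

    # Keep embedding norms small, as the post notes.
    reg = context_emb.pow(2).sum(dim=1).mean() + option_embs.pow(2).sum(dim=2).mean()
    return loss + l2_weight * reg
```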

Generator loss
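The post's own derivation of this part did not survive, so below is only a rough sketch of how the pieces described above could be wired together: the frozen D embeds the Gumbel-Softmax-sampled response, and G is pushed to score well under D's learned similarity. All helper names (`G.next_token_logits`, `D.embed_context`, `D.embed_response`) are hypothetical, and this is not the paper's exact objective.

```python
import torch.nn.functional as F

def generator_step(G, D, batch, optimizer, tau=0.5):
    """One generator update with a frozen discriminator (rough sketch)."""
    for p in D.parameters():                      # D stays fixed in this phase
        p.requires_grad_(False)

    logits = G.next_token_logits(batch)                    # (B, T, V), hypothetical helper
    tokens = F.gumbel_softmax(logits, tau=tau, hard=True)  # differentiable token samples

    ctx = D.embed_context(batch)      # (B, D) context embedding, hypothetical helper
    resp = D.embed_response(tokens)   # (B, D) embedding of the sampled response

    # Encourage the sampled response to lie close to the context in D's
    # perceptual-similarity space (a stand-in for the paper's actual loss).
    loss = -(ctx * resp).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```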

 
