Paper Reading: RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems

Lcyztf

Categories: Evaluation Metric, Dialogue Systems

Article link: https://blog.csdn.net/Lcyztf/article/details/81146844

Core question: What makes a good reply in open-domain dialog systems?

I. Observation

1. Resembling the groundtruth generally implies a good reply.

The more similar a generated reply is to the groundtruth, the better. This is a general assumption.

Note, however, that in the short-text conversation setting the query and reply are typically short and casual. The common non-parametric, count-based metrics BLEU, ROUGE, and METEOR all essentially measure word overlap, which is not ideal when texts are short and their semantics vary widely: exact word overlapping is too strict in the dialog setting.
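To make "too strict" concrete, here is a minimal sketch of clipped unigram precision, the 1-gram building block of BLEU. It is not the full BLEU metric (no higher-order n-grams, no brevity penalty), and the function name and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: fraction of candidate words found in the reference."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(reference.lower().split())
    # A candidate word is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / len(cand_tokens)


# Hypothetical replies to a query such as "Shall we meet at six?"
reference = "sounds great see you then"
paraphrase = "sure that works for me"   # acceptable reply, no shared words
exact_copy = "sounds great see you then"

print(unigram_precision(paraphrase, reference))
print(unigram_precision(exact_copy, reference))
```

The paraphrase is a perfectly reasonable reply, yet it shares no word with the reference, so an overlap-based score gives it no credit at all.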

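2. A groundtruth reply is merely one way to respond.

For a given query, the appropriate reply is not unique: replies with different, or even opposite, meanings may all be good.

The training data also contains many replies like "I don't know.", which seem appropriate but carry no information.

The observation implies that a groundtruth alone is insufficient for the evaluation of open-domain dialog systems.

A single groundtruth reply therefore cannot evaluate (or train) a model well on its own. The paper considers only evaluation, not training.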

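3. A query itself provides useful information in judging the quality of a reply.

Comparing the generated reply directly against the query can also indicate how good the reply is. In other words, the query provides sufficient (even more than expected) information.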

II. Model

1. Referenced Metric: measure the quality of a reply against the groundtruth, using embedding + pooling + cosine similarity.

We adopt the vector pooling approach that summarizes sentence information by choosing the maximum and minimum values in each dimension; the closeness of a sentence pair is measured by the cosine score.

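In short: over the word embeddings of a sentence, max pooling on each dimension gives v_max and min pooling gives v_min; the two are concatenated into one sentence vector, and the cosine similarity between the reply vector and the groundtruth vector is the score.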
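The pooling-and-cosine procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, and the 3-dimensional "embeddings" are toy numbers, whereas a real setup would use pretrained word vectors.

```python
import math


def pool_max_min(word_vectors):
    """Concatenate the per-dimension max and min over a sentence's word vectors."""
    dims = list(zip(*word_vectors))  # transpose: one tuple per embedding dimension
    v_max = [max(d) for d in dims]
    v_min = [min(d) for d in dims]
    return v_max + v_min


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def referenced_score(reply_vectors, groundtruth_vectors):
    """Score a reply against the groundtruth via pooled sentence vectors."""
    return cosine(pool_max_min(reply_vectors), pool_max_min(groundtruth_vectors))


# Toy word embeddings (made up): one inner list per word.
reply = [[0.2, -0.1, 0.5], [0.4, 0.3, -0.2]]
groundtruth = [[0.1, -0.2, 0.6], [0.5, 0.2, -0.1], [0.0, 0.1, 0.0]]

print(pool_max_min(reply))
print(referenced_score(reply, groundtruth))
```

Note that the two sentences may differ in length; pooling always yields a vector of twice the embedding dimension, which is what makes the cosine comparison possible.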

Question: what is the intuition behind pooling each embedding dimension this way? Individual embedding dimensions should not carry any definite meaning.

We use such heuristic matching because we assume no groundtruth scores, making it infeasible to train a parametric model.

This point is instructive: a non-parametric method needs neither annotation nor training.

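2. Unreferenced Metric: measure the quality of a reply against the query, using a purpose-built neural network.

Note that the job of this network is to score an arbitrary reply given a query, which is essentially what the matching model M(·) does in retrieval-based systems.

① A bi-GRU serves as the encoder, with the last hidden state taken as the sentence embedding. Because the texts are short, an RNN does well enough.

② A bilinear function q^T·M·r captures quadratic features.

③ An MLP (linear + tanh + linear + sigmoid) outputs the score s.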
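Steps ② and ③ can be sketched as a forward pass in plain Python. Several things here are assumptions rather than details from the notes above: the bi-GRU encoder of step ① is omitted (q and r stand in for its output vectors), the MLP input is taken to be the concatenation of q, the quadratic feature, and r, and the class name, layer sizes, and random initialisation are purely illustrative.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class UnreferencedScorer:
    """Untrained forward pass: bilinear feature followed by a two-layer MLP."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.M = random_matrix(rng, dim, dim)              # bilinear matrix
        self.W1 = random_matrix(rng, hidden, 2 * dim + 1)  # first linear layer
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]  # second linear layer
        self.b2 = 0.0

    def score(self, q, r):
        # Step 2: quadratic feature q^T M r.
        quad = sum(qi * mi for qi, mi in zip(q, matvec(self.M, r)))
        # Assumed MLP input: [q; q^T M r; r].
        features = list(q) + [quad] + list(r)
        # Step 3: linear + tanh + linear + sigmoid.
        hidden = [math.tanh(h + b) for h, b in zip(matvec(self.W1, features), self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))  # score in (0, 1)


# Toy sentence vectors standing in for encoder outputs.
q_vec = [0.1, 0.2, 0.3]
r_vec = [0.3, -0.1, 0.0]
scorer = UnreferencedScorer(dim=3, hidden=4, seed=0)
s = scorer.score(q_vec, r_vec)
print(s)
```

With random weights the score is meaningless; the point is only the shape of the computation. In practice one would implement this in a deep learning framework so the encoder, M, and the MLP can be trained jointly.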

Training uses negative sampling: given a ground truth query-reply pair, we randomly choose another reply r in the training set as a negative sample.

The training objective is a hinge loss. (When exactly should one use hinge loss?)
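A hinge (margin ranking) loss only asks that the true reply outscore the sampled negative by some margin, which suits a setting with no absolute quality labels. A minimal sketch, assuming the common form max(0, margin - s(q, r) + s(q, r_neg)); the margin value of 0.5 and both helper names are hypothetical choices for illustration.

```python
import random


def hinge_loss(pos_score, neg_score, margin=0.5):
    """Zero once the positive reply beats the negative one by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)


def sample_negative(replies, positive_index, rng):
    """Pick a random reply from the training set other than the ground truth."""
    candidates = [i for i in range(len(replies)) if i != positive_index]
    return replies[rng.choice(candidates)]


# Toy training replies (made up); index 0 is the ground truth for some query.
replies = ["see you at six", "i like pizza", "the train was late"]
rng = random.Random(0)
negative = sample_negative(replies, positive_index=0, rng=rng)
print(negative)

print(hinge_loss(0.9, 0.2))  # margin already satisfied, so no loss
print(hinge_loss(0.6, 0.4))  # scores too close, so the pair is penalised
```

Once a pair is separated by the margin it contributes no gradient, so training effort concentrates on pairs the scorer still confuses.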

3. Hybrid Approach: combine the two scores.
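The notes do not spell out how the two scores are combined. One plausible sketch, under the assumption that each metric is first min-max normalised to [0, 1] over the evaluated replies and then merged with a simple heuristic (min, max, arithmetic mean, or geometric mean); the function names and toy scores are illustrative only.

```python
import math


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Candidate combination heuristics (assumed, not taken from the notes above).
COMBINERS = {
    "min": min,
    "max": max,
    "arithmetic": lambda a, b: (a + b) / 2.0,
    "geometric": lambda a, b: math.sqrt(a * b),
}


def hybrid_scores(referenced, unreferenced, strategy="min"):
    """Normalise both metrics, then combine them reply by reply."""
    combine = COMBINERS[strategy]
    ref_norm = min_max_normalize(referenced)
    unref_norm = min_max_normalize(unreferenced)
    return [combine(a, b) for a, b in zip(ref_norm, unref_norm)]


# Toy scores for three replies (made up).
referenced = [0.0, 0.5, 1.0]
unreferenced = [0.2, 0.2, 0.6]
print(hybrid_scores(referenced, unreferenced, "min"))
print(hybrid_scores(referenced, unreferenced, "arithmetic"))
```

Normalising first matters because a cosine score and a sigmoid output live on different scales; taking the min is the conservative choice, since a reply must do well on both metrics to score highly.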

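III. Experiment: two interesting points

1. The query itself provides no less information than the ground truth. (More of an intuitive impression.)

2. Cosine similarity performs poorly in complex, varied settings that require capturing rich contextual semantics. Which scenarios it does suit still needs to be summarized and thought through. The accompanying figure shows that the gap is considerable.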

Our neural network scorer outperforms the embedding-based cosine measure. This is because cosine mainly captures similarity, but the rich semantic relationship between queries and replies necessitates more complicated mechanisms like neural networks.

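IV. Some views and insights on dialog system evaluation (from others)

BLEU is reasonable in MT largely because it captures a core property of MT: translated texts are aligned with each other, so a metric can be designed at the word level. A suitable metric for dialog has probably not been proposed yet largely because people have not worked out what the core characteristic of dialog is relative to MT or other fields. Starting from that characteristic seems more reliable than devising assorted tricks.

In How NOT To Evaluate Your Dialogue System, the authors point out: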

We find that all metrics show either weak or no correlation with human judgements, despite the fact that word overlap metrics have been used extensively in the literature for evaluating dialogue response models.

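That is:

On the chitchat dataset, these metrics correlate only weakly with human judgment (only a small positive correlation on chitchat oriented Twitter dataset)
On the technical dataset, these metrics do not correlate with human judgment at all (no correlation at all on the technical UDC)
When restricted to a very specific domain, BLEU can perform reasonably well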
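"Correlation with human judgment" is typically checked by scoring the same replies with the metric and with human annotators, then computing a correlation coefficient. A minimal sketch using Pearson correlation; the function name and all numbers below are invented toy data, not results from any paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): automatic metric scores and human ratings for five replies.
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.20]
human_ratings = [2.0, 1.0, 4.0, 3.0, 5.0]

print(pearson(metric_scores, human_ratings))
```

A coefficient near 0 means the metric says little about what humans think of a reply; values near 1 or -1 indicate a strong linear relationship.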
