Paper reading: STC, a dataset for single-turn short-text conversation (Wang et al. 2013, Noah's Ark Lab)

First, a gripe: the complete human-labelled dataset is not publicly released...

This dataset is built from Sina Weibo. The posts are crawled from the accounts of Chinese NLP researchers (so the posts are of relatively high quality), while the comments (replies) can be written by anyone.

I. How the dataset was constructed:

1. Crawling the community of users

First, 10 well-known NLP researchers who are active on Sina Weibo (active accounts with many followees) are chosen as seeds; their followees are then crawled, yielding more than 3,200 NLP/ML people.

2. Crawling their Weibo posts and their responses

The crawl ran for two months. The paper notes that the topics are mainly concentrated in Research, General Arts and Science, IT Technology, Life, and so on.

3. Processing, Filtering, and Data Cleaning

First comes a four-step filtering (note that it operates per post; a minimal sketch of the pipeline follows the list):

① Keep a pair only if len(post) > 10 and len(response) > 5 (this also drops replies such as "Wow", "Well said!" or "Nice").

② For each post, keep only the first 100 responses, because the (chronologically) earlier responses are generally more closely tied to the post and more relevant. All responses are nevertheless kept in the response bank.

③ Remove potential advertisements using heuristic methods.

④ Remove punctuation marks and emoticons, and apply Chinese word segmentation.
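Below is a minimal sketch of such a per-post filter. The length thresholds and the 100-response cap come from the paper; the ad patterns and the cleaning regex are placeholders of my own, and a real pipeline would also run a Chinese word segmenter at the end.

```python
import re

# Placeholder advertisement heuristics (my own, not the paper's).
AD_PATTERNS = [re.compile(p) for p in (r"https?://", r"代购", r"促销")]

def filter_post(post, responses, max_responses=100):
    """Sketch of the four-step per-post filter."""
    # Step 1: length thresholds; this also drops replies like "Wow", "Well said!", "Nice".
    if len(post) <= 10:
        return None
    kept = [r for r in responses if len(r) > 5]
    # Step 2: keep only the first 100 responses (earlier replies tend to be more relevant).
    kept = kept[:max_responses]
    # Step 3: heuristic removal of potential advertisements.
    kept = [r for r in kept if not any(p.search(r) for p in AD_PATTERNS)]
    # Step 4: strip punctuation and emoticons (word segmentation would follow here).
    def clean(s):
        return re.sub(r"[^\w\u4e00-\u9fff]+", " ", s).strip()
    return clean(post), [clean(r) for r in kept]
```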

4. Labelling

We employ a pooling strategy widely used in information retrieval for getting the instance to label.

For each post, three baseline retrieval models each pick 10 candidate responses, and the resulting candidate set of at most 30 responses is labelled manually as suitable or unsuitable. (The candidates do not include the post's original human replies; a sketch of this pooling step is given below.)
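A minimal sketch of this pooling step, assuming each baseline retriever exposes a hypothetical top_k(post, k) method (an illustrative interface, not the paper's code):

```python
def pool_candidates(post, retrievers, k=10, exclude=()):
    """Union of the top-k candidates from each baseline retriever (pooling),
    excluding the post's own original replies."""
    pooled, seen = [], set(exclude)
    for retriever in retrievers:
        for response in retriever.top_k(post, k):
            if response not in seen:
                seen.add(response)
                pooled.append(response)
    return pooled  # at most k * len(retrievers) candidates, sent to human annotators
```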

More specifically the suitability of a response is judged based on the following three criteria:

Semantic Relevance: beyond relevance at the abstract semantic level, the most obvious requirement is entity alignment.

Logic Consistency: exactly what the name says; the content of the response should be logically consistent with the post.

Speech Act Alignment: for example, if I ask where something is, the response should not merely tell me where it is not.

 

II. What the dataset contains:

1. Statistics of the dataset

2. Structure and splits:

1) Original (Post, Response) Pairs

These are the crawled post-response pairs after processing, cleaning, and filtering. They still contain noise; for example, they could be spam or could be targeting some responses given earlier rather than the post itself. There are 628,833 pairs.

2) Labeled Pairs

① The labeling is only on a small subset of posts.

② For each selected post, the labeled responses are not originally given to it.

③ So the obtained labels are much more informative than those on randomly selected pairs (over 98% of which are negative). This part of the data can be directly used for training and testing of retrieval-based response models. We have labeled 422 posts and, for each of them, about 30 candidate responses.

So the responses in the labeled pairs are not the posts' own original replies. (At first I read the "98% negative" as applying to the original pairs from 1), i.e., a post with its real reply, which would be strange. But the figure is about randomly selected (post, response) pairs: pair a post with a random response from the bank and over 98% of the time it is unsuitable, which is precisely why random pairs carry so little label information.)

3) Responses

This part of dataset contains only responses, but they are not necessarily for a certain post.

It collects all the responses, including but not limited to the responses in Part 1 and 2.

This part contains only responses; they do not come as post-response pairs.

III. Given these characteristics, how should the dataset be used?

1. Training Low-level Matching Features

The rather abundant original (post, response) pairs provide rather rich supervision signal for learning different matching patterns between a post and a response.
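As a deliberately simple, illustrative example of one such matching feature (not from the paper; scikit-learn and the toy corpus are my own assumptions): the TF-IDF cosine similarity between a post and a response.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of already-segmented posts and responses (placeholders, not STC data).
corpus = ["今天 天气 不错", "推荐 一个 NLP 数据集", "天气 确实 很好", "数据集 在 哪里 下载"]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+").fit(corpus)

def tfidf_cosine(post, response):
    """One low-level matching feature: cosine similarity of TF-IDF vectors."""
    vecs = vectorizer.transform([post, response])
    return float(cosine_similarity(vecs[0], vecs[1])[0, 0])

print(tfidf_cosine("今天 天气 不错", "天气 确实 很好"))  # higher than for an unrelated reply
```

The paper's model combines several such features; this is only the most basic word-overlap style signal that the abundant original pairs could help calibrate.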

2. Training Automatic Response Models

Although the original (post, response) pairs are rather abundant, they are not enough for discriminative training and testing of retrieval models.

The following is the key point. The reduced candidate set used for labelling was itself produced by several baseline retrieval models, so among the labeled pairs the positives and negatives are harder to tell apart; compared with randomly sampled negatives, these give more informative parameter updates and result in a better model.

In the labeled pairs, both positive and negative ones are ranked high by some baseline models, and hence more difficult to tell apart. This supervision will naturally tune the model parameters to find the real good responses from the seemingly good ones. Please note that without the labeled negative pairs, we need to generate negative pairs with randomly chosen responses, which in most of the cases are too easy to differentiate by the ranking model and cannot fully tune the model parameters.
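A small sketch of that difference in negative sampling, with an illustrative data layout (the "suitable"/"unsuitable" field names are my own, not the dataset's actual schema):

```python
import random

def make_triples(labeled_posts, response_bank, hard_negatives=True):
    """Build (post, positive, negative) training triples.
    labeled_posts: list of dicts like {"post": ..., "suitable": [...], "unsuitable": [...]}."""
    triples = []
    for item in labeled_posts:
        for y_pos in item["suitable"]:
            if hard_negatives and item["unsuitable"]:
                # Labeled negatives were top-ranked by some baseline retriever:
                # "seemingly good" responses that are hard to tell apart from positives.
                y_neg = random.choice(item["unsuitable"])
            else:
                # Random negatives from the response bank: usually trivially easy to reject.
                y_neg = random.choice(response_bank)
            triples.append((item["post"], y_pos, y_neg))
    return triples
```

With hard_negatives=True, the negatives are exactly the "seemingly good" candidates the quote describes, which is what forces the model parameters to move.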

3. Testing Automatic Response Models

We could simply use the original post-response pairs as positives and treat all responses that do not belong to the post as negatives.

(Question: when sampling randomly, is it ensured that the sampled response belongs to a different query post? ...)

But this approach suffers from the false-negative problem: among the retrieved reduced candidates, responses that are not the original reply are usually still fairly good replies. (This is exactly why the human labels are collected the way they are; it matches the application scenario, where baseline models first select candidates and a ranker then scores them. Now it all clicks.)

In testing a retrieval-based system, although we can simply use the original responses associated with the query post as positive and treat all the others as negative, this strategy suffers from the problem of spurious negative examples. In other words, with a reasonably good model, the retrieved responses are often good even if they are not the original ones, which brings significant bias to the evaluation.
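A minimal sketch of evaluating on the labeled candidate sets instead, using P@1 (the data layout and the score function are assumptions for illustration):

```python
def precision_at_1(labeled_sets, score):
    """labeled_sets: list of (post, [(response, is_suitable), ...]) built from the labeled pairs.
    score(post, response) -> float is the model under evaluation."""
    hits = 0
    for post, candidates in labeled_sets:
        best = max(candidates, key=lambda c: score(post, c[0]))
        hits += int(best[1])  # is the top-ranked candidate judged suitable by a human?
    return hits / len(labeled_sets)
```

Because every candidate here was top-ranked by some baseline and then explicitly judged by a human, a candidate counted as negative really is unsuitable, which avoids the spurious-negative bias described above.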

IV. Retrieval-based Response Model

The method is rather dated, but there is some intuition worth taking away.

In a retrieval-based response model, for a given post x we pick from the candidate set the response with the highest ranking score, where the score is an ensemble of several individual matching features, i.e., a weighted combination score(x, y) = Σ_i ω_i · Φ_i(x, y), with y standing for a candidate response.

Training: y+ is taken from the labeled suitable responses, and y− either from the labeled unsuitable responses or from random samples. As discussed above, the reduced candidate set was selected by baseline retrieval models before human labelling, so a labeled negative response already has a high matching score under many retrieval models; in the end, using the human-labeled negatives works a bit better.

(Note: the positive responses are not simply the post's own replies; they are instead retrieved from the response bank... To be precise, the retrieval may return the post's own replies or other responses, since it is not an exact match and what comes back is not necessarily identical to the original replies. And as analysed earlier, directly taking a post together with a randomly chosen one of its replies still has a fair chance of being judged negative by a human annotator, so it would not necessarily be a good positive sample.)

From the labeled data, we can extract triples (x, y+, y−) and require that score(x, y+) > score(x, y−). Apparently y+ can be selected from the labeled positive responses of x, while y− can be sampled either from the labeled negative ones or from randomly selected responses. Since the manually labeled negative instances are top-ranked candidates according to some individual retrieval model (see Section 5.1), they generally yield slightly better results.
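A minimal sketch of this training step under the linear-score assumption above: a hinge loss over the extracted triples, with features(x, y) as a placeholder feature function (e.g., the TF-IDF feature sketched in Part III plus others).

```python
import numpy as np

def hinge_train(triples, features, dim, lr=0.1, margin=1.0, epochs=10):
    """Learn weights w so that w·Φ(x, y+) > w·Φ(x, y−) + margin on the labeled triples.
    triples: iterable of (x, y_pos, y_neg); features(x, y) -> np.ndarray of length dim."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_pos, y_neg in triples:
            diff = features(x, y_pos) - features(x, y_neg)
            if w @ diff < margin:   # constraint violated: the negative scores too high
                w += lr * diff      # (sub)gradient step on the hinge loss
    return w

def score(w, features, x, y):
    """score(x, y) = Σ_i w_i · Φ_i(x, y): the linear ensemble of matching features."""
    return float(w @ features(x, y))
```

Feeding it triples whose negatives come from the labeled unsuitable candidates (rather than random responses) gives the harder constraints that, per the paper, yield slightly better results.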

V. Metrics

 
