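Official website: http://clic.cimec.unitn.it/composes/sick.html
SICK stands for Sentences Involving Compositional Knowledge.
The SICK dataset contains about 10,000 English sentence pairs, built from two existing paraphrase datasets:
one is the 8K ImageFlickr dataset (http://nlp.cs.illinois.edu/HockenmaierGroup/data.html);
the other is the SemEval-2012 Semantic Textual Similarity (STS) video description dataset (http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data).
Each sentence pair is annotated for relatedness in meaning and for the entailment relation between the two sentences.
SICK is released under the following license: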
Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US)
When using SICK in published research, please cite:
M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi and R. Zamparelli. 2014. A SICK cure
for the evaluation of compositional distributional semantic models. Proceedings of LREC 2014,
Reykjavik (Iceland): ELRA.
The SICK dataset was used in SemEval 2014 - Task 1:
Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment
File structure: tab-separated text file
Field definitions:
- pair_ID: sentence pair ID
- sentence_A: sentence A
- sentence_B: sentence B
- entailment_label: textual entailment gold label (NEUTRAL, ENTAILMENT, or CONTRADICTION)
- relatedness_score: semantic relatedness gold score (on a 1-5 continuous scale)
- entailment_AB: entailment for the A-B order (A_neutral_B, A_entails_B, or A_contradicts_B)
- entailment_BA: entailment for the B-A order (B_neutral_A, B_entails_A, or B_contradicts_A)
- sentence_A_original: original sentence from which sentence A is derived
- sentence_B_original: original sentence from which sentence B is derived
- sentence_A_dataset: dataset from which the original sentence A was extracted (FLICKR vs. SEMEVAL)
- sentence_B_dataset: dataset from which the original sentence B was extracted (FLICKR vs. SEMEVAL)
- SemEval_set: set including the sentence pair in SemEval 2014 Task 1 (TRIAL, TRAIN, or TEST)
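
The fields listed above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: the helper names read_sick and split_by_semeval_set are hypothetical, and the two sample rows are invented for illustration (they are not taken from SICK and carry only a subset of the columns). It assumes the first line of the file is a header holding the field names.

```python
import csv
import io
from collections import defaultdict


def read_sick(handle):
    """Parse tab-separated SICK rows from an open text handle (hypothetical helper)."""
    # QUOTE_NONE keeps any quote characters inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # relatedness_score is a continuous 1-5 value, so store it as a float.
        row["relatedness_score"] = float(row["relatedness_score"])
        rows.append(row)
    return rows


def split_by_semeval_set(rows):
    """Group rows by the SemEval_set field (TRIAL, TRAIN, or TEST)."""
    splits = defaultdict(list)
    for row in rows:
        splits[row["SemEval_set"]].append(row)
    return dict(splits)


# Invented sample data with a subset of the columns, for illustration only.
SAMPLE = (
    "pair_ID\tsentence_A\tsentence_B\tentailment_label\t"
    "relatedness_score\tSemEval_set\n"
    "1\tA dog is running\tA dog is moving\tENTAILMENT\t4.5\tTRAIN\n"
    "2\tA man is singing\tNobody is singing\tCONTRADICTION\t3.6\tTEST\n"
)

rows = read_sick(io.StringIO(SAMPLE))
splits = split_by_semeval_set(rows)
print(sorted(splits))
print(rows[0]["entailment_label"], rows[0]["relatedness_score"])
```

To read the real file, pass an open handle instead of the in-memory sample, for example open(path, encoding="utf-8", newline=""), where path points to wherever the downloaded dataset file was saved.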