2012 - train
- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
750 pairs of sentences.
- MSR-Video, Microsoft Research Video Description Corpus
http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
750 pairs of sentences.
- SMTeuroparl: WMT2008 development dataset (Europarl section)
http://www.statmt.org/wmt08/shared-evaluation-task.html
734 pairs of sentences.
2012 - test
- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
750 pairs of sentences.
- MSR-Video, Microsoft Research Video Description Corpus
http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
750 pairs of sentences.
- SMTeuroparl: WMT2008 development dataset (Europarl section)
http://www.statmt.org/wmt08/shared-evaluation-task.html
459 pairs of sentences.
In addition, it contains two surprise datasets comprising the
following collections:
- SMTnews: news conversation sentence pairs from WMT
399 pairs of sentences.
- OnWN: pairs of sentences where the first comes from Ontonotes and
the second from a WordNet definition.
750 pairs of sentences.
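The pair counts listed above for the 2012 train and test sets can be
tallied with a short script. This is only a sketch: the dictionary and
function names are hypothetical, and the numbers are copied by hand from
the listing above rather than read from the data files.

```python
# Pair counts copied from the 2012 listing above.
# The names below are illustrative and not part of the STS distribution.
train_2012 = {
    "MSR-Paraphrase": 750,
    "MSR-Video": 750,
    "SMTeuroparl": 734,
}
test_2012 = {
    "MSR-Paraphrase": 750,
    "MSR-Video": 750,
    "SMTeuroparl": 459,
    "SMTnews": 399,  # surprise dataset
    "OnWN": 750,     # surprise dataset
}


def total_pairs(counts):
    """Return the total number of sentence pairs across all datasets."""
    return sum(counts.values())


print("2012 train:", total_pairs(train_2012))  # 2234
print("2012 test:", total_pairs(test_2012))    # 3108
```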
2013 - test
- STS.input.headlines.txt: We used headlines mined from several news
sources by European Media Monitor using the RSS feed.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
and OntoNotes.
- STS.input.FNWN.txt: The sentences are sense definitions from WordNet
and FrameNet. Note that some FrameNet definitions involve more than
one sentence.
// missing
- STS.input.SMT.txt: This SMT dataset comes from DARPA GALE HTER and
HyTER. In each pair, one sentence is an MT output and the other is a
reference translation. The reference is generated from human post
editing (provided by LDC), is an original human reference (provided
by LDC), or is a human-generated reference based on FSM as described
in (Dreyer and Marcu, NAACL 2012). The reference comes from
post-edited translations.
2014 - test
- STS.input.image.txt: The Image Descriptions data set is a subset of
the PASCAL VOC-2008 data set (Rashtchian et al., 2010). The PASCAL
VOC-2008 data set consists of 1,000 images and has been used by a
number of image description systems. The image captions of the data
set are released under a Creative Commons Attribution-ShareAlike
license; the descriptions themselves are free.
- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
and OntoNotes. 5 pairs of sentences.
- STS.input.tweet-news.txt: The tweet-news data set is a subset of the
Linking-Tweets-to-News data set (Guo et al., 2013), which consists
of 34,888 tweets and 12,704 news articles. The tweets are the
comments on the news articles. The news sentences are the titles of
news articles.
- STS.input.deft-news.txt: A subset of news article data in the DEFT
project.
- STS.input.deft-forum.txt: A subset of discussion forum data in the
DEFT project.
- STS.input.headlines.txt: We used headlines mined from several news
sources by European Media Monitor using the RSS feed.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
2015 - test (with some raw data)
- STS.input.image.txt: The Image Descriptions data set is a subset of
the Flickr dataset presented in (Rashtchian et al., 2010), which
consists of 8108 hand-selected images from Flickr, depicting
actions and events of people or animals, with five captions per
image. The image captions of the data set are released under a
Creative Commons Attribution-ShareAlike license.
- STS.input.headlines.txt: We used headlines mined from several news
sources by European Media Monitor using their RSS feed from April 2,
2013 to July 28, 2014. This period was selected to avoid overlap
with STS 2014 data.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
- STS.input.answers-students.txt: The source of these pairs is the
BEETLE corpus (Dzikovska et al., 2010), a question-answer data
set collected and annotated during the evaluation of the BEETLE II
tutorial dialogue system. The BEETLE II system is an intelligent
tutoring engine that teaches students basic electricity and
electronics. The corpus was used in the student response analysis
task of SemEval-2013. Given a question, a known correct "reference
answer" and the "student answer", the goal of the task was to assess
student answers as correct, contradictory or incorrect (partially
correct, irrelevant or not in the domain). For STS, we selected
pairs of answers made up of single sentences.
- STS.input.answers-forum.txt: This data set consists of paired
answers collected from the Stack Exchange question and answer
websites (http://stackexchange.com/). Some of the paired answers are
in response to the same question, while others are in response to
different questions. Each answer in the pair consists of a statement
composed of a single sentence or sentence fragment. For
multi-sentence answers, we extract the single sentence from the
larger answer that appears to best summarize the answer. The Stack
Exchange data license requires that we provide additional metadata
that allows participants to recover the source of each paired
answer. Systems submitted to the shared task must not make use of
this metadata in any way to assign STS scores or to otherwise inform
the operation of the system.
- STS.input.belief: The data is collected from the DEFT Committed
Belief Annotation dataset (LDC2014E55). All source documents are
English discussion forum data.