STS Data Analysis

2012 - train

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
  750 pairs of sentences.

- MSR-Video, Microsoft Research Video Description Corpus
  750 pairs of sentences.

- SMTeuroparl: WMT2008 development dataset (Europarl section)
  734 pairs of sentences.

2012 - test

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
  750 pairs of sentences.

- MSR-Video, Microsoft Research Video Description Corpus
  750 pairs of sentences.

- SMTeuroparl: WMT2008 development dataset (Europarl section)
  459 pairs of sentences.

In addition, it contains two surprise datasets comprising the
following collections:

- SMTnews: news conversation sentence pairs from WMT
  399 pairs of sentences.

- OnWN: pairs of sentences where the first comes from Ontonotes and
  the second from a WordNet definition.
  750 pairs of sentences.
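
As a quick sanity check on the 2012 figures listed above, the per-dataset
pair counts can be summed per split. This is only a minimal sketch: the
dictionary layout and the helper name total_pairs are illustrative
choices, and the numbers are copied from the listing, not read from the
data files.

```python
# Pair counts copied by hand from the 2012 listing above.
STS2012_PAIRS = {
    "train": {
        "MSR-Paraphrase": 750,
        "MSR-Video": 750,
        "SMTeuroparl": 734,
    },
    "test": {
        "MSR-Paraphrase": 750,
        "MSR-Video": 750,
        "SMTeuroparl": 459,
        "SMTnews": 399,
        "OnWN": 750,
    },
}


def total_pairs(split):
    """Return the total number of sentence pairs listed for a split."""
    return sum(STS2012_PAIRS[split].values())


if __name__ == "__main__":
    for split in STS2012_PAIRS:
        print(split, total_pairs(split))
```

With the counts as listed, the train split sums to 2234 pairs and the
test split to 3108 pairs.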

2013 - test

- STS.input.headlines.txt: We used headlines mined from several news
  sources by European Media Monitor using the RSS feed.

- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
  and OntoNotes. 

- STS.input.FNWN.txt: The sentences are sense definitions from WordNet
  and FrameNet. Note that some FrameNet definitions involve more than
  one sentence.

// missing
- STS.input.SMT.txt: This SMT dataset comes from DARPA GALE HTER and
  HyTER, where one sentence is an MT output and the other is a
  reference translation. The reference is either generated from human
  post-editing (provided by LDC), an original human reference
  (provided by LDC), or a human-generated reference based on FSM as
  described in (Dreyer and Marcu, NAACL 2012). The reference comes
  from post-edited translations.

2014 - test

- STS.input.image.txt: The Image Descriptions data set is a subset of
  the PASCAL VOC-2008 data set (Rashtchian et al., 2010). The PASCAL
  VOC-2008 data set consists of 1,000 images and has been used by a
  number of image description systems. The image captions of the data
  set are released under a CreativeCommons Attribution-ShareAlike
  license; the descriptions themselves are free.

- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
  and OntoNotes. 5 pairs of sentences.

- STS.input.tweet-news.txt: The tweet-news data set is a subset of the
  Linking-Tweets-to-News data set (Guo et al., 2013), which consists
  of 34,888 tweets and 12,704 news articles.  The tweets are the
  comments on the news articles.  The news sentences are the titles of
  news articles.

- STS.input.deft-news.txt: A subset of news article data in the DEFT
  project.

- STS.input.deft-forum.txt: A subset of discussion forum data in the
  DEFT project.

- STS.input.headlines.txt: We used headlines mined from several news
  sources by European Media Monitor using the RSS feed. 

2015 - test (with some raw data)


- STS.input.image.txt: The Image Descriptions data set is a subset of
  the Flickr dataset presented in (Rashtchian et al., 2010), which
  consisted of 8108 hand-selected images from Flickr, depicting
  actions and events of people or animals, with five captions per
  image. The image captions of the data set are released under a
  CreativeCommons Attribution-ShareAlike license.

- STS.input.headlines.txt: We used headlines mined from several news
  sources by European Media Monitor using their RSS feed from April 2,
  2013 to July 28, 2014. This period was selected to avoid overlap
  with STS 2014 data.

- STS.input.answers-students.txt: The source of these pairs is the
  BEETLE corpus (Dzikovska et al., 2010), a question-answer data
  set collected and annotated during the evaluation of the BEETLE II
  tutorial dialogue system. The BEETLE II system is an intelligent
  tutoring engine that teaches students basic electricity and
  electronics. The corpus was used in the student response analysis
  task of SemEval-2013. Given a question, a known correct "reference
  answer" and the "student answer", the goal of the task was to assess
  student answers as correct, contradictory or incorrect (partially
  correct, irrelevant or not in the domain). For STS, we selected
  pairs of answers made up of single sentences.

- STS.input.answers-forum.txt: This data set consists of paired
  answers collected from the Stack Exchange question and answer
  websites. Some of the paired answers are
  in response to the same question, while others are in response to
  different questions. Each answer in the pair consists of a statement
  composed of a single sentence or sentence fragment. For
  multi-sentence answers, we extract the single sentence from the
  larger answer that appears to best summarize the answer. The Stack
  Exchange data license requires that we provide additional metadata
  that allows participants to recover the source of each paired
  answer. Systems submitted to the shared task must not make use of
  this metadata in any way to assign STS scores or to otherwise inform
  the operation of the system.

- STS.input.belief: The data is collected from the DEFT Committed
  Belief Annotation dataset (LDC2014E55).  All source documents are English
  Discussion Forum data.
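
For working with the STS.input.*.txt files listed above, a small reader
may be handy. The sketch below rests on an assumption that these notes
do not state: that each line of an input file holds the two sentences
of a pair separated by a tab. The function names read_sts_pairs and
load_sts_file are hypothetical helpers, not part of any STS tooling.
Any extra tab-separated fields are ignored, which also keeps metadata
columns (if a file carries them) out of the pairs.

```python
def read_sts_pairs(lines):
    """Yield (sentence1, sentence2) tuples from an iterable of lines.

    Assumption: each non-empty line is "sentence1<TAB>sentence2".
    Fields after the second one are ignored.
    """
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(
                f"line {line_no}: expected two tab-separated sentences"
            )
        yield fields[0].strip(), fields[1].strip()


def load_sts_file(path):
    """Read one STS input file into a list of sentence pairs."""
    with open(path, encoding="utf-8") as handle:
        return list(read_sts_pairs(handle))


if __name__ == "__main__":
    # Made-up example lines, not taken from any STS file.
    sample = ["first sentence\tsecond sentence\n", "\n"]
    print(list(read_sts_pairs(sample)))
```

If the real files turn out to use a different layout, only the split
logic in read_sts_pairs would need to change.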







