A Collection of NLP Datasets (continuously updated)



Semantic Similarity



Semantic&Syntactic [3]: contains 8,869 semantic questions and 10,675 syntactic questions. It was used in T. Mikolov's word2vec work. The questions all take a form like “man is to (woman) as king is to queen” or “predict is to (predicting) as dance is to dancing”.
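Analogy questions of this kind are commonly answered with vector arithmetic on word embeddings: take the offset between the first pair and apply it to the third word. The following is a minimal sketch of that idea, not the evaluation script from [3]; the 2-d vectors in `toy_vectors` and the helper names `cosine` and `solve_analogy` are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors made up for this example (axis 0: royalty, axis 1: gender).
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}

answer = solve_analogy(toy_vectors, "man", "woman", "king")
print(answer)
```

With real embeddings, accuracy on the dataset would be the fraction of questions for which the returned word matches the expected one.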


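IMDB [4]: this dataset has three parts: a training set, a test set, and an unlabeled set. The training and test sets are used to train and evaluate text classification models, and the unlabeled set is used to train word vectors. It is used for sentiment analysis.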
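As a rough sketch of how the three parts might be loaded, the code below assumes the archive unpacks into `train/pos`, `train/neg`, `train/unsup`, `test/pos`, and `test/neg` folders with one review per `.txt` file. That layout, and the function names `read_reviews` and `load_imdb`, are assumptions for illustration; check them against the copy you download.

```python
from pathlib import Path

def read_reviews(folder):
    """Read every .txt file in a folder; a missing folder yields an empty list."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return [p.read_text(encoding="utf-8") for p in sorted(folder.glob("*.txt"))]

def load_imdb(root):
    """Return (train, test, unlabeled) from an assumed directory layout.

    train and test are (texts, labels) pairs with 0 = negative, 1 = positive;
    unlabeled is a plain list of review texts.
    """
    root = Path(root)
    labeled = {}
    for split in ("train", "test"):
        texts, labels = [], []
        for folder_name, label in (("neg", 0), ("pos", 1)):
            for text in read_reviews(root / split / folder_name):
                texts.append(text)
                labels.append(label)
        labeled[split] = (texts, labels)
    # The unlabeled reviews carry no sentiment label; they are only useful
    # for unsupervised steps such as training word vectors.
    unlabeled = read_reviews(root / "train" / "unsup")
    return labeled["train"], labeled["test"], unlabeled
```

Keeping the unlabeled reviews in a separate return value makes it harder to mix them into classifier training by accident.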

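Stanford Sentiment Treebank [5]: this dataset is relatively small and has been used for CNN-based sentiment classification. It can also be used for experiments on semantic representations.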

Google Snippets [6]: contains 10,060 training samples and 2,280 test samples in 8 classes. Each snippet has 18.07 words on average.
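Figures such as the class count and the average snippet length are easy to recompute once the data is loaded. The sketch below assumes each line holds one snippet as whitespace-separated tokens with the class label as the last token; that format, the sample lines, and the function name `snippet_stats` are assumptions for illustration.

```python
from collections import Counter

def snippet_stats(lines):
    """Return (average words per snippet, Counter of class labels)."""
    label_counts = Counter()
    total_words = 0
    num_snippets = 0
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank lines or lines with no text before the label
        *words, label = tokens  # assumed format: label is the last token
        label_counts[label] += 1
        total_words += len(words)
        num_snippets += 1
    average = total_words / num_snippets if num_snippets else 0.0
    return average, label_counts

# Made-up sample lines, not taken from the real dataset.
sample_lines = [
    "stock market trading business",
    "football league match score sports",
    "gene protein health",
]
average_length, labels = snippet_stats(sample_lines)
print(average_length, len(labels))
```

The label token is excluded from the word count here; whether the published average was computed the same way is not stated in the source.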


Sentiment analysis

MR: Sentiment polarity datasets built from movie reviews. URL

  • document-level: polarity dataset v2.0: 1000 positive and 1000 negative processed reviews.
  • sentence-level: sentence polarity dataset v1.0: 5331 positive and 5331 negative processed sentences/snippets.
  • Sentiment-scale datasets: scale dataset v1.0: a collection of documents whose labels come from a rating scale.
  • Subjectivity dataset v1.0: 5000 subjective and 5000 objective processed sentences.

Subj: Subjectivity dataset. URL. This corpus is annotated for whether each statement is subjective.

The corpus consists of the top 20 results returned by the Yahoo! search engine in response to each of a set of 69 queries containing the word “review.” The queries were drawn from the publicly available list of real MSN users’ queries released for the 2005 KDD Cup competition. Note that “sales pitches” were marked objective on the premise that they represent biased reviews that users might wish to avoid seeing.

CR: Customer review dataset (Hu and Liu, 2004). URL

This dataset consists of reviews of five electronics products downloaded from Amazon and Cnet. The sentences have been manually labeled as to whether an opinion is expressed, and if so, what feature from a pre-defined list is being evaluated. An addendum with nine products is also available (∼liub/FBS/Reviews-9-products.rar). The curator, Bing Liu, also distributes a comparative-sentence dataset that is available by request.

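MPQA: Opinion polarity dataset (Wiebe et al., 2005). URL. This corpus contains 535 news articles from a variety of sources, manually annotated at the sentence and sub-sentence level for opinions and other private states (e.g., beliefs, emotions, sentiments, and speculation). The annotation scheme is described in detail in Wiebe et al., "Annotating expressions of opinions and emotions in language".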



CoNLL03: used for NER (named entity recognition) [9].

Wall Street Journal: used for the POS (part-of-speech) tagging task [10]. Since it is in fact open-domain text, it can be used for many other problems as well.


  1. L. Finkelstein et al. Placing search in context: The concept revisited. TOIS, 2002.
  2. T. K. Landauer and S. T. Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 1997.
  3. T. Mikolov et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
  4. A. L. Maas et al. Learning word vectors for sentiment analysis. ACL, 2011.
  5. R. Socher et al. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 2013.
  6. Xuan-Hieu Phan et al. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. WWW, 2008.
  7. Xin Li and Dan Roth. Learning question classifiers. COLING, 2002.
  8. B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2008.
  9. L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. CoNLL, 2009.
  10. K. Toutanova et al. Feature-rich part-of-speech tagging with a cyclic dependency network. NAACL-HLT, 2003.