A Roundup of English Datasets for Text Classification (Sentiment Analysis)

20 Newsgroups Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The data is organized into 20 different newsgroups, each corresponding to a different topic.

Dataset link: http://qwone.com/~jason/20Newsgroups/
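
The collection is also bundled with scikit-learn, so a quick bag-of-words baseline can be built without downloading the archive by hand. A minimal sketch, assuming scikit-learn is installed (the fetch call downloads the data on first use):

```python
# A minimal 20 Newsgroups baseline with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Strip headers/footers/quotes so the model learns from the message body only.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_features=50000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)
print("test accuracy:", accuracy_score(test.target, clf.predict(X_test)))
```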

Reuters-21578 Text Categorization Collection Dataset

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.
Dataset link: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
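
NLTK ships a copy of this collection (the ApteMod split, to the best of my knowledge), which makes it easy to inspect the multi-label topic structure. A small sketch, assuming nltk is installed:

```python
# Browsing Reuters-21578 through NLTK's corpus reader.
import nltk
nltk.download("reuters")  # one-time download of the corpus

from nltk.corpus import reuters

print(len(reuters.fileids()), "documents")      # ids look like 'training/...' or 'test/...'
print(len(reuters.categories()), "categories")  # topic labels such as 'earn', 'acq', ...

doc_id = reuters.fileids()[0]
print(doc_id, reuters.categories(doc_id))       # a document can carry several topics
print(reuters.raw(doc_id)[:200])                # first 200 characters of the raw text
```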

Spambase Dataset

Classifying Email as Spam or Non-Spam
Dataset link: https://archive.ics.uci.edu/ml/datasets/Spambase
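
The UCI page distributes the data as a plain spambase.data file with 57 numeric features per email and a final 0/1 spam label. A minimal sketch, assuming that file has been downloaded locally from the page above:

```python
# Spam/non-spam baseline on Spambase (57 numeric features + binary label in the last column).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumes spambase.data sits next to this script; the file has no header row.
df = pd.read_csv("spambase.data", header=None)
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```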

1996 English Broadcast News Speech Dataset
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA “HUB4” Project on continuous speech recognition in the broadcast domain.

Dataset link: https://catalog.ldc.upenn.edu/LDC97S44

Text Classification Datasets on Google Drive
From Zhang et al., 2015: a collection of eight text classification datasets, commonly used as benchmarks for new text classification baselines. Sample sizes range from 120K to 3.6M examples, and the tasks range from binary to 14-class problems. The datasets are drawn from DBPedia, Amazon, Yelp, Yahoo! and AG.

Dataset link:
https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M
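
Each dataset in this release ships as train.csv / test.csv without a header row. The sketch below assumes the AG News-style layout of (class index starting at 1, title, description); folder names and the exact column layout differ from dataset to dataset, so adjust accordingly:

```python
# Loading one of the Zhang et al. (2015) CSV datasets, assuming an AG News-style layout:
# no header row, columns = (class index starting at 1, title, description).
import pandas as pd

def load_split(path):
    df = pd.read_csv(path, header=None, names=["label", "title", "description"])
    texts = (df["title"] + " " + df["description"]).tolist()
    labels = (df["label"] - 1).tolist()  # shift class indices to start at 0
    return texts, labels

# Hypothetical local paths after unpacking the Google Drive archive.
train_texts, train_labels = load_split("ag_news_csv/train.csv")
test_texts, test_labels = load_split("ag_news_csv/test.csv")
print(len(train_texts), "train /", len(test_texts), "test examples")
```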

The Corpus of Linguistic Acceptability (CoLA) Dataset
A grammatical acceptability dataset released by New York University. The task is to decide whether a given sentence is grammatically acceptable, so CoLA is a single-sentence binary text classification task.

Dataset link: https://nyu-mll.github.io/CoLA/

SST Dataset
A sentiment analysis dataset released by Stanford University, built on movie reviews, so SST is a single-sentence text classification task (SST-2 is a binary task, while SST-5 distinguishes five finer-grained sentiment classes).

Dataset link: https://nlp.stanford.edu/sentiment/index.html

MRPC Dataset
Released by Microsoft; the task is to judge whether two given sentences have the same meaning, making it a sentence-pair binary text classification task.

Dataset link: https://www.microsoft.com/en-us/download/details.aspx?id=52398

STS-B Dataset
Drawn mainly from a SemEval shared task held over several years (the data is also included in SentEval). The semantic similarity of two sentences is scored on a 0-5 scale, so the task is essentially regression, but it can still be approached with classification methods and is therefore often grouped with the sentence-pair classification tasks.

Dataset link: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

QQP Dataset
Released by Quora; the task is to decide whether two questions are semantically equivalent, a sentence-pair binary text classification task.

Dataset link: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

MNLI Dataset
Released by New York University, this is a textual entailment task: given a premise, decide whether a hypothesis holds. Because MNLI's selling point is that it gathers text from many different domains and styles, it comes in matched and mismatched versions: in the former the training and test sets come from the same sources, while in the latter they do not. The task is a sentence-pair three-class classification problem.

Dataset link: http://www.nyu.edu/projects/bowman/multinli/
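
The six tasks above (CoLA, SST-2, MRPC, STS-B, QQP and MNLI) are all part of the GLUE benchmark, so a convenient alternative to the individual download pages is the GLUE mirror on the Hugging Face Hub. A minimal loading sketch, assuming the Hugging Face datasets library is installed:

```python
# Fetching the GLUE-style tasks above via the Hugging Face `datasets` library.
from datasets import load_dataset

cola = load_dataset("glue", "cola")   # single-sentence acceptability, labels {0, 1}
sst2 = load_dataset("glue", "sst2")   # single-sentence sentiment, labels {0, 1}
mrpc = load_dataset("glue", "mrpc")   # sentence-pair paraphrase, labels {0, 1}
stsb = load_dataset("glue", "stsb")   # sentence-pair similarity, continuous score 0-5
qqp  = load_dataset("glue", "qqp")    # question-pair duplicate detection, labels {0, 1}
mnli = load_dataset("glue", "mnli")   # premise/hypothesis entailment, labels {0, 1, 2}

print(sst2["train"][0])               # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
# MNLI keeps matched and mismatched evaluation sets separate:
print(mnli["validation_matched"].num_rows, mnli["validation_mismatched"].num_rows)
```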

Large Movie Review Dataset
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Dataset link: http://ai.stanford.edu/~amaas/data/sentiment/
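
This corpus (commonly called IMDB) is also mirrored on the Hugging Face Hub, including the extra unlabeled portion. A quick sketch, assuming the datasets library is installed:

```python
# Loading the Large Movie Review (IMDB) dataset from the Hugging Face Hub.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)  # splits: train (25,000), test (25,000), unsupervised (50,000 unlabeled)

sample = imdb["train"][0]
print(sample["label"], sample["text"][:200])  # label: 0 = negative, 1 = positive
```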

WebKB Dataset
The documents in the WebKB are webpages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997, manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

Dataset link: http://www.webkb.org/

AG News Dataset
The AG News corpus is drawn from AG's collection of news articles on the web, keeping only the 4 largest classes. The dataset contains 30,000 training examples and 1,900 test examples per class. Models are evaluated on error rate (lower is better).

Dataset links
Full official corpus:
http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

Classification task version:
https://github.com/mhjabreel/CharCNN/tree/master/data/
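
The classification-task version (120,000 training and 7,600 test articles in total, i.e. 30,000/1,900 per class) is also mirrored on the Hugging Face Hub. A small sketch, assuming the datasets library is installed:

```python
# AG News (4 classes: World, Sports, Business, Sci/Tech) from the Hugging Face Hub.
from datasets import load_dataset

ag = load_dataset("ag_news")
print(ag["train"].num_rows, ag["test"].num_rows)  # 120000 / 7600
print(ag["train"].features["label"].names)        # class names
print(ag["train"][0])                             # {'text': ..., 'label': ...}
```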

DBpedia Dataset
DBpedia provides three different classification schemata for things.

  • Wikipedia Categories are represented using the SKOS vocabulary and DCMI terms.
  • The YAGO Classification is derived from the Wikipedia category system using WordNet. Please refer to Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia (PDF) for more details.
  • WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.

Dataset link: https://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets#h434-6
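
For a feel of the first schema, the Wikipedia-category classification of a resource can be inspected through DBpedia's public SPARQL endpoint, where dct:subject links a resource to its categories. A hedged sketch using the SPARQLWrapper package; endpoint availability and the exact result set may vary:

```python
# Querying the Wikipedia categories (dct:subject) of a DBpedia resource
# via the public SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE { dbr:Machine_learning dct:subject ?category } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["category"]["value"])
```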
