A Roundup of English Datasets for Text Classification (Sentiment Analysis)

20 Newsgroups Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The data is organized into 20 different newsgroups, each corresponding to a different topic.

Dataset link: http://qwone.com/~jason/20Newsgroups/
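
The collection is also bundled with scikit-learn, so a quick bag-of-words baseline can be built without downloading the archive by hand. A minimal sketch, assuming scikit-learn is installed (the fetch call downloads the data on first use):

```python
# A minimal 20 Newsgroups baseline with scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Strip headers/footers/quotes so the model learns from the message body only.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_features=50000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)
print("test accuracy:", accuracy_score(test.target, clf.predict(X_test)))
```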

Reuters-21578 Text Categorization Collection Dataset

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.
Dataset link: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
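
NLTK ships a copy of this collection (the ApteMod split, to the best of my knowledge), which makes it easy to inspect the multi-label topic structure. A small sketch, assuming nltk is installed:

```python
# Browsing Reuters-21578 through NLTK's corpus reader.
import nltk
nltk.download("reuters")  # one-time download of the corpus

from nltk.corpus import reuters

print(len(reuters.fileids()), "documents")      # ids look like 'training/...' or 'test/...'
print(len(reuters.categories()), "categories")  # topic labels such as 'earn', 'acq', ...

doc_id = reuters.fileids()[0]
print(doc_id, reuters.categories(doc_id))       # a document can carry several topics
print(reuters.raw(doc_id)[:200])                # first 200 characters of the raw text
```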

Spambase Dataset

Classifying Email as Spam or Non-Spam
Dataset link: https://archive.ics.uci.edu/ml/datasets/Spambase
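
The UCI page distributes the data as a plain spambase.data file with 57 numeric features per email and a final 0/1 spam label. A minimal sketch, assuming that file has been downloaded locally from the page above:

```python
# Spam/non-spam baseline on Spambase (57 numeric features + binary label in the last column).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumes spambase.data sits next to this script; the file has no header row.
df = pd.read_csv("spambase.data", header=None)
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```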

1996 English Broadcast News Speech Dataset
The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA “HUB4” Project on continuous speech recognition in the broadcast domain.

Dataset link: https://catalog.ldc.upenn.edu/LDC97S44

Text Classification Datasets on Google Drive
From Zhang et al., 2015: a collection of eight text classification datasets, commonly used as benchmarks for new text classification baselines. Sample sizes range from 120K to 3.6M examples, and the tasks range from binary to 14-class problems. The datasets are drawn from DBPedia, Amazon, Yelp, Yahoo! and AG.

Dataset link:
https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M
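
Each dataset in this release ships as train.csv / test.csv without a header row. The sketch below assumes the AG News-style layout of (class index starting at 1, title, description); folder names and the exact column layout differ from dataset to dataset, so adjust accordingly:

```python
# Loading one of the Zhang et al. (2015) CSV datasets, assuming an AG News-style layout:
# no header row, columns = (class index starting at 1, title, description).
import pandas as pd

def load_split(path):
    df = pd.read_csv(path, header=None, names=["label", "title", "description"])
    texts = (df["title"] + " " + df["description"]).tolist()
    labels = (df["label"] - 1).tolist()  # shift class indices to start at 0
    return texts, labels

# Hypothetical local paths after unpacking the Google Drive archive.
train_texts, train_labels = load_split("ag_news_csv/train.csv")
test_texts, test_labels = load_split("ag_news_csv/test.csv")
print(len(train_texts), "train /", len(test_texts), "test examples")
```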

The Corpus of Linguistic Acceptability (CoLA) Dataset
A grammatical acceptability dataset released by New York University. The task is to decide whether a given sentence is grammatically acceptable, so CoLA is a single-sentence binary text classification task.

Dataset link: https://nyu-mll.github.io/CoLA/

SST Dataset
A sentiment analysis dataset released by Stanford University, built on movie reviews, so SST is a single-sentence text classification task (SST-2 is a binary task, while SST-5 distinguishes five finer-grained sentiment classes).

Dataset link: https://nlp.stanford.edu/sentiment/index.html

MRPC Dataset
Released by Microsoft; the task is to judge whether two given sentences have the same meaning, making it a sentence-pair binary text classification task.

Dataset link: https://www.microsoft.com/en-us/download/details.aspx?id=52398

STS-B Dataset
Drawn mainly from a SemEval shared task held over several years (the data is also included in SentEval). The semantic similarity of two sentences is scored on a 0-5 scale, so the task is essentially regression, but it can still be approached with classification methods and is therefore often grouped with the sentence-pair classification tasks.

Dataset link: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

QQP Dataset
Released by Quora; the task is to decide whether two questions are semantically equivalent, a sentence-pair binary text classification task.

Dataset link: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

MNLI Dataset
Released by New York University, this is a textual entailment task: given a premise, decide whether a hypothesis holds. Because MNLI's selling point is that it gathers text from many different domains and styles, it comes in matched and mismatched versions: in the former the training and test sets come from the same sources, while in the latter they do not. The task is a sentence-pair three-class classification problem.

Dataset link: http://www.nyu.edu/projects/bowman/multinli/
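
The six tasks above (CoLA, SST-2, MRPC, STS-B, QQP and MNLI) are all part of the GLUE benchmark, so a convenient alternative to the individual download pages is the GLUE mirror on the Hugging Face Hub. A minimal loading sketch, assuming the Hugging Face datasets library is installed:

```python
# Fetching the GLUE-style tasks above via the Hugging Face `datasets` library.
from datasets import load_dataset

cola = load_dataset("glue", "cola")   # single-sentence acceptability, labels {0, 1}
sst2 = load_dataset("glue", "sst2")   # single-sentence sentiment, labels {0, 1}
mrpc = load_dataset("glue", "mrpc")   # sentence-pair paraphrase, labels {0, 1}
stsb = load_dataset("glue", "stsb")   # sentence-pair similarity, continuous score 0-5
qqp  = load_dataset("glue", "qqp")    # question-pair duplicate detection, labels {0, 1}
mnli = load_dataset("glue", "mnli")   # premise/hypothesis entailment, labels {0, 1, 2}

print(sst2["train"][0])               # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
# MNLI keeps matched and mismatched evaluation sets separate:
print(mnli["validation_matched"].num_rows, mnli["validation_mismatched"].num_rows)
```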

Large Movie Review Dataset
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Dataset link: http://ai.stanford.edu/~amaas/data/sentiment/
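
This corpus (commonly called IMDB) is also mirrored on the Hugging Face Hub, including the extra unlabeled portion. A quick sketch, assuming the datasets library is installed:

```python
# Loading the Large Movie Review (IMDB) dataset from the Hugging Face Hub.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)  # splits: train (25,000), test (25,000), unsupervised (50,000 unlabeled)

sample = imdb["train"][0]
print(sample["label"], sample["text"][:200])  # label: 0 = negative, 1 = positive
```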

WebKB Dataset
The documents in the WebKB are webpages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997, manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

Dataset link: http://www.webkb.org/

AG News Dataset
The AG News corpus is drawn from AG's collection of news articles on the web, keeping only the 4 largest classes. The dataset contains 30,000 training examples and 1,900 test examples per class. Models are evaluated on error rate (lower is better).

Dataset links
Full official corpus:
http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

Classification task version:
https://github.com/mhjabreel/CharCNN/tree/master/data/
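
The classification-task version (120,000 training and 7,600 test articles in total, i.e. 30,000/1,900 per class) is also mirrored on the Hugging Face Hub. A small sketch, assuming the datasets library is installed:

```python
# AG News (4 classes: World, Sports, Business, Sci/Tech) from the Hugging Face Hub.
from datasets import load_dataset

ag = load_dataset("ag_news")
print(ag["train"].num_rows, ag["test"].num_rows)  # 120000 / 7600
print(ag["train"].features["label"].names)        # class names
print(ag["train"][0])                             # {'text': ..., 'label': ...}
```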

DBpedia Dataset
DBpedia provides three different classification schemata for things.

  • Wikipedia Categories are represented using the SKOS vocabulary and DCMI terms.
  • The YAGO Classification is derived from the Wikipedia category system using WordNet. Please refer to Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia (PDF) for more details.
  • WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.

Dataset link: https://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets#h434-6
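
For a feel of the first schema, the Wikipedia-category classification of a resource can be inspected through DBpedia's public SPARQL endpoint, where dct:subject links a resource to its categories. A hedged sketch using the SPARQLWrapper package; endpoint availability and the exact result set may vary:

```python
# Querying the Wikipedia categories (dct:subject) of a DBpedia resource
# via the public SPARQL endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?category WHERE { dbr:Machine_learning dct:subject ?category } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["category"]["value"])
```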
