No holds barred: a Beihang PhD student has compiled 60 years of text classification surveys!!!

Text Classification papers/surveys (a curated summary of text classification resources), continuously updated...

See the GitHub repository for more:
https://github.com/xiaoqian19940510/text-classification-surveys

This repository contains resources for Natural Language Processing (NLP) with a focus on the task of Text Classification. The content is mainly drawn from the survey paper "A Survey on Text Classification: From Shallow to Deep Learning".

Table of Contents

  • Surveys

  • Deep Learning Models

  • Shallow Learning Models

  • Datasets

  • Evaluation Metrics

  • Future Research Challenges

  • Tools and Repos

Surveys

A Survey on Text Classification: From Shallow to Deep Learning, 2020, by Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S. Yu, Lifang He

Deep Learning Models

2020
SpanBERT: Improving pre-training by representing and predicting spans --- SpanBERT --- by Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy (Github)
ALBERT: A lite BERT for self-supervised learning of language representations --- ALBERT --- by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut (Github)
2019
RoBERTa: A robustly optimized BERT pretraining approach --- RoBERTa --- by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov (Github)
XLNet: Generalized autoregressive pretraining for language understanding --- XLNet --- by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le (Github)

With the ability to model bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling (e.g., GPT). However, because it relies on corrupting the input with masks, BERT neglects dependencies between the masked positions and suffers from a pretrain-finetune discrepancy. Weighing these pros and cons, this paper proposes XLNet, a generalized autoregressive pretraining method that (1) learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

Multi-task deep neural networks for natural language understanding --- MT-DNN --- by Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao (Github)

This paper presents a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that yields more general representations, helping it adapt to new tasks and domains. MT-DNN extends the approach by incorporating BERT, a pre-trained bidirectional transformer language model. MT-DNN obtains state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight of the nine GLUE tasks, pushing the GLUE benchmark to 82.7% (a 2.2% absolute improvement). Experiments on the SNLI and SciTail datasets demonstrate that, compared with the pre-trained BERT representations, the representations learned by MT-DNN adapt better to new domains when little in-domain labeled data is available. The code and pre-trained models will be released publicly.

BERT: pre-training of deep bidirectional transformers for language understanding --- BERT --- by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Github)
Graph convolutional networks for text classification --- TextGCN --- by Liang Yao, Chengsheng Mao, Yuan Luo (Github)
2018
Multi-grained attention network for aspect-level sentiment classification --- MGAN --- by Feifan Fan, Yansong Feng, Dongyan Zhao (Github)
Investigating capsule networks with dynamic routing for text classification --- TextCapsule --- by Min Yang, Wei Zhao, Jianbo Ye, Zeyang Lei, Zhou Zhao, Soufei Zhang (Github)
Constructing narrative event evolutionary graph for script event prediction --- SGNN --- by Zhongyang Li, Xiao Ding, Ting Liu (Github)
SGM: sequence generation model for multi-label classification --- SGM --- by Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, Houfeng Wang (Github)
Joint embedding of words and labels for text classification --- LEAM --- by Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin (Github)
Universal language model fine-tuning for text classification --- ULMFiT --- by Jeremy Howard, Sebastian Ruder (Github)
Large-scale hierarchical text classification with recursively regularized deep graph-cnn --- DGCNN --- by Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, Qiang Yang (Github)
Deep contextualized word representations --- ELMo --- by Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer (Github)
2017
Recurrent Attention Network on Memory for Aspect Sentiment Analysis --- RAM --- by Peng Chen, Zhongqian Sun, Lidong Bing, Wei Yang (Github)
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm --- DeepMoji --- by Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, Sune Lehmann (Github)
Interactive attention networks for aspect-level sentiment classification --- IAN --- by Dehong Ma, Sujian Li, Xiaodong Zhang, Houfeng Wang (Github)
Deep pyramid convolutional neural networks for text categorization --- DPCNN --- by Rie Johnson, Tong Zhang (Github)
TopicRNN: A recurrent neural network with long-range semantic dependency --- TopicRNN --- by Adji B. Dieng, Chong Wang, Jianfeng Gao, John Paisley (Github)
Adversarial training methods for semi-supervised text classification --- Miyato et al. --- by Takeru Miyato, Andrew M. Dai, Ian Goodfellow (Github)
Bag of tricks for efficient text classification --- FastText --- by Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov (Github)
2016
Long short-term memory-networks for machine reading --- LSTMN --- by Jianpeng Cheng, Li Dong, Mirella Lapata (Github)
Recurrent neural network for text classification with multi-task learning --- Multi-Task --- by Pengfei Liu, Xipeng Qiu, Xuanjing Huang (Github)
Hierarchical attention networks for document classification --- HAN --- by Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy (Github)
2015
Character-level convolutional networks for text classification --- CharCNN --- by Xiang Zhang, Junbo Zhao, Yann LeCun (Github)
Improved semantic representations from tree-structured long short-term memory networks --- Tree-LSTM --- by Kai Sheng Tai, Richard Socher, Christopher D. Manning (Github)
Deep unordered composition rivals syntactic methods for text classification --- DAN --- by Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, Hal Daumé III (Github)
Recurrent convolutional neural networks for text classification --- TextRCNN --- by Siwei Lai, Liheng Xu, Kang Liu, Jun Zhao (Github)
2014
Distributed representations of sentences and documents --- Paragraph-Vec --- by Quoc Le, Tomas Mikolov (Github)
A convolutional neural network for modelling sentences --- DCNN --- by Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom (Github)
Convolutional Neural Networks for Sentence Classification --- TextCNN --- by Yoon Kim (Github)
2013
Recursive deep models for semantic compositionality over a sentiment treebank --- RNTN --- by Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, Christopher Potts (Github)
2012
Semantic compositionality through recursive matrix-vector spaces --- MV-RNN --- by Richard Socher, Brody Huval, Christopher D. Manning, Andrew Y. Ng (Github)
2011
Semi-supervised recursive autoencoders for predicting sentiment distributions --- RAE --- by Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, Christopher D. Manning (Github)

Shallow Learning Models

2017
LightGBM: A highly efficient gradient boosting decision tree --- LightGBM --- by Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu (Github)
2016
XGBoost: A scalable tree boosting system --- XGBoost --- by Tianqi Chen, Carlos Guestrin (Github)

2001
Random forests --- Random Forests (RF) --- by Leo Breiman (Github)
1998
Text categorization with Support Vector Machines: Learning with many relevant features --- SVM --- by Thorsten Joachims (Github)
1993
C4.5: Programs for Machine Learning --- C4.5 --- by J. Ross Quinlan (Github)
1984
Classification and Regression Trees --- CART --- by Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone (Github)
1967
Nearest neighbor pattern classification --- k-nearest neighbor (KNN) --- by Thomas M. Cover, Peter E. Hart (Github)
1961
Automatic indexing: An experimental inquiry by M. E. Maron (Github)

Datasets

Sentiment Analysis (SA)

SA is the process of analyzing and reasoning over subjective text with emotional color. Unlike traditional text classification, which analyzes the objective content of a text, SA aims to determine whether the text supports a particular point of view. SA can be binary or multi-class: binary SA divides text into two categories, positive and negative, while multi-class SA classifies text into multi-level or fine-grained labels.


• Movie Review (MR)
• Stanford Sentiment Treebank (SST)
• The Multi-Perspective Question Answering (MPQA)
• IMDB reviews
• Yelp reviews
• Amazon Reviews (AM)
News Classification (NC)

News content is one of the most crucial information sources and has a critical influence on people. NC systems help users obtain vital knowledge in real time. News classification applications mainly encompass recognizing news topics and recommending related news according to user interests. The news classification datasets include 20NG, AG, R8, R52, Sogou, and so on. Here we detail several of the primary datasets.


• 20 Newsgroups (20NG)
• AG News (AG)
• R8 and R52
• Sogou News (Sogou)
Topic Labeling (TL)

Topic analysis attempts to get the meaning of a text by identifying its sophisticated themes. Topic labeling is one of the essential components of the topic analysis technique, intending to assign one or more topics to each document to simplify topic analysis.


• DBpedia
• Ohsumed
• Yahoo answers (YahooA)
Question Answering (QA)

The QA task can be divided into two types: extractive QA and generative QA. Extractive QA gives multiple candidate answers for each question, from which the right answer must be chosen; thus, text classification models can be used for the extractive QA task. The QA discussed in this paper is all extractive QA. A QA system can apply a text classification model to recognize the correct answer and set the others as candidates. The question answering datasets include SQuAD, MS MARCO, TREC-QA, WikiQA, and Quora [209]. Here we detail several of the primary datasets.


• Stanford Question Answering Dataset (SQuAD)
• MS MARCO
• TREC-QA
• WikiQA
Natural Language Inference (NLI)

NLI is used to predict whether the meaning of one text can be deduced from another. Paraphrasing is a generalized form of NLI; it uses the task of measuring the semantic similarity of sentence pairs to decide whether one sentence is an interpretation of another. The NLI datasets include SNLI, MNLI, SICK, STS, RTE, SciTail, MSRP, etc. Here we detail several of the primary datasets.


• The Stanford Natural Language Inference (SNLI)
• Multi-Genre Natural Language Inference (MNLI)
• Sentences Involving Compositional Knowledge (SICK)
• Microsoft Research Paraphrase (MSRP)
Dialog Act Classification (DAC)

A dialog act describes an utterance in a dialog based on semantic, pragmatic, and syntactic criteria. DAC labels a piece of a dialog according to its category of meaning and helps learn the speaker's intentions; that is, it assigns a label to an utterance according to the dialog. Here we detail several of the primary datasets, including DSTC 4, MRDA, and SwDA.


• Dialog State Tracking Challenge 4 (DSTC 4)
• ICSI Meeting Recorder Dialog Act (MRDA)
• Switchboard Dialog Act (SwDA)
Multi-label datasets

In multi-label classification, an instance has multiple labels, and each label can only take one of the multiple classes. There are many datasets based on multi-label text classification, including Reuters, Education, Patent, RCV1, RCV1-2K, AmazonCat-13K, Blurb Genre Collection, WOS-11967, AAPD, etc. Here we detail several of the main datasets.


• Reuters news
• Patent Dataset
• Reuters Corpus Volume I (RCV1) and RCV1-2K
• Web of Science (WOS-11967)
• Arxiv Academic Paper Dataset (AAPD)
Others

There are also datasets for other applications, such as Geonames toponyms, Twitter posts, and so on.


Evaluation Metrics

For evaluating text classification models, accuracy and F1 score are the metrics most commonly used to assess text classification methods. Later, with the increasing difficulty of classification tasks or the existence of some particular tasks, the evaluation metrics were improved. For example, metrics such as P@K and Micro-F1 are used to evaluate multi-label text classification performance, and MRR is usually used to estimate the performance of QA tasks.


Single-label metrics

Single-label text classification assigns a text to one of the most likely categories, and is applied in NLP tasks such as QA, SA, and dialogue systems [9]. For single-label text classification, one text belongs to just one catalog, making it possible not to consider the relations among labels. Here we introduce some evaluation metrics used for single-label text classification tasks.


Accuracy and Error Rate

Accuracy and Error Rate are the fundamental metrics for a text classification model. With TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives, and N the total number of samples, Accuracy and Error Rate are respectively defined as

Accuracy = (TP + TN) / N,    ErrorRate = (FP + FN) / N = 1 − Accuracy.
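
As a quick check of these definitions, here is a minimal Python sketch; the label values are made up for illustration:

# Accuracy and error rate over paired gold/predicted labels.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["pos", "neg", "neg", "pos"]   # gold labels (illustrative)
y_pred = ["pos", "neg", "pos", "pos"]   # model predictions (illustrative)
acc = accuracy(y_true, y_pred)          # 3 correct out of 4 = 0.75
err = 1 - acc                           # ErrorRate = 1 - Accuracy = 0.25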


Precision, Recall and F1

These are vital metrics for unbalanced test sets, where accuracy and error rate alone can mislead, for example when most of the test samples carry the same class label. F1 is the harmonic mean of Precision and Recall. Precision, Recall, and F1 are defined as

Precision = TP / (TP + FP),    Recall = TP / (TP + FN),    F1 = 2 · Precision · Recall / (Precision + Recall).


The desired results are obtained when Precision, Recall, and F1 reach 1; conversely, the worst results are obtained when the values approach 0. For a multi-class classification problem, the precision and recall of each class can be calculated separately, and then the performance of individual classes and of the whole can be analyzed.
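
A minimal sketch of the three metrics computed from raw predictions, assuming a binary task with a designated positive class; for a multi-class problem the same function can be called once per class:

# Precision, Recall and F1 from TP/FP/FN counts for one positive class.
def precision_recall_f1(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1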

• Exact Match (EM)
• Mean Reciprocal Rank (MRR)
• Hamming-loss (HL)
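
Of these, MRR is the one most often reported for QA (as noted above): the mean over all questions of 1/rank of the first correct answer. A minimal sketch under that usual definition, pairing each question's ranked candidates with its set of correct answers:

# Mean Reciprocal Rank: average of 1/rank of the first correct answer.
def mean_reciprocal_rank(ranked_lists, gold_sets):
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        # Reciprocal rank of the first correct candidate, 0 if none found.
        total += next((1.0 / (i + 1) for i, c in enumerate(ranked) if c in gold), 0.0)
    return total / len(ranked_lists)
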
Multi-label metrics

Compared with single-label text classification, multi-label text classification divides a text into multiple category labels, and the number of category labels is variable. The metrics above are designed for single-label text classification and are not suitable for multi-label tasks, so there are metrics designed specifically for multi-label text classification.


Micro−F1

Micro−F1 is a measure that considers the overall precision and recall of all labels. With TP_t, FP_t, and FN_t the true positives, false positives, and false negatives for label t in the label set S, Micro−F1 is defined as

Micro−F1 = 2 · P · R / (P + R),  where  P = Σ_{t∈S} TP_t / Σ_{t∈S} (TP_t + FP_t)  and  R = Σ_{t∈S} TP_t / Σ_{t∈S} (TP_t + FN_t).

Macro−F1

Macro−F1 calculates the average F1 of all labels. Unlike Micro−F1, which gives equal weight to every example, Macro−F1 gives equal weight to all labels in the averaging process. Formally, Macro−F1 is defined as

Macro−F1 = (1 / |S|) Σ_{t∈S} 2 · P_t · R_t / (P_t + R_t),  where P_t and R_t are the precision and recall of label t.
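
The contrast between the two averages is easiest to see in code. A minimal sketch, assuming each example's gold and predicted labels are given as sets:

# Micro-F1 pools TP/FP/FN over all labels before computing F1;
# Macro-F1 averages the per-label F1 scores instead.
def micro_macro_f1(y_true, y_pred, labels):
    tp_sum = fp_sum = fn_sum = 0
    per_label_f1 = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
        # F1 = 2TP / (2TP + FP + FN) is equivalent to 2PR / (P + R).
        per_label_f1.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    micro = (2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
             if tp_sum + fp_sum + fn_sum else 0.0)
    macro = sum(per_label_f1) / len(per_label_f1)
    return micro, macro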

In addition to the above evaluation metrics, there are some rank-based evaluation metrics for extreme multi-label classification tasks, including P@K and NDCG@K.


Precision at Top K (P@K)

P@K is the precision at the top k. For P@K, each text has a set of L ground-truth labels Lt = {l0, l1, ..., l(L−1)} and a list of predicted labels Pt = {p0, p1, ..., p(Q−1)} ranked in order of decreasing probability. The precision at k is

P@k = (1/k) Σ_{j=0}^{min(L,k)−1} rel(p_j),  where rel(p_j) = 1 if p_j ∈ Lt and 0 otherwise.
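
A minimal sketch using the common form of the definition, which scores the k highest-probability predicted labels against the ground-truth set; the example values are illustrative:

# Precision at top k: fraction of the top-k predicted labels that are
# in the ground-truth set. `ranked` must be sorted by decreasing probability.
def precision_at_k(ranked, gold, k):
    return sum(label in gold for label in ranked[:k]) / k

ranked = ["sports", "politics", "tech", "health"]  # model ranking (illustrative)
gold = {"sports", "health"}                        # ground-truth labels
p_at_2 = precision_at_k(ranked, gold, 2)           # 1/2 = 0.5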


Normalized Discounted Cumulated Gains (NDCG@K)

The NDCG at k normalizes the discounted cumulated gain of the predicted ranking by that of the ideal ranking:

DCG@k = Σ_{i=1}^{k} rel_i / log2(i + 1),    NDCG@k = DCG@k / IDCG@k,

where rel_i = 1 if the label ranked at position i is in the ground-truth set Lt and 0 otherwise, and IDCG@k is the DCG@k of the ideal ranking that places all relevant labels first.
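
A minimal sketch under binary relevance, where the ideal ranking places all ground-truth labels first; `ranked` is assumed sorted by decreasing probability:

import math

# NDCG@k: DCG of the predicted ranking divided by the DCG of the
# ideal ranking (all relevant labels ranked first).
def ndcg_at_k(ranked, gold, k):
    dcg = sum((label in gold) / math.log2(i + 2)
              for i, label in enumerate(ranked[:k]))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0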

Future Research Challenges

Text classification, as an effective information retrieval and mining technique, plays a vital role in managing text data. It uses NLP, data mining, machine learning, and other techniques to automatically classify and discover different text types. Text classification takes multiple types of text as input, and the text is represented as a vector by a pre-trained model. The vector is then fed into a DNN for training until the termination condition is reached, and finally the performance of the trained model is verified on downstream tasks. Existing models have already shown their usefulness in text classification, but there are still many possible improvements to explore. Although some new text classification models repeatedly refresh the accuracy metrics of most classification tasks, this cannot indicate whether a model "understands" text at the semantic level the way humans do. Moreover, with the emergence of noisy samples, small sample noise may cause substantial changes in decision confidence, and may even reverse decisions. Therefore, the semantic representation ability and robustness of a model need to be demonstrated in practice. In addition, pre-trained semantic representation models represented by word vectors can often improve the performance of downstream NLP tasks, but existing research on transfer strategies for context-free word vectors is still relatively preliminary. Thus, from the perspectives of data, models, and performance, we conclude that text classification mainly faces the following challenges:

Data level

For text classification tasks, data is essential to model performance, whether shallow or deep learning methods are used. The text data studied mainly include multi-chapter documents, short texts, cross-lingual texts, multi-label texts, and few-sample texts. For the characteristics of these data, the existing technical challenges are as follows:

• Zero-shot / few-shot learning
• External knowledge
• Multi-label text classification tasks
• Special domains with many terminologies
Model level

Most existing structures of shallow and deep learning models have been tried for text classification, including ensemble methods. BERT learns a language representation that can be fine-tuned for many NLP tasks. The main approaches are to add data, increase computational power, and design better training procedures to obtain better results; how to trade off data, computational resources, and predictive performance is worth studying.

Performance evaluation level

Shallow and deep models can achieve good performance on most text classification tasks, but the robustness of their results against perturbations needs to be improved. How to interpret deep models is also a technical challenge.

• Semantic robustness of models
• Interpretability of models

Tools and Repos

• NeuralClassifier
• baidu_nlp_project2
• Multi-label