Text Classification

Introduction
Text classification is the task of assigning an appropriate category to a sentence or document. The categories depend on the chosen dataset and range from broad topics to fine-grained labels such as sentiment.

In the traditional machine learning era, mainstream classifiers were mainly based on Naive Bayes, Maximum Entropy, K-NN, and SVM. This article introduces deep learning algorithms for text classification.

Models based on CNN
CNNs first achieved extraordinary results in computer vision and later made a major breakthrough in NLP, especially text classification. Convolutional neural networks are effective for text categorization because they can extract salient features regardless of where they appear in the input sequence. In particular, text classification tasks can use a CNN to extract key n-grams from sentences.

Text CNN
Each sentence is mapped to embedding vectors and fed into the model as a matrix. Convolutions with kernels of different sizes are applied over the input words, each feature map is processed by a max pooling layer, and the pooled features are concatenated to summarize the extracted features.
TextCNN Introduction and Practical
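As a rough illustration, here is a minimal PyTorch sketch of the TextCNN idea described above; the vocabulary size, embedding dimension, kernel sizes, and number of classes are placeholder values, not settings from the original paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Minimal TextCNN: embedding -> parallel convolutions -> max pooling -> concat -> linear."""
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size; each kernel covers a different n-gram width.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                # (batch, embed_dim, seq_len) for Conv1d
        # Max-pool each feature map over time, then concatenate all pooled features.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

# Example: a batch of 4 sentences, each padded to 20 token ids.
logits = TextCNN()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```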
CharCNN
CharCNN is a text classification model that works at the character level. The authors show that when the training set is large enough, the convolutional network needs neither word-level semantics nor information such as the syntactic structure of the language. Beyond that, this is an exciting simplification: since the input is just characters regardless of the language, the approach is well suited to building cross-language systems.
CharCNN Introduction and practical
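Below is a simplified sketch of the character-level idea. The toy alphabet, the shortened input length, and the shallow network are placeholders for illustration; the published CharCNN uses an alphabet of about 70 characters, a fixed input length of 1014, and a deeper stack of convolutional and fully connected layers.

```python
import torch
import torch.nn as nn

# Toy character quantization: each character in a small alphabet becomes a one-hot column.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "
MAX_LEN = 128

def quantize(text):
    out = torch.zeros(len(ALPHABET), MAX_LEN)
    for i, ch in enumerate(text.lower()[:MAX_LEN]):
        j = ALPHABET.find(ch)
        if j >= 0:
            out[j, i] = 1.0
    return out

# A shallow character-level CNN classifier (much smaller than the original architecture).
model = nn.Sequential(
    nn.Conv1d(len(ALPHABET), 64, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
    nn.Conv1d(64, 64, kernel_size=3), nn.ReLU(), nn.AdaptiveMaxPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 2),
)

batch = torch.stack([quantize("character level models need no tokenizer"),
                     quantize("they work the same way for any language")])
print(model(batch).shape)  # torch.Size([2, 2])
```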

Models based on RNN
Although CNNs perform well in many tasks, their biggest problem is the fixed receptive field determined by filter_size: on one hand they cannot model longer-range sequence information, and on the other hand tuning the filter_size hyperparameter is complex. In essence, a CNN performs feature extraction over the text, whereas RNNs are more commonly used in NLP because they better capture contextual information. RNNs appear in many settings such as sequence labeling, named entity recognition, and seq2seq models.

TextRNN
This model is similar to TextCNN, but the Conv+Pooling stage above is replaced with a Bi-LSTM. The final hidden states of both directions are concatenated and passed to the output layer.
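A minimal PyTorch sketch of this setup (placeholder sizes, a single Bi-LSTM layer) might look like this:

```python
import torch
import torch.nn as nn

class TextRNN(nn.Module):
    """Minimal TextRNN: embedding -> Bi-LSTM -> concat final states of both directions -> linear."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)                     # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=1)         # splice forward and backward final states
        return self.fc(h)

logits = TextRNN()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```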
Text RCNN
To address the limitations of CNNs and RNNs, the authors propose a new network architecture that uses a bidirectional recurrent structure to capture context information, introducing considerably less noise than traditional window-based neural networks.
Text RCNN Introduction and Practical
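The following is a rough PyTorch sketch of the RCNN idea described above, not a faithful reproduction of the paper's exact architecture: a Bi-LSTM stands in for the recurrent context encoder, all sizes are placeholders, and the left/right context is simply the bidirectional output concatenated with the word embedding before max pooling.

```python
import torch
import torch.nn as nn

class TextRCNN(nn.Module):
    """Sketch of the RCNN idea: a bidirectional RNN supplies context for each word,
    the context is concatenated with the word embedding, and max pooling over time
    keeps the strongest features for classification."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(embed_dim + 2 * hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        emb = self.embedding(token_ids)                 # (batch, seq_len, embed_dim)
        ctx, _ = self.rnn(emb)                          # (batch, seq_len, 2*hidden_dim)
        x = torch.tanh(self.proj(torch.cat([emb, ctx], dim=2)))
        x = x.max(dim=1).values                         # max pooling over the time dimension
        return self.fc(x)

logits = TextRCNN()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```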

Attention mechanism
The attention mechanism is used to improve RNN-based (LSTM or GRU) encoder-decoder models. It is very popular at present and is widely used in fields such as machine translation, speech recognition, and image captioning. Attention gives the model the ability to distinguish between different parts of the input. For example, in machine translation each word in a sentence is given a different weight, and the attention weights themselves can be read as an alignment between the input and output sentences, which helps explain what the model has learned.
Attention Model

In the global attention model, we define the conditional probability of the target word as:

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t), \qquad \tilde{h}_t = \tanh(W_c [c_t; h_t])

Here h_t is the decoder hidden state at time step t, c_t is the source-side context vector, and W_s and W_c are parameter matrices.

c_t = \sum_i a_{t,i} \bar{h}_i, \qquad a_{t,i} = \frac{\exp(\mathrm{score}(h_t, \bar{h}_i))}{\sum_j \exp(\mathrm{score}(h_t, \bar{h}_j))}

The context vector c_t is the weighted sum of all the encoder hidden states. The alignment vector a_{t,i} is computed in two steps: first score the decoder hidden state at time step t against each encoder hidden state, then apply a softmax over the scores.

Once the context vector is obtained, the attentional hidden state and the conditional probability above can be computed.
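To make the two steps concrete, here is a small numerical PyTorch sketch of one decoder time step. The dimensions are arbitrary and a simple dot-product score is assumed, since the text above does not fix a particular score function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, src_len, vocab_size = 8, 5, 100
encoder_states = torch.randn(src_len, hidden_dim)    # all encoder hidden states
h_t = torch.randn(hidden_dim)                        # decoder hidden state at step t

# Step 1: score the decoder state against every encoder state, then softmax -> alignment a_{t,i}.
scores = encoder_states @ h_t                        # (src_len,)
a_t = F.softmax(scores, dim=0)

# Step 2: the context vector c_t is the weighted sum of the encoder states.
c_t = a_t @ encoder_states                           # (hidden_dim,)

# Attentional hidden state and the conditional probability over the target vocabulary.
W_c = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)
W_s = nn.Linear(hidden_dim, vocab_size, bias=False)
h_tilde = torch.tanh(W_c(torch.cat([c_t, h_t])))
p_y = F.softmax(W_s(h_tilde), dim=0)                 # p(y_t | y_<t, x)
print(a_t.sum().item(), p_y.shape)                   # ~1.0, torch.Size([100])
```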

HAN
This article proposes a hierarchical attention model for document-level classification tasks. The model has two strengths: 1) A document has a hierarchical structure: words form sentences and sentences form a document. Accordingly, the model is also hierarchical, modeling the word level and the sentence level separately to build the final document representation. 2) The model applies attention at two levels, the word level and the sentence level respectively. The attention mechanism lets the model assign different weights to different words and sentences, making the final document representation more accurate and effective.

HAN Introduction and practical
https://zhuanlan.zhihu.com/p/53342715
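A compact PyTorch sketch of the two-level idea might look like the following. The dimensions are toy values, and the same additive attention pooling layer is reused in form at both the word and sentence levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Attention pooling used at both HAN levels: score each position against a learned
    context vector, softmax the scores, and return the weighted sum of the inputs."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))

    def forward(self, x):                              # x: (batch, steps, dim)
        u = torch.tanh(self.proj(x))
        weights = F.softmax(u @ self.context, dim=1)   # (batch, steps)
        return (weights.unsqueeze(2) * x).sum(dim=1)   # (batch, dim)

class HAN(nn.Module):
    """Toy hierarchical attention network: words -> sentence vectors -> document vector."""
    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.word_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.word_attn = AttentionPool(2 * hidden_dim)
        self.sent_rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.sent_attn = AttentionPool(2 * hidden_dim)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, docs):                           # docs: (batch, n_sents, n_words)
        b, s, w = docs.shape
        words = self.embedding(docs.view(b * s, w))    # encode every sentence's words
        word_out, _ = self.word_rnn(words)
        sent_vecs = self.word_attn(word_out).view(b, s, -1)
        sent_out, _ = self.sent_rnn(sent_vecs)         # encode the sequence of sentence vectors
        return self.fc(self.sent_attn(sent_out))

logits = HAN()(torch.randint(0, 10000, (2, 3, 10)))    # 2 documents, 3 sentences, 10 words each
print(logits.shape)  # torch.Size([2, 2])
```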

Models based on Transformers
BERT
BERT is a language representation model trained with massive data, a huge model, and enormous computational cost. It achieves state-of-the-art (SOTA) results on 11 natural language processing tasks.

It gives us several valuable lessons:

Deep learning is representation learning: “We show that pre-trained representations eliminate the needs of many heavily engineered task-specific architectures”. For most of the 11 tasks on which BERT sets new records, only a linear output layer is added on top of the pre-trained representation, which is then fine-tuned. Even in sequence labeling tasks (e.g. NER), the dependencies among output labels are ignored (i.e. no autoregressive decoding and no CRF), yet BERT still surpasses the previous SOTA, which shows the strength of its representation learning. A minimal fine-tuning sketch is given after these points.
Scale does matter: “One of our core claims is that the deep bi-directionality of BERT, which is enabled by masked LM pre-training, is the single most important improvement of BERT compared to previous work”. Using a mask in a language model is not new to most people, but the BERT authors verified its powerful representation-learning ability with such a large-scale combination of data, model, and compute. Similar ideas may well have been proposed and tested by other laboratories before, but because of the limitations of scale their potential was never fully exploited, and they were unfortunately drowned in the flood of papers.

Pre-training is important: “We believe that this is the first work to demonstrate that scaling to extreme model sizes also leads to large improvements on very small-scale tasks, provided that the model has been pre-trained”. Pre-training has long been widely used in various fields (e.g. ImageNet in CV, Word2Vec in NLP), mostly in the form of large models trained on big data, and such models can bring improvements even to small-scale tasks.
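As a sketch of the fine-tuning recipe described above, the snippet below loads a pre-trained BERT with a fresh linear classification head and runs one training step. The Hugging Face transformers library and the bert-base-uncased checkpoint are illustrative choices, not something prescribed by the original text.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT encoder with a new linear classification head on top.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize a tiny batch and run one forward pass; fine-tuning just repeats this with an optimizer.
inputs = tokenizer(["the movie was great", "the movie was terrible"],
                   padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**inputs, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, torch.Size([2, 2])

# One fine-tuning step: only the small classification head is new; everything else is pre-trained.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()
optimizer.step()
```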

Google Bert Introduction
https://zhuanlan.zhihu.com/p/46652512
