The three documents above record, respectively, the NLP project definition, the text preprocessing and training-set labeling, and the baseline model trained with prodigy's CNN template, which reached 83% accuracy. Building on that, I want to refine the model design and push classification accuracy above 90%. The prodigy developers once made an interesting remark: for simple text-classification problems, a basic (Bayesian) logistic regression classifier already performs very well, often even better than a CNN, because for such problems, whether or not certain words are present is often a better predictor of the label than the context of the words, which is better captured in complicated NNs such as CNNs and RNNs. If the key to the classification is simply to find targeted terms and check whether the input text contains them, without caring much about the context in which they appear, then a linear classifier may well beat an NN, since it captures these straightforward relationships more easily. Inspired by this, I decided on the optimization direction: simplify the model, enrich the input. My instinct is that for my location classification task, the crux is pinning down the telling patterns, which may live at the word level, the character level, or the topic level; all of them can go in as features of the input to a linear model, and I want the model itself to decide which features are actually useful. I therefore formalized the modeling approach as:
- create as many possibly useful features as I can
- throw all the features into the linear model, and rely on LASSO to help filter out/select features that are actually useful in making predictions
- build a Bayesian logistic regression classifier on the selected features
Conveniently, I have been studying Bayesian learning systematically this semester, so I already have all the techniques needed to implement this model. I also have a very detailed list of the features commonly used in text models; I will follow that list, implement them one by one in code, feed them into LASSO, and see how it goes. Finally I will build the Bayesian classifier.
So, the tasks ahead are:
- feature engineering
- implement LASSO
- implement the Bayesian logistic regression classifier
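The LASSO step of the plan can be sketched with sklearn, using L1-penalized logistic regression as the filter; the feature matrix `X` and labels `y` below are toy stand-ins, not the project's real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Toy stand-ins for the engineered feature matrix and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# L1-penalized logistic regression acts as the LASSO filter:
# features whose coefficients are driven to zero are dropped.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(lasso).fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)  # (100, k) with k <= 20 surviving features
```

The surviving columns then become the input to the Bayesian classifier in the final step.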
STEP 3.1: feature engineering
In @this article, the author walks through feature engineering for common NLP models; the features covered include:
- Count Vectors as features
- TF-IDF Vectors as features:
  1. Word level
  2. N-Gram level
  3. Character level
- Word Embeddings as features:
  1. Glove
  2. FastText
  3. Word2Vec
- Text / NLP based features:
  1. Word Count of the documents – total number of words in the documents
  2. Character Count of the documents – total number of characters in the documents
  3. Average Word Density of the documents – average length of the words used in the documents
  4. Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
  5. Upper Case Count in the Complete Essay – total number of upper-case words in the documents
  6. Frequency distribution of Part of Speech Tags: Noun Count, Verb Count, Adjective Count, Adverb Count, Pronoun Count
- Topic Models as features
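The "Text / NLP based features" group needs no library beyond the standard one; a minimal sketch (the function name `text_stats` is mine, and the POS-tag counts are omitted since they need a tagger):

```python
import string

def text_stats(doc):
    """Document-level statistics: word count, character count,
    average word length, punctuation count, upper-case word count."""
    words = doc.split()
    word_count = len(words)
    char_count = len(doc)
    avg_word_len = sum(len(w) for w in words) / max(word_count, 1)
    punct_count = sum(ch in string.punctuation for ch in doc)
    upper_count = sum(w.isupper() for w in words)
    return [word_count, char_count, avg_word_len, punct_count, upper_count]

print(text_stats("The HQ is in Paris, France."))
```

Each returned list becomes one row of extra columns appended to the feature matrix.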
Besides feature engineering, the article also lists the common classification models, including:
- Naive Bayes Classifier
- Linear Classifier
- Support Vector Machine
- Bagging Models
- Boosting Models
- Shallow Neural Networks
- Deep Neural Networks
- Convolutional Neural Network (CNN)
- Long Short Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Recurrent Convolutional Neural Network (RCNN)
For now I want to try the first five, the non-NN methods; depending on how they perform, I may consider the NN models later if the results fall short.
So next, let's implement everything in code, step by step.
STEP 3.1.1: feature engineering - Count Vectors as features
We can call sklearn's CountVectorizer module directly: pass in a list of sentences and it returns a count matrix, where each row is the count vector of one document (i.e. one sentence) and each column corresponds to one vocabulary word, holding its counts across the documents.
from sklearn.feature_extraction.text import CountVectorizer

def count_vector(sentence_list, vectorizer=None):
    # Reuse a fitted vectorizer when one is given, so that the
    # validation set is encoded with the training vocabulary.
    if vectorizer is None:
        vectorizer = CountVectorizer().fit(sentence_list)
        # print('Vocabulary size:', len(vectorizer.vocabulary_))
    vector = vectorizer.transform(sentence_list)
    return vector.toarray(), vectorizer

training_count_vector, fitted_vectorizer = count_vector(training_sentence_list)
## shape of this count vector: (number of documents, vocabulary size)
validation_count_vector, _ = count_vector(validation_sentence_list, fitted_vectorizer)
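A quick sanity check on a toy corpus (the real `training_sentence_list` comes from the earlier preprocessing step; the sentences here are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_sentences = ["paris is in france", "tokyo is in japan", "paris and tokyo"]
vectorizer = CountVectorizer().fit(toy_sentences)
matrix = vectorizer.transform(toy_sentences).toarray()

print(sorted(vectorizer.vocabulary_))  # ['and', 'france', 'in', 'is', 'japan', 'paris', 'tokyo']
print(matrix.shape)  # (3, 7): 3 documents, 7 vocabulary words
```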
STEP 3.1.2: feature engineering - TF-IDF Vectors as features
1. Word level
2. N-Gram level
3. Character level
The idea of TF-IDF is to build a weighted count vector, where the weight assigned to each word reflects not only how often it appears in a given document but also how the rest of the corpus uses it. TF-IDF has two parts: TF stands for Term Frequency, the word's relative frequency within the document; IDF stands for Inverse Document Frequency, based on the inverse of the fraction of documents that contain the word, usually log-scaled. Mathematically, it can be written as:
w_{i,j} = tf_{i,j} * idf_i

where tf_{i,j} is the relative frequency of term i in document j, and idf_i = log(N / df_i), with N the total number of documents and df_i the number of documents containing term i.
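This is, up to smoothing and normalization details, what sklearn's TfidfVectorizer computes; a sketch on a toy corpus covering the three levels listed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the office is in berlin", "berlin is the capital", "offices in boston"]

# Word level: each feature is a single token.
word_tfidf = TfidfVectorizer(analyzer="word").fit_transform(docs)

# N-gram level: features are contiguous runs of 2-3 tokens.
ngram_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(2, 3)).fit_transform(docs)

# Character level: features are 2-3 character n-grams within word boundaries.
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(docs)

print(word_tfidf.shape, ngram_tfidf.shape, char_tfidf.shape)
```

Each call returns a sparse matrix with one row per document; the three variants can be concatenated column-wise into one big feature matrix for the LASSO step.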