Yelp NLP Text Classification Modeling - Feature Engineering

@Yelp NLP Project Introduction

@Text Preprocessing

@Creating the Training Set and Baseline Model

The three documents above record, respectively, the NLP project definition, the text preprocessing, and the labeling of the training set together with a baseline model trained on Prodigy's CNN template; that baseline reached 83% accuracy. Building on it, I want to further refine the model design and push the classification accuracy above 90%. The developer of Prodigy once made an interesting observation: for simple text classification problems, a basic (Bayesian) logistic regression classifier can achieve very good results, often even better than a CNN, because for these simple problems, whether or not certain words are present can be a better predictor of the label than the context of the words, which is better captured by complicated NNs such as CNNs and RNNs. If the key to the classification is just to find targeted terms and decide whether the input text contains them, without caring much about the context in which they are used, then a linear classifier may outperform an NN, because the former captures these straightforward relationships more easily. This inspired me to set the direction of optimization as: simplify the model, and enrich the input. My instinct is that for my location classification task, the crux is to lock onto the key patterns, which may live at the word level, the character level, or the topic level; all of them can be encoded as features of the input and fed into a linear model, and which features turn out to be useful is something I would rather leave for the model itself to decide. I therefore formalized the modeling approach as:

  1. create as many potentially useful features as I can
  2. throw all the features into the linear model, and rely on LASSO to filter out/select the features that are actually useful for prediction (see the sketch after this list)
  3. build a Bayesian logistic regression classifier based on these filtered features
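
To make step 2 concrete before the feature work begins, here is a minimal sketch of LASSO-style selection, assuming scikit-learn and placeholder names X_train / y_train for whatever feature matrix and labels the following steps produce. An L1-penalized logistic regression plays the role of LASSO for classification: features whose coefficients are shrunk to exactly zero are dropped.

import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_select(X_train, y_train, C=1.0):
    # The L1 penalty shrinks uninformative coefficients to exactly zero,
    # which is what lets LASSO double as a feature selector.
    # C is the inverse regularization strength: smaller C keeps fewer features.
    lasso = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    lasso.fit(X_train, y_train)
    # Indices of the features that survived the shrinkage.
    return np.flatnonzero(lasso.coef_.ravel())

## hypothetical usage once the feature matrices exist:
## selected = lasso_select(X_train, y_train)
## X_train_reduced = X_train[:, selected]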

Conveniently, I have been systematically studying Bayesian learning this semester, so I already have full command of all the techniques needed to implement this model. I also have a very detailed list of the features commonly used in text models; next I will follow that list, implement them in code one by one, throw them into LASSO, and see how they perform. Finally, I will build the Bayesian classifier.

So, the tasks I need to work through next are as follows:

  1. feature engineering
  2. implement LASSO
  3. implement naive Bayesian classifier

STEP 3.1: feature engineering

In @this article, the author walks through feature engineering for common NLP models in detail. The features covered include:

  • Count Vectors as features

  • TF-IDF Vectors as features
    1. Word level
    2. N-Gram level
    3. Character level

  • Word Embeddings as features:
    1. GloVe
    2. FastText
    3. Word2Vec

  • Text / NLP based features:

     1. Word Count of the documents – total number of words in the documents
     2. Character Count of the documents – total number of characters in the documents
     3. Average Word Density of the documents – average length of the words used in the documents
     4. Punctuation Count of the documents – total number of punctuation marks in the documents
     5. Upper Case Count of the documents – total number of upper case words in the documents
     6. Frequency distribution of Part of Speech Tags:
     	- Noun Count
     	- Verb Count
     	- Adjective Count
     	- Adverb Count
     	- Pronoun Count
    
  • Topic Models as features
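
For the text/NLP based features in the list above, most of the counts are one-liners in plain Python; only the POS-tag distribution needs a tagger. A rough sketch, assuming NLTK and its default English tokenizer/tagger models are installed (spaCy would work just as well); reading word density as characters per word is my interpretation of the list above:

import string
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' models are downloaded

def text_stats(doc):
    words = doc.split()
    n_words = len(words)
    stats = {
        'word_count': n_words,
        'char_count': len(doc),
        # average length of the words used in the document
        'avg_word_density': len(doc) / max(n_words, 1),
        'punctuation_count': sum(1 for ch in doc if ch in string.punctuation),
        'upper_case_count': sum(1 for w in words if w.isupper()),
    }
    # Frequency distribution of part-of-speech tags, bucketed by
    # Penn Treebank tag prefix (e.g. NN/NNS/NNP all count as nouns).
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(doc))]
    for name, prefix in [('noun_count', 'NN'), ('verb_count', 'VB'),
                         ('adj_count', 'JJ'), ('adv_count', 'RB'),
                         ('pron_count', 'PR')]:
        stats[name] = sum(1 for t in tags if t.startswith(prefix))
    return stats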

Besides feature engineering, the article also lists all the common classification models, including:

  • Naive Bayes Classifier
  • Linear Classifier
  • Support Vector Machine
  • Bagging Models
  • Boosting Models
  • Shallow Neural Networks
  • Deep Neural Networks
  • Convolutional Neural Network (CNN)
  • Long Short Term Memory (LSTM)
  • Gated Recurrent Unit (GRU)
  • Bidirectional RNN
  • Recurrent Convolutional Neural Network (RCNN)

What I want to try for now are the first five, non-NN methods; depending on how things go, if the model's performance is not satisfactory, I will then consider the other NN models.

So next, let's implement these step by step in code.

STEP 3.1.1: feature engineering - Count Vectors as features

We can directly call the CountVectorizer module from sklearn: pass in a list of sentences and it outputs a count matrix, in which each row is the count vector of one document (i.e., one sentence) and each column records one vocabulary word's frequency across the documents. One caveat: the vectorizer must be fit on the training sentences only and then reused to transform the validation sentences, so that both matrices share the same vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

def count_vectors(train_sentences, validation_sentences):
    # Fit the vocabulary on the training sentences only, then reuse the
    # fitted vectorizer on the validation sentences so that both count
    # matrices share the same vocabulary.
    vectorizer = CountVectorizer()
    train_vector = vectorizer.fit_transform(train_sentences)
    #print('Vocabulary size: ')
    #print(len(vectorizer.vocabulary_))
    validation_vector = vectorizer.transform(validation_sentences)
    return train_vector.toarray(), validation_vector.toarray()

training_count_vector, validation_count_vector = count_vectors(
    training_sentence_list, validation_sentence_list)
## shape of each count vector: (number of documents, vocabulary size)

STEP 3.1.2: feature engineering - TF-IDF Vectors as features
	1. Word level
	2. N-Gram level
	3. Character level

The logic of TF-IDF is to obtain a weighted count vector, so that the weight assigned to each word reflects not only how frequently the word appears in a particular document, but also takes into account the frequencies of the other words in that document and the word's frequency across the other documents. TF-IDF consists of two parts: TF stands for Term Frequency, the word's relative frequency within a particular document; IDF stands for Inverse Document Frequency, the reciprocal of the fraction of documents in which the word appears (usually log-scaled). Mathematically, this can be written as:
w_{i,j} = tf_{i,j} * idf_{i}
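
All three TF-IDF variants listed above can be produced with sklearn's TfidfVectorizer by switching its analyzer and ngram_range parameters; note that sklearn's IDF is smoothed and log-scaled by default rather than a raw reciprocal. A minimal sketch, assuming the same training/validation sentence lists as before (the ngram_range choices and the max_features cap of 5000 are arbitrary placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_vectors(train_sentences, validation_sentences):
    # Word level: each feature is a single word's TF-IDF weight.
    word_tfidf = TfidfVectorizer(analyzer='word', max_features=5000)
    # N-Gram level: features are TF-IDF weights of 2- and 3-word phrases.
    ngram_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(2, 3),
                                  max_features=5000)
    # Character level: features are TF-IDF weights of 2- and 3-character
    # n-grams, which can pick up sub-word patterns.
    char_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2, 3),
                                 max_features=5000)
    features = {}
    for name, vec in [('word', word_tfidf), ('ngram', ngram_tfidf),
                      ('char', char_tfidf)]:
        # As with the count vectors: fit on the training data only,
        # then transform both splits with the same fitted vectorizer.
        train = vec.fit_transform(train_sentences)
        valid = vec.transform(validation_sentences)
        features[name] = (train.toarray(), valid.toarray())
    return features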
