The three documents above record, respectively, the NLP project definition, the text preprocessing and training-set labeling, and the baseline model trained with prodigy's CNN template, which reached 83% accuracy. Building on that, I want to refine the model design and push classification accuracy above 90%. The prodigy developers once made an interesting remark: for simple text-classification problems, a basic (Bayesian) logistic regression classifier already performs very well, often even better than a CNN, because for such problems, whether or not certain words are present is often a better predictor of the label than the context of the words, which is better captured in complicated NNs such as CNNs and RNNs. If the key to the classification is simply to find targeted terms and check whether the input text contains them, without caring much about the context in which they appear, then a linear classifier may well beat an NN, since it captures these straightforward relationships more easily. Inspired by this, I decided on the optimization direction: simplify the model, enrich the input. My instinct is that for my location classification task, the crux is pinning down the telling patterns, which may live at the word level, the character level, or the topic level; all of them can go in as features of the input to a linear model, and I want the model itself to decide which features are actually useful. I therefore formalized the modeling approach as:
- create as many possibly useful features as I can
- throw all the features into the linear model, and rely on LASSO to help filter out/select features that are actually useful in making predictions
- build a Bayesian logistic regression classifier on the selected features
Conveniently, I have been studying Bayesian learning systematically this semester, so I already have all the techniques needed to implement this model. I also have a very detailed list of the features commonly used in text models; I will follow that list, implement them one by one in code, feed them into LASSO, and see how it goes. Finally I will build the Bayesian classifier.
So, the tasks ahead are:
- feature engineering
- implement LASSO
- implement the Bayesian logistic regression classifier
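The LASSO step of the plan can be sketched with sklearn, using L1-penalized logistic regression as the filter; the feature matrix `X` and labels `y` below are toy stand-ins, not the project's real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Toy stand-ins for the engineered feature matrix and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# L1-penalized logistic regression acts as the LASSO filter:
# features whose coefficients are driven to zero are dropped.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(lasso).fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)  # (100, k) with k <= 20 surviving features
```

The surviving columns then become the input to the Bayesian classifier in the final step.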
STEP 3.1: feature engineering
In @this article, the author walks through feature engineering for common NLP models; the features covered include:
- Count Vectors as features
- TF-IDF Vectors as features:
  1. Word level
  2. N-Gram level
  3. Character level
- Word Embeddings as features:
  1. Glove
  2. FastText
  3. Word2Vec
- Text / NLP based features:
  1. Word Count of the documents – total number of words in the documents
  2. Character Count of the documents – total number of characters in the documents
  3. Average Word Density of the documents – average length of the words used in the documents
  4. Punctuation Count in the Complete Essay – total number of punctuation marks in the documents
  5. Upper Case Count in the Complete Essay – total number of upper-case words in the documents
  6. Frequency distribution of Part of Speech Tags: Noun Count, Verb Count, Adjective Count, Adverb Count, Pronoun Count
- Topic Models as features
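The "Text / NLP based features" group needs no library beyond the standard one; a minimal sketch (the function name `text_stats` is mine, and the POS-tag counts are omitted since they need a tagger):

```python
import string

def text_stats(doc):
    """Document-level statistics: word count, character count,
    average word length, punctuation count, upper-case word count."""
    words = doc.split()
    word_count = len(words)
    char_count = len(doc)
    avg_word_len = sum(len(w) for w in words) / max(word_count, 1)
    punct_count = sum(ch in string.punctuation for ch in doc)
    upper_count = sum(w.isupper() for w in words)
    return [word_count, char_count, avg_word_len, punct_count, upper_count]

print(text_stats("The HQ is in Paris, France."))
```

Each returned list becomes one row of extra columns appended to the feature matrix.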
Besides feature engineering, the article also lists the common classification models, including:
- Naive Bayes Classifier
- Linear Classifier
- Support Vector Machine
- Bagging Models
- Boosting Models
- Shallow Neural Networks
- Deep Neural Networks
- Convolutional Neural Network (CNN)
- Long Short Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Bidirectional RNN
- Recurrent Convolutional Neural Network (RCNN)
For now I want to try the first five, the non-NN methods; depending on how they perform, I may consider the NN models later if the results fall short.
So next, let's implement everything in code, step by step.
STEP 3.1.1: feature engineering - Count Vectors as features
We can call sklearn's CountVectorizer module directly: pass in a list of sentences and it returns a count matrix, where each row is the count vector of one document (i.e. one sentence) and each column corresponds to one vocabulary word, holding its counts across the documents.
from sklearn.feature_extraction.text import CountVectorizer

def count_vector(sentence_list, vectorizer=None):
    # Reuse a fitted vectorizer when one is given, so that the
    # validation set is encoded with the training vocabulary.
    if vectorizer is None:
        vectorizer = CountVectorizer().fit(sentence_list)
        # print('Vocabulary size:', len(vectorizer.vocabulary_))
    vector = vectorizer.transform(sentence_list)
    return vector.toarray(), vectorizer

training_count_vector, fitted_vectorizer = count_vector(training_sentence_list)
## shape of this count vector: (number of documents, vocabulary size)
validation_count_vector, _ = count_vector(validation_sentence_list, fitted_vectorizer)
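A quick sanity check on a toy corpus (the real `training_sentence_list` comes from the earlier preprocessing step; the sentences here are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_sentences = ["paris is in france", "tokyo is in japan", "paris and tokyo"]
vectorizer = CountVectorizer().fit(toy_sentences)
matrix = vectorizer.transform(toy_sentences).toarray()

print(sorted(vectorizer.vocabulary_))  # ['and', 'france', 'in', 'is', 'japan', 'paris', 'tokyo']
print(matrix.shape)  # (3, 7): 3 documents, 7 vocabulary words
```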
STEP 3.1.2: feature engineering - TF-IDF Vectors as features
1. Word level
2. N-Gram level
3. Character level
The idea of TF-IDF is to build a weighted count vector, where the weight assigned to each word reflects not only how often it appears in a given document but also how the rest of the corpus uses it. TF-IDF has two parts: TF stands for Term Frequency, the word's relative frequency within the document; IDF stands for Inverse Document Frequency, based on the inverse of the fraction of documents that contain the word, usually log-scaled. Mathematically, it can be written as:
w_{i,j} = tf_{i,j} * idf_i

where tf_{i,j} is the relative frequency of term i in document j, and idf_i = log(N / df_i), with N the total number of documents and df_i the number of documents containing term i.
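This is, up to smoothing and normalization details, what sklearn's TfidfVectorizer computes; a sketch on a toy corpus covering the three levels listed above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the office is in berlin", "berlin is the capital", "offices in boston"]

# Word level: each feature is a single token.
word_tfidf = TfidfVectorizer(analyzer="word").fit_transform(docs)

# N-gram level: features are contiguous runs of 2-3 tokens.
ngram_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(2, 3)).fit_transform(docs)

# Character level: features are 2-3 character n-grams within word boundaries.
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(docs)

print(word_tfidf.shape, ngram_tfidf.shape, char_tfidf.shape)
```

Each call returns a sparse matrix with one row per document; the three variants can be concatenated column-wise into one big feature matrix for the LASSO step.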