Preface
Kaggle's starter competition (Bag of Words Meets Bags of Popcorn) is actually a word2vec tutorial, but this article does not use word2vec. Instead, it vectorizes sentences with TF-IDF and then trains and predicts with logistic regression, multinomial Naive Bayes, and SGDClassifier. Submitting the logistic regression results to Kaggle gives a score of 0.88+, with a rank close to 300. Still, as a first attempt it is a reasonable result, and its strength is simplicity.
A brief introduction to TF-IDF: TF is term frequency, i.e. how often a word occurs in the sentence it belongs to, and IDF(word) = log(N/(N(word)+α)), where N is the total number of sentences, N(word) is the number of sentences containing the word, and α keeps the denominator from being 0. The larger a word's TF-IDF, the more frequently it appears in its own sentence and the more rarely it appears in other sentences; in other words, TF-IDF measures how well a word represents the sentence it belongs to.
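The formula above can be checked on a toy corpus. The sketch below is illustrative only: the three documents, the `alpha` smoothing constant, and the helper names `tf`/`idf`/`tfidf` are my own assumptions, not part of the original code (sklearn's `TfidfVectorizer`, used later, applies a slightly different smoothing and normalization).

```python
import math

# Toy corpus: three "sentences" (documents), chosen for illustration
docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting",
]

N = len(docs)   # total number of documents
alpha = 1       # smoothing constant so the denominator is never 0

def tf(word, doc):
    """Term frequency: fraction of tokens in `doc` equal to `word`."""
    tokens = doc.split()
    return tokens.count(word) / len(tokens)

def idf(word):
    """Inverse document frequency: log(N / (N(word) + alpha))."""
    n_word = sum(word in doc.split() for doc in docs)
    return math.log(N / (n_word + alpha))

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

# "the" appears in 2 of 3 docs -> idf = log(3/3) = 0 (carries no information)
# "acting" appears in 1 doc    -> idf = log(3/2) ≈ 0.405 (more distinctive)
print(idf("the"))
print(idf("acting"))
print(tfidf("terrible", docs[1]))
```

Note how the common word "the" gets an IDF of zero while the rarer "terrible" keeps a positive weight, which is exactly the "represents its sentence" intuition described above.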
Code Implementation
Reference: https://github.com/jmsteinw/Notebooks/blob/master/NLP_Movies.ipynb
# coding: utf-8
import pandas as pd
import os
from lxml import etree
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
# load the data
path = r"H:\PyCharmProjects\Popcorn\data"  # raw string so backslashes are not treated as escapes
t_set_df = pd.read_csv(os.path.join(path, "labeledTrainData.tsv"), header=0, sep='\t')
test_df = pd.read_csv(os.path.join(path, "testData.tsv"), header=0, sep='\t')