NLP (VI): Text Sentiment Classification with sklearn (Part 1)
In this post we train classification models with sklearn to perform sentiment classification on text data.
Getting the Data
The data for this post has been uploaded and can be downloaded from the links below:
Training set: https://download.csdn.net/download/swy_swy_swy/87581709
Test set: https://download.csdn.net/download/swy_swy_swy/87581702
The labels are binary: 0 denotes negative sentiment and 1 denotes positive sentiment.
Loading the Data
We use pandas to read the csv files.
import sklearn
import pandas as pd
import numpy as np
def csv_loader(filepath):
    return pd.read_csv(filepath)
twitter_train_df = csv_loader('sentiment-train.csv')
twitter_test_df = csv_loader('sentiment-test.csv')
Text Vectorization
How does sklearn actually "read" a text? It does not understand human language; to it, a piece of text is just a collection of words, a "bag of words". Before training a model we must "vectorize" every text: each word in the vocabulary is assigned an index, and the same word gets the same index across all samples of the same dataset. Each text then becomes a vector whose entry at a word's index records how often that word occurs in the text.
We use sklearn's CountVectorizer for the vectorization:
from sklearn.feature_extraction.text import CountVectorizer
def feature_extracter(train_df, test_df, binary_flag=False, m_features=1000, has_test=True):
    # Note: the vectorizer is fitted on the combined train and test texts,
    # so both splits share a single vocabulary.
    vectorizer = CountVectorizer(stop_words='english', max_features=m_features, binary=binary_flag)
    train_texts = np.array(train_df['text']).tolist()
    test_texts = []
    if has_test:
        test_texts = np.array(test_df['text']).tolist()
    vecs = vectorizer.fit_transform(train_texts + test_texts).toarray()
    train_X = vecs[:len(train_texts)]
    test_X = []
    if has_test:
        test_X = vecs[len(train_texts):]
    train_y = np.array(train_df['sentiment']).tolist()
    test_y = []
    if has_test:
        test_y = np.array(test_df['sentiment']).tolist()
    return train_X, test_X, train_y, test_y
twitter_train_X, twitter_test_X, twitter_train_y, twitter_test_y = feature_extracter(twitter_train_df, twitter_test_df)
twitter_train_bin_X, twitter_test_bin_X, twitter_train_bin_y, twitter_test_bin_y = feature_extracter(twitter_train_df, twitter_test_df, binary_flag=True)
Training a Naive Bayes Model
After vectorization we can train a naive Bayes model and check its accuracy on the test set.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(twitter_train_X, np.array(twitter_train_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_X, np.array(twitter_test_y))*100)+"%")
clf = MultinomialNB()
clf.fit(twitter_train_bin_X, np.array(twitter_train_bin_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_bin_X, np.array(twitter_test_bin_y))*100)+"%")
The accuracy of the trained classifier is 78.8300835654596%
The accuracy of the trained classifier is 77.99442896935933%
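Since feature_extracter fits the vectorizer on the combined train and test texts, a common alternative worth knowing is a sklearn Pipeline, which fits the vectorizer on the training texts only and reuses that vocabulary for new texts. A minimal sketch with made-up sentences and labels (not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["good great film", "awful boring film", "great acting", "boring plot"]
train_y = [1, 0, 1, 0]

# The pipeline fits the vectorizer on the training texts only,
# then applies the same vocabulary when predicting on new texts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_y)

print(model.predict(["great film", "boring film"]))  # [1 0]
```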
Training a Logistic Regression Model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(twitter_train_X, np.array(twitter_train_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_X, np.array(twitter_test_y))*100)+"%")
clf = LogisticRegression(random_state=0).fit(twitter_train_bin_X, np.array(twitter_train_bin_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_bin_X, np.array(twitter_test_bin_y))*100)+"%")
The accuracy of the trained classifier is 77.15877437325905%
The accuracy of the trained classifier is 77.15877437325905%
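One practical note: with larger feature counts, LogisticRegression may warn that it did not converge within the default max_iter=100; raising max_iter is the usual fix. Unlike naive Bayes's score above, the model also exposes per-class probabilities via predict_proba. A sketch with made-up data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good great film", "awful boring film", "great acting", "boring plot"]
y = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)
# max_iter raised above the default 100 to avoid convergence warnings
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X, y)

# Probability of the positive class for each training text
print(clf.predict_proba(X)[:, 1])
```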
Finding the Best Parameters
Here we use stratified 10-fold cross-validation to find the best parameter combination (number of features and the binary flag).
from sklearn.model_selection import StratifiedKFold
feature_nums = [1000, 2000, 3000, 4000]
bin_flags = [False, True]
for feature_num in feature_nums:
    for bin_flag in bin_flags:
        X, _, y, _ = feature_extracter(twitter_train_df, None, binary_flag=bin_flag, m_features=feature_num, has_test=False)
        # StratifiedKFold expects a 1-d label array
        X, y = np.array(X), np.array(y)
        skf = StratifiedKFold(n_splits=10)
        accuracies = []
        for train_index, test_index in skf.split(X, y):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]
            clf = MultinomialNB()
            clf.fit(X_train, y_train)
            accuracies.append(clf.score(X_test, y_test))
        print("accuracy with "+str(feature_num)+" features, binary feature is "+str(bin_flag)+": "+str(np.mean(accuracies)*100)+"%")
accuracy with 1000 features, binary feature is False: 71.875%
accuracy with 1000 features, binary feature is True: 71.885%
accuracy with 2000 features, binary feature is False: 73.075%
accuracy with 2000 features, binary feature is True: 73.05166666666668%
accuracy with 3000 features, binary feature is False: 73.38833333333334%
accuracy with 3000 features, binary feature is True: 73.44833333333332%
accuracy with 4000 features, binary feature is False: 73.53666666666668%
accuracy with 4000 features, binary feature is True: 73.62333333333333%
As the results show, the best parameter combination is 4000 features with the binary flag set to True.
The accuracy of a model trained with this combination:
train_X, test_X, train_y, test_y= feature_extracter(twitter_train_df, twitter_test_df, binary_flag = True, m_features=4000)
clf = MultinomialNB()
clf.fit(train_X, train_y)
print("The accuracy of the best model is "+str(clf.score(test_X, test_y)*100)+"%")
The accuracy of the best model is 75.4874651810585%
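The manual double loop above can also be expressed with sklearn's GridSearchCV over a Pipeline, which runs the same cross-validated sweep and records the best combination. A sketch with made-up texts and shortened parameter values (2 and 4 stand in for the 1000-4000 feature counts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["good great film", "awful boring film", "great fun", "boring dull plot"] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "vec__max_features": [2, 4],   # stands in for [1000, 2000, 3000, 4000]
    "vec__binary": [False, True],
}
# Each candidate is scored with stratified cross-validation,
# and the vectorizer is refitted inside every fold.
search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5))
search.fit(texts, labels)
print(search.best_params_)
```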