NLP (VI): Text Sentiment Classification with sklearn (Part 1)

In this installment we use sklearn to train classification models for sentiment classification of text data.

Getting the Data

I have uploaded the data for this post; it can be downloaded from the links below:
Training set: https://download.csdn.net/download/swy_swy_swy/87581709
Test set: https://download.csdn.net/download/swy_swy_swy/87581702
The labels are binary: 0 for negative sentiment, 1 for positive sentiment.

Loading the Data

We use pandas to read the CSV files.

import pandas as pd
import numpy as np

def csv_loader(filepath):
  # Read a CSV file into a DataFrame
  return pd.read_csv(filepath)

twitter_train_df = csv_loader('sentiment-train.csv')
twitter_test_df = csv_loader('sentiment-test.csv')
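
As a quick sanity check (a sketch; it assumes the 'text' and 'sentiment' columns used throughout this post), we can inspect the first rows and the label balance:

print(twitter_train_df.head())
print(twitter_train_df['sentiment'].value_counts())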

Text Vectorization

How does sklearn "read" a text? It does not understand human language; to it, a piece of text is simply a collection of words, a "bag of words". Before training a model, all texts must be "vectorized": every distinct word is assigned an index in a shared vocabulary (the same word gets the same index across all samples in the dataset), and each text becomes a vector whose entries count how often each vocabulary word occurs in it.
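
A tiny made-up example makes this concrete (a sketch, not from the post's dataset):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["good movie good plot", "bad movie"]
v = CountVectorizer()
m = v.fit_transform(toy)
print(v.vocabulary_)   # {'bad': 0, 'good': 1, 'movie': 2, 'plot': 3}
print(m.toarray())     # [[0 2 1 1]
                       #  [1 0 1 0]]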

We use sklearn's CountVectorizer for the vectorization:

from sklearn.feature_extraction.text import CountVectorizer

def feature_extracter(train_df, test_df, binary_flag=False, m_features=1000, has_test=True):
  # Fit one vocabulary over train and test texts together so both share the same feature indices
  vectorizer = CountVectorizer(stop_words='english', max_features=m_features, binary=binary_flag)
  train_texts = train_df['text'].tolist()
  test_texts = []
  if has_test:
    test_texts = test_df['text'].tolist()
  vecs = vectorizer.fit_transform(train_texts + test_texts).toarray()
  train_X = vecs[:len(train_texts)]
  test_X = []
  if has_test:
    test_X = vecs[len(train_texts):]
  train_y = train_df['sentiment'].tolist()
  test_y = []
  if has_test:
    test_y = test_df['sentiment'].tolist()
  return train_X, test_X, train_y, test_y

twitter_train_X, twitter_test_X, twitter_train_y, twitter_test_y = feature_extracter(twitter_train_df, twitter_test_df)

twitter_train_bin_X, twitter_test_bin_X, twitter_train_bin_y, twitter_test_bin_y = feature_extracter(twitter_train_df, twitter_test_df, binary_flag=True)
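
A quick shape check (a sketch) confirms that each tweet is now a fixed-length count vector:

print(twitter_train_X.shape)   # (n_train_samples, 1000)
print(twitter_test_X.shape)    # (n_test_samples, 1000)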

Training a Naive Bayes Model

After vectorization we can train a multinomial naive Bayes model and check its accuracy on the test set.

from sklearn.naive_bayes import MultinomialNB

# Count features
clf = MultinomialNB()
clf.fit(twitter_train_X, np.array(twitter_train_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_X, np.array(twitter_test_y))*100)+"%")

# Binary (presence/absence) features
clf = MultinomialNB()
clf.fit(twitter_train_bin_X, np.array(twitter_train_bin_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_bin_X, np.array(twitter_test_bin_y))*100)+"%")
The accuracy of the trained classifier is 78.8300835654596%
The accuracy of the trained classifier is 77.99442896935933%
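
Beyond the aggregate score, the fitted classifier can of course label individual samples; a minimal sketch on the binary-feature model fitted above:

print(clf.predict(twitter_test_bin_X[:5]))   # predicted labels for the first five test tweets
print(twitter_test_bin_y[:5])                # true labels for comparison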

Training a Logistic Regression Model

We can fit a logistic regression classifier on the same features:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(twitter_train_X, np.array(twitter_train_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_X, np.array(twitter_test_y))*100)+"%")

clf = LogisticRegression(random_state=0).fit(twitter_train_bin_X, np.array(twitter_train_bin_y))
print("The accuracy of the trained classifier is "+str(clf.score(twitter_test_bin_X, np.array(twitter_test_bin_y))*100)+"%")
The accuracy of the trained classifier is 77.15877437325905%
The accuracy of the trained classifier is 77.15877437325905%
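
Logistic regression additionally exposes class probabilities via predict_proba; a short sketch using the model just fitted:

# Probability of class 0 (negative) and class 1 (positive) for the first three test tweets
print(clf.predict_proba(twitter_test_bin_X[:3]))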

Finding the Best Parameters

Here we use stratified 10-fold cross-validation to find the best combination of parameters (number of features and the binary flag):

from sklearn.model_selection import StratifiedKFold

feature_nums = [1000, 2000, 3000, 4000]
bin_flags = [False, True]
for feature_num in feature_nums:
  for bin_flag in bin_flags:
    # Vectorize the training set only; no test split is needed here
    X, _, y, _ = feature_extracter(twitter_train_df, None, binary_flag=bin_flag, m_features=feature_num, has_test=False)
    y = np.array(y)
    skf = StratifiedKFold(n_splits=10)
    accuracies = []
    for train_index, test_index in skf.split(X, y):
      X_train, X_test = X[train_index], X[test_index]
      y_train, y_test = y[train_index], y[test_index]
      clf = MultinomialNB()
      clf.fit(X_train, y_train)
      accuracies.append(clf.score(X_test, y_test))
    print("accuracy with "+str(feature_num)+" features, binary feature is "+str(bin_flag)+": "+str(np.mean(accuracies)*100)+"%")
accuracy with 1000 features, binary feature is False: 71.875%
accuracy with 1000 features, binary feature is True: 71.885%
accuracy with 2000 features, binary feature is False: 73.075%
accuracy with 2000 features, binary feature is True: 73.05166666666668%
accuracy with 3000 features, binary feature is False: 73.38833333333334%
accuracy with 3000 features, binary feature is True: 73.44833333333332%
accuracy with 4000 features, binary feature is False: 73.53666666666668%
accuracy with 4000 features, binary feature is True: 73.62333333333333%
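For reference, the same stratified 10-fold estimate can be written more compactly with sklearn's cross_val_score (a sketch; vectorization is still done once up front, as above):

from sklearn.model_selection import cross_val_score

X, _, y, _ = feature_extracter(twitter_train_df, None, binary_flag=True, m_features=4000, has_test=False)
# With a classifier and an integer cv, cross_val_score uses StratifiedKFold internally
scores = cross_val_score(MultinomialNB(), X, np.array(y), cv=10)
print(np.mean(scores)*100)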

As we can see, the best combination is 4000 features with the binary flag set to True.
The accuracy of a model trained with this combination is as follows:

train_X, test_X, train_y, test_y = feature_extracter(twitter_train_df, twitter_test_df, binary_flag=True, m_features=4000)

clf = MultinomialNB()
clf.fit(train_X, train_y)
print("The accuracy of the best model is "+str(clf.score(test_X, test_y)*100)+"%")
The accuracy of the best model is 75.4874651810585%
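
Accuracy alone can hide asymmetries between the two classes; a short sketch of per-class precision and recall with sklearn's classification_report:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1 for the best model (0 = negative, 1 = positive)
print(classification_report(test_y, clf.predict(test_X)))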