Project Introduction
This post is a second round of optimization of the sentiment analysis system.
- Optimized the data-cleaning code
- Added some simple data visualization
- A simpler way to choose hyperparameters
- Time-complexity improvements to a few functions
Let's walk through the implementation step by step.
1. File Reading: reading the text files
Extract the training and test data from the structured dataset.
import re

def read_train_file(file_path):
    comments = []  # stores the reviews
    labels = []    # stores the labels
    with open(file_path, encoding='utf-8') as file:
        # extract each review and append it to comments
        # (spaces are stripped first, so the tag shows up as <reviewid=...> in the regex)
        text = file.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}">(.*?)</review>'
        result = re.findall(reg, text)
        for r in result:
            comments.append(r)
            if file_path == 'data/train.positive.txt':
                labels.append('1')
            else:
                labels.append('0')
    return comments, labels

def read_test_file(file_path):
    comments = []  # stores the reviews
    labels = []    # stores the labels
    with open(file_path, encoding='utf-8') as file:
        # extract each review together with its label and append them to comments / labels
        text = file.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}".*?</review>'
        result = re.findall(reg, text)
        for r in result:
            label_reg = r'<reviewid="\d{1,4}"label="(\d)">'
            com_reg = r'>(.*?)</review>'
            label = re.findall(label_reg, r)[0]
            comment = re.findall(com_reg, r)[0]
            labels.append(label)
            comments.append(comment)
    return comments, labels

def process_file():
    """
    Read the training and test data and apply some preprocessing.
    """
    train_pos_file = "data/train.positive.txt"
    train_neg_file = "data/train.negative.txt"
    test_comb_file = "data/test.combined.txt"
    # read each file and fill the variables below
    train_pos_comments, train_pos_labels = read_train_file(train_pos_file)
    train_neg_comments, train_neg_labels = read_train_file(train_neg_file)
    test_comments, test_labels = read_test_file(test_comb_file)
    return train_pos_comments, train_pos_labels, train_neg_comments, train_neg_labels, test_comments, test_labels

train_pos_comments, train_pos_labels, train_neg_comments, train_neg_labels, test_comments, test_labels = process_file()
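A quick spot check of what the parser returns helps catch regex mistakes early. A minimal sketch, assuming the files above were parsed successfully:
# sanity check: look at one parsed review and its label from each set
print(train_pos_comments[0], train_pos_labels[0])
print(test_comments[0], test_labels[0])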
2. Exploratory Analysis: some simple visualization
Concatenate the training data (positive and negative) and check the sizes of the training and test sets.
train_comments = train_pos_comments + train_neg_comments
train_labels = train_pos_labels + train_neg_labels
# sizes of the training and test sets
print(len(train_comments), len(test_comments))
Plot the length distribution of the comments with seaborn/matplotlib.
import jieba
import seaborn as sns
%matplotlib inline
# TODO: for the positive and negative training samples, plot one histogram each:
# the x axis is the length of a sample (number of jieba tokens) and the y axis is
# the fraction of samples with that length. Then discuss whether sample length is
# correlated with sentiment (jieba segmentation is needed first).
# Reference: https://en.wikipedia.org/wiki/Histogram
def count_sentence(sentences):
    len_list = []
    for s in sentences:
        tokens = list(jieba.cut(s))
        len_list.append(len(tokens))
    return len_list

sns.distplot(count_sentence(train_pos_comments))  # length distribution of the positive training samples
sns.distplot(count_sentence(train_neg_comments))  # length distribution of the negative training samples
# The two distributions in the resulting figure are almost identical,
# suggesting that sample length is not correlated with sentiment.
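Note that sns.distplot is deprecated in newer seaborn releases. A rough equivalent with histplot, assuming seaborn 0.11 or later; stat='probability' puts a proportion on the y axis, matching the percentage view described above:
# the same two plots with the non-deprecated API
sns.histplot(count_sentence(train_pos_comments), stat='probability', kde=True)
sns.histplot(count_sentence(train_neg_comments), stat='probability', kde=True)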
Look at the 20 most frequent words in the positive and negative samples.
import collections
# TODO: list the top 20 words of the positive and the negative training samples
# separately (with some stop word removal where appropriate).
def get_top20_words(comments):
    word_library = []  # all tokens
    for comment in comments:
        for i in jieba.cut(comment):
            word_library.append(i)
    word_dic = collections.Counter(word_library).most_common(20)
    top20_list = [i[0] for i in word_dic]
    return top20_list

print(get_top20_words(train_pos_comments))
print(get_top20_words(train_neg_comments))
Based on those top-20 lists, build a stop-word list by hand:
stop_words = ['的','了','是','很','我','也','在','买','有','都','就']
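The same idea can also be expressed in code: high-frequency words that show up in the top 20 of both classes carry little sentiment signal, so their intersection is a reasonable starting point for a stop-word list. A minimal sketch (the hand-built list above is what the rest of the post actually uses):
# candidate stop words: frequent words shared by the positive and negative samples
auto_stop_words = set(get_top20_words(train_pos_comments)) & set(get_top20_words(train_neg_comments))
print(auto_stop_words)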
3. Text Cleaning: text preprocessing
Clean the strings in train_comments and test_comments, considering:
- stop-word filtering
- removing special symbols
- removing digits (e.g. prices ...)
def text_preprocessing(comments):
    comments_new = []
    stop_words_set = set(stop_words)  # build the set once instead of on every token
    for comment in comments:
        sentence = ''
        for word in jieba.cut(comment):
            # drop stop words, punctuation and digits
            if word not in stop_words_set and word.isalnum() and not word.isdigit():
                sentence += word + ' '
        comments_new.append(sentence)
    return comments_new

train_comments_new = text_preprocessing(train_comments)
test_comments_new = text_preprocessing(test_comments)
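A before/after comparison is a cheap way to confirm the cleaning behaves as intended; a minimal check on the first training comment:
# the cleaned version should be space-separated tokens with stop words, punctuation and digits removed
print(train_comments[0])
print(train_comments_new[0])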
4. Feature Extraction: extracting features from the text
Extract text features and run a quick check.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# TODO: extract tf-idf features from the text and store them in arrays.
tfid_vec = TfidfVectorizer()
# fit on the cleaned, space-separated comments so the default tokenizer sees the jieba tokens
X_train = tfid_vec.fit_transform(train_comments_new)
y_train = np.array(train_labels)
X_test = tfid_vec.transform(test_comments_new)
y_test = np.array(test_labels)
print(np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))
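Besides the shapes, it is worth peeking at the learned vocabulary. A small check; vocabulary_ maps each token to its column index in the tf-idf matrix:
# vocabulary size and a few example tokens
print(len(tfid_vec.vocabulary_))
print(list(tfid_vec.vocabulary_)[:10])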
5. Modeling: training models and choosing hyperparameters
Train a model with logistic regression:
- evaluation metric: F1-score
- hyperparameters chosen by grid search
- print the best results on the test data (precision, recall and f1-score, reported separately for the positive and negative classes as well as overall)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
parameters = { 'C':np.logspace(-3,3,7)}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
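By default GridSearchCV selects hyperparameters by accuracy. Since the bullets above ask for F1-score, the search can be told to optimize it directly; a sketch using macro-averaged F1 (the labels are the strings '0' and '1', so the macro average avoids having to declare a positive label). The same scoring argument applies to the SVM search below:
# select C by macro-averaged F1 instead of the default accuracy
clf_f1 = GridSearchCV(lr, parameters, cv=5, scoring='f1_macro')
clf_f1.fit(X_train, y_train)
print(clf_f1.best_params_, clf_f1.best_score_)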
Train a model with SVM:
- evaluation metric: F1-score
- hyperparameters chosen by grid search
- print the best results on the test data (precision, recall and f1-score, reported separately for the positive and negative classes as well as overall)
from sklearn import svm
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), 'C':np.logspace(-3,3,7)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
For hyperparameter tuning we usually use grid search, which is also the most common method in industry, but its drawback is the large amount of computation it requires, so this area has become an active research topic in recent years. One classic result is Bayesian Optimization, which uses Bayesian reasoning to search for good hyperparameters; the line of work led by Ryan P. Adams uses a Gaussian process as the posterior distribution over the objective to locate the optimum. Below we try a Bayesian Optimization library to search for the best hyperparameters.
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer releases
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

def svm_cv(C, gamma):
    # C and gamma are searched on a log10 scale
    svm = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(svm, X_train, y_train, cv=5).mean()
    return val

pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)
svm_bo.maximize()
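Once maximize() finishes, the best cross-validation score and the corresponding parameters are available on the optimizer. Since svm_cv searches on a log10 scale, the values have to be mapped back:
# best score and (log10-scaled) parameters found by the search
print(svm_bo.max)
best = svm_bo.max['params']
print('C =', 10 ** best['C'], 'gamma =', 10 ** best['gamma'])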
Summary
The dataset only received fairly simple processing, so there is still plenty of room for improvement. The code and training data have been uploaded to GitHub; if you have questions, please open an Issue or leave a comment below this post.