Project Introduction
This post is a second round of optimization of the sentiment analysis system.
- Optimized the data-cleaning code
- Added some simple data visualization
- A simpler way to choose hyperparameters
- Time-complexity improvements to a few functions
Let's walk through the implementation step by step.
1. File Reading: reading the text files
Extract the training and test data from the structured dataset.
import re

def read_train_file(file_path):
    comments = []  # stores the reviews
    labels = []    # stores the labels
    with open(file_path, encoding='utf-8') as file:
        # extract each review and append it to comments
        # (spaces are stripped first, so the tag shows up as <reviewid=...> in the regex)
        text = file.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}">(.*?)</review>'
        result = re.findall(reg, text)
        for r in result:
            comments.append(r)
            if file_path == 'data/train.positive.txt':
                labels.append('1')
            else:
                labels.append('0')
    return comments, labels

def read_test_file(file_path):
    comments = []  # stores the reviews
    labels = []    # stores the labels
    with open(file_path, encoding='utf-8') as file:
        # extract each review together with its label and append them to comments / labels
        text = file.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}".*?</review>'
        result = re.findall(reg, text)
        for r in result:
            label_reg = r'<reviewid="\d{1,4}"label="(\d)">'
            com_reg = r'>(.*?)</review>'
            label = re.findall(label_reg, r)[0]
            comment = re.findall(com_reg, r)[0]
            labels.append(label)
            comments.append(comment)
    return comments, labels

def process_file():
    """
    Read the training and test data and apply some preprocessing.
    """
    train_pos_file = "data/train.positive.txt"
    train_neg_file = "data/train.negative.txt"
    test_comb_file = "data/test.combined.txt"
    # read each file and fill the variables below
    train_pos_comments, train_pos_labels = read_train_file(train_pos_file)
    train_neg_comments, train_neg_labels = read_train_file(train_neg_file)
    test_comments, test_labels = read_test_file(test_comb_file)
    return train_pos_comments, train_pos_labels, train_neg_comments, train_neg_labels, test_comments, test_labels

train_pos_comments, train_pos_labels, train_neg_comments, train_neg_labels, test_comments, test_labels = process_file()
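A quick spot check of what the parser returns helps catch regex mistakes early. A minimal sketch, assuming the files above were parsed successfully:
# sanity check: look at one parsed review and its label from each set
print(train_pos_comments[0], train_pos_labels[0])
print(test_comments[0], test_labels[0])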
2. Exploratory Analysis: some simple visualization
Concatenate the training data (positive and negative) and check the sizes of the training and test sets.
train_comments = train_pos_comments + train_neg_comments
train_labels = train_pos_labels + train_neg_labels
# sizes of the training and test sets
print(len(train_comments), len(test_comments))
Plot the length distribution of the comments with seaborn/matplotlib.
import jieba
import seaborn as sns
%matplotlib inline
# TODO: for the positive and negative training samples, plot one histogram each:
# the x axis is the length of a sample (number of jieba tokens) and the y axis is
# the fraction of samples with that length. Then discuss whether sample length is
# correlated with sentiment (jieba segmentation is needed first).
# Reference: https://en.wikipedia.org/wiki/Histogram
def count_sentence(sentences):
    len_list = []
    for s in sentences:
        tokens = list(jieba.cut(s))
        len_list.append(len(tokens))
    return len_list

sns.distplot(count_sentence(train_pos_comments))  # length distribution of the positive training samples
sns.distplot(count_sentence(train_neg_comments))  # length distribution of the negative training samples
# The two distributions in the resulting figure are almost identical,
# suggesting that sample length is not correlated with sentiment.
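Note that sns.distplot is deprecated in newer seaborn releases. A rough equivalent with histplot, assuming seaborn 0.11 or later; stat='probability' puts a proportion on the y axis, matching the percentage view described above:
# the same two plots with the non-deprecated API
sns.histplot(count_sentence(train_pos_comments), stat='probability', kde=True)
sns.histplot(count_sentence(train_neg_comments), stat='probability', kde=True)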
Look at the 20 most frequent words in the positive and negative samples.
import collections
# TODO: list the top 20 words of the positive and the negative training samples
# separately (with some stop word removal where appropriate).
def get_top20_words(comments):
    word_library = []  # all tokens
    for comment in comments:
        for i in jieba.cut(comment):
            word_library.append(i)
    word_dic = collections.Counter(word_library).most_common(20)
    top20_list = [i[0] for i in word_dic]
    return top20_list

print(get_top20_words(train_pos_comments))
print(get_top20_words(train_neg_comments))
Based on those top-20 lists, build a stop-word list by hand:
stop_words = ['的','了','是','很','我','也','在','买','有','都','就']
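The same idea can also be expressed in code: high-frequency words that show up in the top 20 of both classes carry little sentiment signal, so their intersection is a reasonable starting point for a stop-word list. A minimal sketch (the hand-built list above is what the rest of the post actually uses):
# candidate stop words: frequent words shared by the positive and negative samples
auto_stop_words = set(get_top20_words(train_pos_comments)) & set(get_top20_words(train_neg_comments))
print(auto_stop_words)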
3. Text Cleaning: text preprocessing
Clean the strings in train_comments and test_comments, considering:
- stop-word filtering
- removing special symbols
- removing digits (e.g. prices ...)
def text_preprocessing(comments):
    comments_new = []
    stop_words_set = set(stop_words)  # build the set once instead of on every token
    for comment in comments:
        sentence = ''
        for word in jieba.cut(comment):
            # drop stop words, punctuation and digits
            if word not in stop_words_set and word.isalnum() and not word.isdigit():
                sentence += word + ' '
        comments_new.append(sentence)
    return comments_new

train_comments_new = text_preprocessing(train_comments)
test_comments_new = text_preprocessing(test_comments)
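A before/after comparison is a cheap way to confirm the cleaning behaves as intended; a minimal check on the first training comment:
# the cleaned version should be space-separated tokens with stop words, punctuation and digits removed
print(train_comments[0])
print(train_comments_new[0])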
4. Feature Extraction: extracting features from the text
Extract text features and run a quick check.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# TODO: extract tf-idf features from the text and store them in arrays.
tfid_vec = TfidfVectorizer()
# fit on the cleaned, space-separated comments so the default tokenizer sees the jieba tokens
X_train = tfid_vec.fit_transform(train_comments_new)
y_train = np.array(train_labels)
X_test = tfid_vec.transform(test_comments_new)
y_test = np.array(test_labels)
print(np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))
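Besides the shapes, it is worth peeking at the learned vocabulary. A small check; vocabulary_ maps each token to its column index in the tf-idf matrix:
# vocabulary size and a few example tokens
print(len(tfid_vec.vocabulary_))
print(list(tfid_vec.vocabulary_)[:10])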
5. Modeling: training models and choosing hyperparameters
Train a model with logistic regression:
- evaluation metric: F1-score
- hyperparameters chosen by grid search
- print the best results on the test data (precision, recall and f1-score, reported separately for the positive and negative classes as well as overall)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
parameters = { 'C':np.logspace(-3,3,7)}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
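By default GridSearchCV selects hyperparameters by accuracy. Since the bullets above ask for F1-score, the search can be told to optimize it directly; a sketch using macro-averaged F1 (the labels are the strings '0' and '1', so the macro average avoids having to declare a positive label). The same scoring argument applies to the SVM search below:
# select C by macro-averaged F1 instead of the default accuracy
clf_f1 = GridSearchCV(lr, parameters, cv=5, scoring='f1_macro')
clf_f1.fit(X_train, y_train)
print(clf_f1.best_params_, clf_f1.best_score_)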
Train a model with SVM:
- evaluation metric: F1-score
- hyperparameters chosen by grid search
- print the best results on the test data (precision, recall and f1-score, reported separately for the positive and negative classes as well as overall)
from sklearn import svm
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), 'C':np.logspace(-3,3,7)}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
For hyperparameter tuning we usually use grid search, which is also the most common method in industry, but its drawback is the large amount of computation it requires, so this area has become an active research topic in recent years. One classic result is Bayesian Optimization, which uses Bayesian reasoning to search for good hyperparameters; the line of work led by Ryan P. Adams uses a Gaussian process as the posterior distribution over the objective to locate the optimum. Below we try a Bayesian Optimization library to search for the best hyperparameters.
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer releases
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

def svm_cv(C, gamma):
    # C and gamma are searched on a log10 scale
    svm = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(svm, X_train, y_train, cv=5).mean()
    return val

pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)
svm_bo.maximize()
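Once maximize() finishes, the best cross-validation score and the corresponding parameters are available on the optimizer. Since svm_cv searches on a log10 scale, the values have to be mapped back:
# best score and (log10-scaled) parameters found by the search
print(svm_bo.max)
best = svm_bo.max['params']
print('C =', 10 ** best['C'], 'gamma =', 10 ** best['gamma'])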
Summary
The dataset only received fairly simple processing, so there is still plenty of room for improvement. The code and training data have been uploaded to GitHub; if you have questions, please open an Issue or leave a comment below this post.