Sentiment Analysis Project

1. Model Theory and Application

The following questions are classics, and working through them helps build a much deeper understanding of the model. In particular, the derivation of the second derivative of the logistic regression loss can be used to prove whether a function is convex.

1.1 Logistic Regression

Suppose we have training data $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$, where each $(\mathbf{x}_i,y_i)$ is one sample: $\mathbf{x}_i\in\mathcal{R}^D$ is the feature vector and $y_i$ is the sample's label, taking the value $0$ or $1$. In logistic regression the model parameters are $(\mathbf{w},b)$; vectors are written in bold. Answer the following questions.
(a) Under the logistic regression model, write down the objective function to be minimized (also called the loss function); regularization can be ignored.

$$L(\mathbf{w},b) = -\sum_{i=1}^n \Big[\, y_i\log\sigma(\mathbf{w}^T\mathbf{x}_i+b) + (1-y_i)\log\big(1-\sigma(\mathbf{w}^T\mathbf{x}_i+b)\big) \Big]$$

and the parameters are found as $(\mathbf{w}^*,b^*)=\operatorname{argmin}_{\mathbf{w},b}\,L(\mathbf{w},b)$.
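
This is the negative log-likelihood of a Bernoulli model: since $p(y_i\mid\mathbf{x}_i)=\sigma(\mathbf{w}^T\mathbf{x}_i+b)^{y_i}\,[1-\sigma(\mathbf{w}^T\mathbf{x}_i+b)]^{1-y_i}$, taking $-\log$ of the likelihood $\prod_{i=1}^n p(y_i\mid\mathbf{x}_i)$ gives exactly the $L(\mathbf{w},b)$ above.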

(b) Derive the gradient of $L(\mathbf{w},b)$, showing the necessary intermediate steps.

Using $\sigma'(z)=\sigma(z)[1-\sigma(z)]$ and writing $\sigma_i=\sigma(\mathbf{w}^T\mathbf{x}_i+b)$:

$$
\begin{aligned}
\frac{\partial L(\mathbf{w},b)}{\partial \mathbf{w}}
&= -\sum_{i=1}^n y_i\frac{\sigma_i(1-\sigma_i)\,\mathbf{x}_i}{\sigma_i} + (1-y_i)\frac{-\sigma_i(1-\sigma_i)\,\mathbf{x}_i}{1-\sigma_i} \\
&= -\sum_{i=1}^n y_i(1-\sigma_i)\,\mathbf{x}_i + (y_i-1)\,\sigma_i\,\mathbf{x}_i \\
&= -\sum_{i=1}^n \big(y_i - y_i\sigma_i + y_i\sigma_i - \sigma_i\big)\,\mathbf{x}_i \\
&= \sum_{i=1}^n \big(\sigma_i - y_i\big)\,\mathbf{x}_i
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial L(\mathbf{w},b)}{\partial b}
&= -\sum_{i=1}^n y_i\frac{\sigma_i(1-\sigma_i)}{\sigma_i} + (1-y_i)\frac{-\sigma_i(1-\sigma_i)}{1-\sigma_i} \\
&= -\sum_{i=1}^n y_i(1-\sigma_i) - (1-y_i)\,\sigma_i \\
&= \sum_{i=1}^n \big(\sigma_i - y_i\big)
\end{aligned}
$$
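
As a quick sanity check of these formulas (a minimal sketch on random data, not part of the original derivation), the closed-form gradients can be compared against central finite differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20)
w, b = rng.normal(size=5), 0.3

# Closed-form gradients from (b)
p = sigmoid(X @ w + b)
grad_w = X.T @ (p - y)
grad_b = np.sum(p - y)

# Central finite differences
eps = 1e-6
E = np.eye(5)
num_w = np.array([(loss(w + eps * E[j], b, X, y) - loss(w - eps * E[j], b, X, y)) / (2 * eps)
                  for j in range(5)])
num_b = (loss(w, b + eps, X, y) - loss(w, b - eps, X, y)) / (2 * eps)
print(np.allclose(grad_w, num_w), np.isclose(grad_b, num_b))   # expect: True True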

(c) Write out the batch gradient descent updates for $\mathbf{w}$ and $b$:

$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t\sum_{i=1}^n\big[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i\big]\mathbf{x}_i$$

$$b^{t+1} = b^t - \eta_t\sum_{i=1}^n\big[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i\big]$$

(d) Now add an L2 regularizer $\lambda\lVert\mathbf{w}\rVert_2^2$ to the objective in (a), and write out the batch gradient descent updates for $\mathbf{w}$ and $b$. Only $\mathbf{w}$ is penalized, so the update for $b$ is unchanged:

$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t\Big(\sum_{i=1}^n\big[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i\big]\mathbf{x}_i + 2\lambda\mathbf{w}^t\Big)$$

$$b^{t+1} = b^t - \eta_t\sum_{i=1}^n\big[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i\big]$$

(e) Building on (b), differentiate with respect to $\mathbf{w}$ once more (the second derivative; with respect to $\mathbf{w}$ alone it has dimension $D\times D$). This second derivative is known as the Hessian matrix (https://en.wikipedia.org/wiki/Hessian_matrix). For derivatives with respect to matrices and vectors, see the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

With $\sigma_i=\sigma(\mathbf{w}^T\mathbf{x}_i+b)$ (note the $\mathbf{w}\mathbf{w}$ block is the outer product $\mathbf{x}_i\mathbf{x}_i^T$, a $D\times D$ matrix, not the inner product $\mathbf{x}_i^T\mathbf{x}_i$):

$$\frac{\partial^2 L}{\partial\mathbf{w}^2} = \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i\mathbf{x}_i^T$$

$$\frac{\partial^2 L}{\partial\mathbf{w}\,\partial b} = \frac{\partial^2 L}{\partial b\,\partial\mathbf{w}} = \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i$$

$$\frac{\partial^2 L}{\partial b^2} = \sum_{i=1}^n \sigma_i(1-\sigma_i)$$

Stacking these blocks gives the full $(D+1)\times(D+1)$ Hessian:

$$H = \begin{pmatrix} \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i\mathbf{x}_i^T & \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i \\ \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i^T & \sum_{i=1}^n \sigma_i(1-\sigma_i) \end{pmatrix}$$
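
More compactly, writing $\mathbf{z}_i = \begin{pmatrix}\mathbf{x}_i\\ 1\end{pmatrix}$, the Hessian is $H = \sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{z}_i\mathbf{z}_i^T$, a nonnegatively weighted sum of outer products; this is the structure exploited in (f).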

(f) Show that the Hessian matrix obtained in (e) is positive semidefinite. Hint: to prove that a matrix $H$ is positive semidefinite, one must show that for any nonzero vector $v\in\mathcal{R}^{D+1}$, $v^THv \ge 0$.

Show your derivation or reasoning:

Proof:

Partition $v = \begin{pmatrix}\mathbf{u}\\ c\end{pmatrix}$ with $\mathbf{u}\in\mathcal{R}^D$ and $c\in\mathcal{R}$. Then, with $\sigma_i = \sigma(\mathbf{w}^T\mathbf{x}_i+b)$,

$$
\begin{aligned}
v^THv &= \mathbf{u}^T\Big(\sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i\mathbf{x}_i^T\Big)\mathbf{u}
 + 2c\,\mathbf{u}^T\sum_{i=1}^n \sigma_i(1-\sigma_i)\,\mathbf{x}_i
 + c^2\sum_{i=1}^n \sigma_i(1-\sigma_i) \\
&= \sum_{i=1}^n \sigma_i(1-\sigma_i)\big[(\mathbf{u}^T\mathbf{x}_i)^2 + 2c\,\mathbf{u}^T\mathbf{x}_i + c^2\big] \\
&= \sum_{i=1}^n \sigma_i(1-\sigma_i)\,(\mathbf{u}^T\mathbf{x}_i + c)^2
\end{aligned}
$$

Since $\sigma_i\in(0,1)$, both $\sigma_i$ and $1-\sigma_i$ are nonnegative, and $(\mathbf{u}^T\mathbf{x}_i + c)^2 \ge 0$, every term of the sum is nonnegative.

Therefore $v^THv \ge 0$, and $H$ is positive semidefinite.
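
The same fact can be checked numerically (a minimal sketch on random data; the smallest eigenvalue of H should be nonnegative up to rounding error):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
w, b = rng.normal(size=4), 0.1

# H = sum_i sigma_i (1 - sigma_i) z_i z_i^T with z_i = [x_i; 1]
Z = np.hstack([X, np.ones((50, 1))])
s = sigmoid(X @ w + b)
H = (Z * (s * (1 - s))[:, None]).T @ Z

print(np.linalg.eigvalsh(H).min() >= -1e-10)   # expect: True (PSD up to rounding)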

2. Sentiment Analysis Project

Reading the text files

import re
import jieba
import numpy as np


# Parse one training file: pull out each <review>…</review> block,
# segment the text with jieba, and record the given label for every review.
# Note: the comments/labels lists are passed in by the caller and mutated in place.
def read_train_file(file_path='', comments=[], labels=[], val=''):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}">(.*?)</review>'
        results = re.findall(reg, text)
        for result in results:
            result = ','.join(jieba.cut(result))
            comments.append(result)
            labels.append(val)


# Parse the combined test file, where each review carries its own label attribute.
def read_test_file(file_path='', comments=[], labels=[]):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read().replace(' ', '').replace('\n', '')
        reg = r'<reviewid="\d{1,4}".*?</review>'
        results = re.findall(reg, text)
        for result in results:
            label_reg = r'<reviewid="\d{1,4}"label="(\d)">'
            com_reg = r'>(.*?)</review>'
            label = re.findall(label_reg, result)[0]
            comment = re.findall(com_reg, result)[0]
            labels.append(label)
            comment = ','.join(jieba.cut(comment))
            comments.append(comment)
    assert(len(comments) == len(labels))


# TODO: read the files and store their contents in the variables below
train_comments = []
train_labels = []
test_comments = []
test_labels = []
def process_file():
    """
    Read the training and test data and apply some preprocessing.
    """
    train_pos_file = "data/train.positive.txt"
    train_neg_file = "data/train.negative.txt"
    test_comb_file = "data/test.combined.txt"

    # Read the positive reviews
    read_train_file(train_pos_file, train_comments, train_labels, '1')
    # Read the negative reviews
    read_train_file(train_neg_file, train_comments, train_labels, '0')
    # Read the test data
    read_test_file(test_comb_file, test_comments, test_labels)

process_file()
print(len(train_comments), len(train_labels), len(test_comments), len(test_labels))
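
For reference, the regular expressions above assume the raw files look roughly like the following (a hypothetical excerpt, not taken from the actual data). Because spaces and newlines are stripped before matching, the patterns read <reviewid= rather than <review id=:

<review id="1">
手机很好用,物流也很快
</review>

In test.combined.txt, each review additionally carries its label as an attribute, e.g. <review id="1" label="1">…</review>.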

Simple visualization analysis

import matplotlib.pyplot as plt
import numpy as np

pos_comments_count = []
neg_comments_count = []
pos_train_comments = []
neg_train_comments = []


# Split the training comments by label and record each comment's length
def get_comments():
    for flag, comment in zip(train_labels, train_comments):
        length = len(comment)
        if flag == '1':
            pos_comments_count.append(length)
            pos_train_comments.append(comment)
        else:
            neg_comments_count.append(length)
            neg_train_comments.append(comment)

get_comments()
pos_total_count = len(pos_comments_count)
neg_total_count = len(neg_comments_count)
print(pos_total_count, neg_total_count)

# For each comment length, compute the fraction of comments with that length
def cal_statics(comments_count=[]):
    temp_dict = {}
    total_num = len(comments_count)
    for length in comments_count:
        temp_dict[length] = temp_dict.get(length, 0) + 1

    # Normalize counts to fractions
    for key in temp_dict:
        temp_dict[key] = temp_dict[key]/total_num

    return temp_dict


pos_statics = cal_statics(pos_comments_count)
neg_statics = cal_statics(neg_comments_count)
print(len(pos_statics), len(neg_statics))

# Sort by comment length
pos_statics = dict(sorted(pos_statics.items(), key=lambda x: x[0]))
neg_statics = dict(sorted(neg_statics.items(), key=lambda x: x[0]))
# Plot length histograms for the positive and negative samples
pos_x = list(pos_statics.keys())
pos_y = list(pos_statics.values())
neg_x = list(neg_statics.keys())
neg_y = list(neg_statics.values())

fig = plt.figure()
plt.bar(pos_x, pos_y, 1, color="red")
plt.xlabel("comment length (characters)")
plt.ylabel("fraction of comments")
plt.title("positive comments histogram")

fig = plt.figure()
plt.bar(neg_x, neg_y, 1, color="green")
plt.xlabel("comment length (characters)")
plt.ylabel("fraction of comments")
plt.title("negative comments histogram")
plt.show()
import collections
import jieba

def get_top20_words(comments=[]):
    word_library = []   # every token across all comments
    for comment in comments:
        for i in jieba.cut(comment):
            word_library.append(i)
    word_dic = collections.Counter(word_library).most_common(20)
    top20_list = [i[0] for i in word_dic]
    return top20_list


pos_top20_words = get_top20_words(pos_train_comments)
neg_top20_words = get_top20_words(neg_train_comments)
print('pos_top20_words:' + str(pos_top20_words))
print('neg_top20_words:' + str(neg_top20_words))

# Treat words that appear in both the positive and negative top-20 lists as stop words
stop_words = []
for word in pos_top20_words:
    if word in neg_top20_words and word.isalnum():
        stop_words.append(word)
print('stop_words:' + str(stop_words))
pos_top20_words:[',', ',', '的', '。', '了', '是', '!', '很', '我', '也', '在', '有', '~', '都', '好', '.', '不错', '就', '买', '这']
neg_top20_words:[',', ',', '的', '。', '了', '!', '是', '我', '不', '买', '就', '也', '都', '很', '有', '在', '?', '没有', '!', '.']
stop_words:['的', '了', '是', '很', '我', '也', '在', '有', '都', '就', '买']

Text preprocessing

import string

def text_preprocessing(comments=[]):
    new_comments = []
    for comment in comments:
        new_sentence = ''
        for word in jieba.cut(comment):
            # Drop stop words, punctuation, and pure digits
            if word not in stop_words and word.isalnum() and not word.isdigit():
                new_sentence += word
        new_comments.append(new_sentence)
    return new_comments

train_comments_new = text_preprocessing(train_comments)
test_comments_new = text_preprocessing(test_comments)
print(len(train_comments_new), len(test_comments_new))

Extracting features from the text

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Note: this vectorizes the raw comma-joined comments; the train_comments_new /
# test_comments_new produced by the preprocessing step above are not used here.
X_train = vectorizer.fit_transform(train_comments)      # training features
y_train = np.array(train_labels)                        # training labels
X_test  = vectorizer.transform(test_comments)           # test features
y_test  = np.array(test_labels)                         # test labels

print(np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))
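
Note that TfidfVectorizer tokenizes with the default token_pattern r"(?u)\b\w\w+\b", which keeps only tokens of two or more word characters: single-character words such as 很 or 了 (and the comma separators inserted by jieba above) never enter the vocabulary. A small illustration on made-up strings (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.feature_extraction.text import TfidfVectorizer

demo = TfidfVectorizer()   # default token_pattern keeps only tokens of length >= 2
demo.fit(['很,不错,的,手机', '太差,了'])
print(demo.get_feature_names_out())   # ['不错' '太差' '手机'] -- single characters dropped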

Training models and choosing suitable hyperparameters

Training a logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Strip non-alphanumeric characters, then segment with jieba
def process_text(text=''):
    text = ''.join(e for e in text if e.isalnum())
    return ', '.join(jieba.cut(text))


parameters = { 'C': np.logspace(-3, 3, 7)}
lr = LogisticRegression(solver='liblinear')
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))

# clf = LogisticRegression(C=1.0).fit(X_train, y_train)
# Accuracy on the training data
print("Training accuracy: " + str(clf.score(X_train, y_train)))

# Accuracy on the test data
print("Test accuracy: " + str(clf.score(X_test, y_test)))

test_comment1 = '这个宝贝还是比较不错滴'
test_comment2 = '很不好,太差了'

test = []
test.append(process_text(test_comment2))
print(test)
print(vectorizer.transform(test))
print(clf.predict(vectorizer.transform(test)))
{'C': 1.0}
              precision    recall  f1-score   support

           0       0.86      0.54      0.66      1250
           1       0.67      0.91      0.77      1250

   micro avg       0.73      0.73      0.73      2500
   macro avg       0.76      0.73      0.72      2500
weighted avg       0.76      0.73      0.72      2500

Training accuracy: 0.8721636701797892
Test accuracy: 0.7268
['很, 不好, 太差, 了']
  (0, 10188)	0.8064523512198745
  (0, 3669)	0.591299082708519
['0']

Training an SVM model

from sklearn import svm
# Train an SVM, grid-searching over the kernel and C
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), 'C':np.logspace(-3, 3, 7)}
svc = svm.SVC(gamma='scale')
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
{'C': 1.0, 'kernel': 'sigmoid'}
              precision    recall  f1-score   support

           0       0.85      0.59      0.70      1250
           1       0.69      0.89      0.78      1250

   micro avg       0.74      0.74      0.74      2500
   macro avg       0.77      0.74      0.74      2500
weighted avg       0.77      0.74      0.74      2500

Still using the SVM model, but now using Bayesian Optimization to find the best hyperparameters

from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

# Objective for Bayesian optimization: mean 5-fold CV accuracy of an RBF SVM.
# C and gamma are both searched on a log10 scale.
def svm_cv(C, gamma):
    model = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(model, X_train, y_train, cv=5).mean()
    return val

pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)

svm_bo.maximize()
|   iter    |  target   |     C     |   gamma   |
-------------------------------------------------
|  1        |  0.6206   |  0.3705   |  6.928    |
|  2        |  0.6206   |  0.9682   |  4.705    |
|  3        |  0.6206   |  0.7015   |  7.333    |
|  4        |  0.6206   |  0.5141   |  12.72    |
|  5        |  0.6206   |  0.6732   |  6.483    |
|  6        |  0.6206   |  0.6284   |  19.99    |
|  7        |  0.6206   |  0.04032  |  19.99    |
|  8        |  0.6208   |  0.8602   |  2.037    |
|  9        |  0.6206   |  0.1939   |  20.0     |
|  10       |  0.6208   |  0.2209   |  2.017    |
|  11       |  0.6206   |  0.8674   |  20.0     |
|  12       |  0.6208   |  0.58     |  2.033    |
|  13       |  0.6208   |  0.859    |  2.009    |
|  14       |  0.6206   |  0.9947   |  19.94    |
|  15       |  0.6208   |  0.06059  |  2.017    |
|  16       |  0.6206   |  0.2054   |  19.95    |
|  17       |  0.6208   |  0.8543   |  2.084    |
|  18       |  0.6208   |  0.103    |  2.021    |
|  19       |  0.6206   |  0.8894   |  19.97    |
|  20       |  0.6208   |  0.3986   |  2.015    |
|  21       |  0.6208   |  0.4768   |  2.015    |
|  22       |  0.6206   |  0.2138   |  19.98    |
|  23       |  0.6208   |  0.3656   |  2.135    |
|  24       |  0.6208   |  0.4324   |  2.012    |
|  25       |  0.6206   |  0.06989  |  20.0     |
|  26       |  0.6208   |  0.5922   |  2.132    |
|  27       |  0.6208   |  0.9851   |  2.101    |
|  28       |  0.6208   |  0.1901   |  2.146    |
|  29       |  0.6206   |  0.2751   |  20.0     |
|  30       |  0.6208   |  0.5422   |  2.036    |
=================================================
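
The target column above is essentially flat at ≈0.62, which suggests the search range rather than the optimizer is at fault: with gamma = 10 ** gamma inside svm_cv and pbounds gamma ∈ (2, 20), the RBF gamma ranges from 10² to 10²⁰, far too large for sparse TF-IDF features. A minimal sketch with narrower log-scale bounds (the exact ranges here are an assumption, not from the original run):

# Hypothetical narrower log10-scale search: C in [1e-1, 1e3], gamma in [1e-3, 1e1]
pbounds_alt = {'C': (-1, 3), 'gamma': (-3, 1)}
svm_bo_alt = BayesianOptimization(svm_cv, pbounds=pbounds_alt)
svm_bo_alt.maximize(init_points=5, n_iter=15)
print(svm_bo_alt.max)   # best cross-validation score and its parameters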

Features: adding n-gram features

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_comments)  # features after adding bigrams
y_train = np.array(train_labels)                    # training labels
X_test = vectorizer.transform(test_comments)        # features after adding bigrams
y_test = np.array(test_labels)                      # test labels

print (np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))

Training a logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report


# Strip non-alphanumeric characters, then segment with jieba
def process_text(text=''):
    text = ''.join(e for e in text if e.isalnum())
    return ', '.join(jieba.cut(text))


parameters = { 'C': np.logspace(-3, 3, 7)}
lr = LogisticRegression(solver='liblinear')
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))

# clf = LogisticRegression(C=1.0).fit(X_train, y_train)
# Accuracy on the training data
print("Training accuracy: " + str(clf.score(X_train, y_train)))

# Accuracy on the test data
print("Test accuracy: " + str(clf.score(X_test, y_test)))

test_comment1 = '这个宝贝还是比较不错滴'
test_comment2 = '很不好,太差了'

test = []
test.append(process_text(test_comment2))
print(test)
print(vectorizer.transform(test))
print(clf.predict(vectorizer.transform(test)))
{'C': 10.0}
              precision    recall  f1-score   support

           0       0.84      0.61      0.71      1250
           1       0.69      0.89      0.78      1250

   micro avg       0.75      0.75      0.75      2500
   macro avg       0.77      0.75      0.74      2500
weighted avg       0.77      0.75      0.74      2500

Training accuracy: 0.9952882827030378
Test accuracy: 0.7484
['很, 不好, 太差, 了']
  (0, 55400)	0.8064523512198745
  (0, 15496)	0.591299082708519
['0']

Training an SVM model

from sklearn import svm
parameters = {'kernel':('linear', 'rbf', 'poly', 'sigmoid'), 'C':np.logspace(-3, 3, 7)}
svc = svm.SVC(gamma='scale')
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
{'C': 1.0, 'kernel': 'linear'}
              precision    recall  f1-score   support

           0       0.85      0.61      0.71      1250
           1       0.70      0.89      0.78      1250

   micro avg       0.75      0.75      0.75      2500
   macro avg       0.77      0.75      0.75      2500
weighted avg       0.77      0.75      0.75      2500

Still using the SVM model, but now using Bayesian Optimization to find the best hyperparameters

from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

# Objective for Bayesian optimization: mean 5-fold CV accuracy of an RBF SVM.
# C and gamma are both searched on a log10 scale.
def svm_cv(C, gamma):
    model = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(model, X_train, y_train, cv=5).mean()
    return val

pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)

svm_bo.maximize()
|   iter    |  target   |     C     |   gamma   |
-------------------------------------------------
|  1        |  0.6202   |  0.1987   |  16.93    |
|  2        |  0.6202   |  0.9998   |  8.928    |
|  3        |  0.6202   |  0.381    |  12.99    |
|  4        |  0.6202   |  0.2872   |  15.86    |
|  5        |  0.6202   |  0.845    |  7.817    |
|  6        |  0.6201   |  0.9857   |  2.006    |
|  7        |  0.6202   |  0.2213   |  20.0     |
|  8        |  0.6202   |  0.04001  |  2.014    |
|  9        |  0.6202   |  0.9077   |  20.0     |
|  10       |  0.6201   |  0.6421   |  2.061    |
|  11       |  0.6202   |  0.8399   |  20.0     |
|  12       |  0.6201   |  0.2333   |  2.027    |
|  13       |  0.6202   |  0.5979   |  19.97    |
|  14       |  0.6202   |  0.6701   |  2.109    |
|  15       |  0.6202   |  0.1631   |  19.95    |
|  16       |  0.6201   |  0.4138   |  2.022    |
|  17       |  0.6202   |  0.1256   |  19.98    |
|  18       |  0.6201   |  0.09698  |  2.062    |
|  19       |  0.6202   |  0.3008   |  19.91    |
|  20       |  0.6202   |  0.281    |  19.97    |
|  21       |  0.6202   |  0.6433   |  2.072    |
|  22       |  0.6202   |  0.5776   |  19.98    |
|  23       |  0.6201   |  0.8474   |  2.026    |
|  24       |  0.6202   |  0.8294   |  19.94    |
|  25       |  0.6202   |  0.005122 |  19.99    |
|  26       |  0.6201   |  0.9513   |  2.034    |
|  27       |  0.6202   |  0.03517  |  19.99    |
|  28       |  0.6201   |  0.1467   |  2.003    |
|  29       |  0.6202   |  0.5118   |  19.99    |
|  30       |  0.6201   |  0.7277   |  2.028    |
=================================================