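1. Model Theory and Applications
The following are classic questions, and working through them helps build a deeper understanding of the model. In particular, the second-derivative computation for logistic regression can be used to show whether a function is convex.
1.1 Logistic Regression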
Suppose we have training data $D=\{(\mathbf{x}_1,y_1),...,(\mathbf{x}_n,y_n)\}$, where each $(\mathbf{x}_i,y_i)$ is one sample, $\mathbf{x}_i\in \mathcal{R}^D$ is the sample's feature vector, and $y_i$ is the sample's label, taking the value $0$ or $1$. In logistic regression, the model parameters are $(\mathbf{w},b)$. Vectors are generally written in bold. Answer the following questions.
(a) Under the logistic regression model, write down the objective function, that is, the quantity we need to minimize (also called the loss function). Regularization does not need to be considered.
$$L(\mathbf{w},b) = -\sum_{i=1}^n \Big[ y_i\log\sigma(\mathbf{w}^T\mathbf{x}_i+b) + (1-y_i)\log[1-\sigma(\mathbf{w}^T\mathbf{x}_i+b)] \Big]$$
Here $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function, and the parameters are obtained as $\arg\min_{\mathbf{w},b} L(\mathbf{w},b)$.
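As a minimal sketch of the objective in (a), the loss can be evaluated directly in plain Python. The helper names `sigmoid` and `logistic_loss` and the toy data are assumptions made for illustration, not part of the original exercise.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logistic_loss(w, b, X, y):
    """Negative log-likelihood summed over all samples (no regularization)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# Hypothetical toy data: three samples with two features each
X_toy = [[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7]]
y_toy = [1, 0, 1]

# With w = 0 and b = 0 every prediction is 0.5, so the loss is n * log(2)
loss_at_zero = logistic_loss([0.0, 0.0], 0.0, X_toy, y_toy)
print(loss_at_zero)
```

Evaluating at the all-zero parameters is a convenient sanity check, because the expected value $n\log 2$ is known in closed form.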
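(b) Compute the gradient (the derivatives) of $L(\mathbf{w},b)$, showing the necessary intermediate steps.
Using $\sigma'(z) = \sigma(z)[1-\sigma(z)]$:
$$
\begin{aligned}
\frac{\partial L(\mathbf{w},b)}{\partial \mathbf{w}}
&= -\sum_{i=1}^n \left[ y_i\frac{\sigma(\mathbf{w}^T\mathbf{x}_i + b)[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\mathbf{x}_i}{\sigma(\mathbf{w}^T\mathbf{x}_i+b)} + (1 - y_i)\frac{(-1)\sigma(\mathbf{w}^T\mathbf{x}_i + b)[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\mathbf{x}_i}{1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)} \right] \\
&= -\sum_{i=1}^n \Big[ y_i[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\mathbf{x}_i + (y_i - 1)\sigma(\mathbf{w}^T\mathbf{x}_i + b)\mathbf{x}_i \Big] \\
&= -\sum_{i=1}^n \Big[ [y_i - y_i\sigma(\mathbf{w}^T\mathbf{x}_i + b)]\mathbf{x}_i + y_i\sigma(\mathbf{w}^T\mathbf{x}_i + b)\mathbf{x}_i - \sigma(\mathbf{w}^T\mathbf{x}_i + b)\mathbf{x}_i \Big] \\
&= -\sum_{i=1}^n [y_i - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\mathbf{x}_i \\
&= \sum_{i=1}^n [\sigma(\mathbf{w}^T\mathbf{x}_i + b) - y_i]\mathbf{x}_i
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial L(\mathbf{w},b)}{\partial b}
&= -\sum_{i=1}^n \left[ y_i\frac{\sigma(\mathbf{w}^T\mathbf{x}_i + b)[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]}{\sigma(\mathbf{w}^T\mathbf{x}_i + b)} + (1 - y_i)\frac{(-1)\sigma(\mathbf{w}^T\mathbf{x}_i + b)[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]}{1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)} \right] \\
&= -\sum_{i=1}^n \Big[ y_i[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)] + (1 - y_i)(-1)\sigma(\mathbf{w}^T\mathbf{x}_i + b) \Big] \\
&= -\sum_{i=1}^n [y_i - \sigma(\mathbf{w}^T\mathbf{x}_i + b)] \\
&= \sum_{i=1}^n [\sigma(\mathbf{w}^T\mathbf{x}_i + b) - y_i]
\end{aligned}
$$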
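A derivation like the one in (b) can be sanity-checked numerically. The sketch below compares the closed-form gradient $\sum_i[\sigma(\mathbf{w}^T\mathbf{x}_i+b)-y_i]\mathbf{x}_i$ against central finite differences of the loss. The function names, the toy data, and the step size `eps` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, X, y):
    """Unregularized logistic loss summed over samples."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        total -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

def analytic_gradient(w, b, X, y):
    """Closed-form gradient: sum of (sigma - y_i) * x_i, and sum of (sigma - y_i)."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
        for j, xj in enumerate(x_i):
            grad_w[j] += err * xj
        grad_b += err
    return grad_w, grad_b

def numeric_gradient(w, b, X, y, eps=1e-6):
    """Central finite differences of the loss, one coordinate at a time."""
    grad_w = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        grad_w.append((loss(up, b, X, y) - loss(down, b, X, y)) / (2 * eps))
    grad_b = (loss(w, b + eps, X, y) - loss(w, b - eps, X, y)) / (2 * eps)
    return grad_w, grad_b

# Hypothetical toy data and an arbitrary evaluation point
X_toy = [[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7]]
y_toy = [1, 0, 1]
w0, b0 = [0.2, -0.4], 0.1

ga_w, ga_b = analytic_gradient(w0, b0, X_toy, y_toy)
gn_w, gn_b = numeric_gradient(w0, b0, X_toy, y_toy)
max_diff = max(abs(a - n) for a, n in zip(ga_w + [ga_b], gn_w + [gn_b]))
print(max_diff)
```

If the two gradients disagree by more than rounding error, the sign or a factor in the derivation is usually the culprit.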
(c) Write the batch gradient descent updates for $\mathbf{w}$ and $b$.
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t\sum_{i=1}^n[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i]\mathbf{x}_i$$
$$b^{t+1} = b^t - \eta_t\sum_{i=1}^n[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i]$$
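The batch updates can be sketched as a short training loop. This is a minimal illustration that assumes a constant step size in place of $\eta_t$ and a fixed number of iterations. The function name `batch_gradient_descent` and the one-dimensional toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_descent(X, y, lr=0.1, n_iters=200):
    """Full-batch updates: every step uses the gradient summed over all samples."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iters):
        # Residuals sigma(w^T x_i + b) - y_i, computed with the current parameters
        errors = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
            for x_i, y_i in zip(X, y)
        ]
        grad_w = [sum(e * x_i[j] for e, x_i in zip(errors, X)) for j in range(len(w))]
        grad_b = sum(errors)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical linearly separable data: negative inputs are class 0, positive are class 1
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]

w_fit, b_fit = batch_gradient_descent(X_toy, y_toy)
predictions = [1 if sigmoid(w_fit[0] * x[0] + b_fit) >= 0.5 else 0 for x in X_toy]
print(w_fit, b_fit, predictions)
```

Both parameters are updated from the same set of residuals, which matches the simultaneous update written in the formulas.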
(d) Suppose an L2 regularization term is added to the objective in (a). Write the batch gradient descent updates for $\mathbf{w}$ and $b$.
With the regularized objective $L(\mathbf{w},b) + \lambda\|\mathbf{w}\|_2^2$, the gradient with respect to $\mathbf{w}$ gains the term $2\lambda\mathbf{w}$, while the gradient with respect to $b$ is unchanged because the penalty does not involve $b$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t\left(\sum_{i=1}^n[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i]\mathbf{x}_i + 2\lambda \mathbf{w}^t\right)$$
$$b^{t+1} = b^t - \eta_t\sum_{i=1}^n[\sigma((\mathbf{w}^t)^T\mathbf{x}_i + b^t) - y_i]$$
(e) Starting from (b), differentiate once more with respect to $\mathbf{w}$ (this gives the second derivative, whose dimension is $D\times D$). This second derivative is also called the Hessian matrix (https://en.wikipedia.org/wiki/Hessian_matrix). For derivatives with respect to matrices and vectors, see https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
$$\frac{\partial^2 L}{\partial \mathbf{w}\,\partial \mathbf{w}^T} = \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i\mathbf{x}_i^T$$
$$\frac{\partial^2 L}{\partial \mathbf{w}\,\partial b} = \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i$$
$$\frac{\partial^2 L}{\partial b\,\partial \mathbf{w}^T} = \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i^T$$
$$\frac{\partial^2 L}{\partial b^2} = \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)$$
Note that the $D\times D$ block uses the outer product $\mathbf{x}_i\mathbf{x}_i^T$, not the scalar $\mathbf{x}_i^T\mathbf{x}_i$. Stacking the four blocks gives the full $(D+1)\times(D+1)$ Hessian over $(\mathbf{w},b)$:
$$H = \left( \begin{array}{cc} \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i\mathbf{x}_i^T & \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i \\ \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i^T & \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b) \end{array} \right)$$
(f) Show that the Hessian matrix obtained in (e) is positive semidefinite. Hint: to prove that a $D\times D$ matrix $H$ is positive semidefinite, show that $v^{T}Hv \ge 0$ for every nonzero vector $v\in \mathcal{R}^D$.
Derive or explain:
Proof:
The Hessian over $(\mathbf{w},b)$ is $(D+1)\times(D+1)$, so let $v^T = [\mathbf{u}^T, c]$ with $\mathbf{u}\in\mathcal{R}^D$ and $c\in\mathcal{R}$.
Then
$$
\begin{aligned}
v^{T}Hv
&= \mathbf{u}^T\Big(\sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{x}_i\mathbf{x}_i^T\Big)\mathbf{u} + 2c\sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\,\mathbf{u}^T\mathbf{x}_i + c^2\sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b) \\
&= \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)\big[(\mathbf{u}^T\mathbf{x}_i)^2 + 2c\,\mathbf{u}^T\mathbf{x}_i + c^2\big] \\
&= \sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)(\mathbf{u}^T\mathbf{x}_i + c)^2
\end{aligned}
$$
Because $\sigma(\mathbf{w}^T\mathbf{x}_i + b)\in(0,1)$, both $1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)$ and $\sigma(\mathbf{w}^T\mathbf{x}_i + b)$ are positive, and $(\mathbf{u}^T\mathbf{x}_i + c)^2 \ge 0$.
So $\sum_{i=1}^n[1 - \sigma(\mathbf{w}^T\mathbf{x}_i + b)]\sigma(\mathbf{w}^T\mathbf{x}_i + b)(\mathbf{u}^T\mathbf{x}_i + c)^2 \ge 0$.
Therefore $v^{T}Hv \ge 0$, which means $H$ is positive semidefinite and the loss is convex. $H$ is strictly positive definite only when no nonzero $v$ satisfies $\mathbf{u}^T\mathbf{x}_i + c = 0$ for every $i$.
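The argument in (f) can also be checked numerically. The sketch below builds the full Hessian over $(\mathbf{w}, b)$ as a sum of weighted outer products of the augmented vectors $[\mathbf{x}_i; 1]$ and evaluates $v^THv$ for random $v$. The helper names, the toy data, and the evaluation point are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hessian(w, b, X):
    """(D+1)x(D+1) Hessian, with parameters ordered as (w_1, ..., w_D, b)."""
    d = len(w)
    H = [[0.0] * (d + 1) for _ in range(d + 1)]
    for x_i in X:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        s = p * (1.0 - p)          # per-sample weight sigma * (1 - sigma) > 0
        a = list(x_i) + [1.0]      # augmented vector [x_i; 1]
        for r in range(d + 1):
            for c in range(d + 1):
                H[r][c] += s * a[r] * a[c]
    return H

def quadratic_form(H, v):
    """Compute v^T H v."""
    n = len(v)
    return sum(v[r] * H[r][c] * v[c] for r in range(n) for c in range(n))

random.seed(0)
X_toy = [[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7]]
H = hessian([0.2, -0.4], 0.1, X_toy)

# Evaluate the quadratic form for many random directions
values = [
    quadratic_form(H, [random.uniform(-5, 5) for _ in range(3)])
    for _ in range(1000)
]
min_value = min(values)
print(min_value)
```

A random search like this cannot prove semidefiniteness, but a negative value would immediately expose an error in the Hessian.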
2. Sentiment Analysis Project
Reading the text
import re
import jieba
import numpy as np

# Read a training file; every review in it receives the same label `val`
def read_train_file(file_path, comments, labels, val):
    with open(file_path, 'r', encoding='utf-8') as f:
        # Strip spaces and newlines; the patterns below match the tags with spaces removed
        text = f.read().replace(' ', '').replace('\n', '')
    reg = r'<reviewid="\d{1,4}">(.*?)</review>'
    results = re.findall(reg, text)
    for result in results:
        # Segment with jieba and join the tokens with commas
        result = ','.join(jieba.cut(result))
        comments.append(result)
        labels.append(val)

# Read the test file, where each review tag carries its own label attribute
def read_test_file(file_path, comments, labels):
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read().replace(' ', '').replace('\n', '')
    reg = r'<reviewid="\d{1,4}".*?</review>'
    results = re.findall(reg, text)
    for result in results:
        label_reg = r'<reviewid="\d{1,4}"label="(\d)">'
        com_reg = r'>(.*?)</review>'
        label = re.findall(label_reg, result)[0]
        comment = re.findall(com_reg, result)[0]
        labels.append(label)
        comment = ','.join(jieba.cut(comment))
        comments.append(comment)
    assert len(comments) == len(labels)

# TODO: read the files and store their contents in the variables below
train_comments = []
train_labels = []
test_comments = []
test_labels = []

def process_file():
    """
    Read the training and test data and apply some preprocessing.
    """
    train_pos_file = "data/train.positive.txt"
    train_neg_file = "data/train.negative.txt"
    test_comb_file = "data/test.combined.txt"
    # Read the positive reviews
    read_train_file(train_pos_file, train_comments, train_labels, '1')
    # Read the negative reviews
    read_train_file(train_neg_file, train_comments, train_labels, '0')
    # Read the test data
    read_test_file(test_comb_file, test_comments, test_labels)

process_file()
print(len(train_comments), len(train_labels), len(test_comments), len(test_labels))
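To make the regular expressions above easier to follow, here is a small standalone sketch that applies the same patterns to inline strings instead of files. The sample strings are hypothetical. Their tag layout (`<review id="...">` and `<review id="..." label="...">`) is an assumption inferred from the patterns, since the data files themselves are not shown.

```python
import re

# Hypothetical raw text in the assumed training-file layout
train_sample = '<review id="1">\n很好 用\n</review>\n<review id="2">\n太差了\n</review>'

# Same normalization as in the reader functions: drop spaces and newlines
train_text = train_sample.replace(' ', '').replace('\n', '')
train_reviews = re.findall(r'<reviewid="\d{1,4}">(.*?)</review>', train_text)

# Hypothetical raw text in the assumed test-file layout, with a label attribute
test_sample = '<review id="7" label="0">\n不 好\n</review>'
test_text = test_sample.replace(' ', '').replace('\n', '')
test_blocks = re.findall(r'<reviewid="\d{1,4}".*?</review>', test_text)

parsed = []
for block in test_blocks:
    label = re.findall(r'<reviewid="\d{1,4}"label="(\d)">', block)[0]
    comment = re.findall(r'>(.*?)</review>', block)[0]
    parsed.append((label, comment))

print(train_reviews)
print(parsed)
```

Because spaces are removed before matching, the patterns look for `reviewid` as one word. Stripping spaces suits Chinese text but would merge words in a space-delimited language.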
Simple visual analysis
import matplotlib.pyplot as plt
import numpy as np

pos_comments_count = []
neg_comments_count = []
pos_train_comments = []
neg_train_comments = []

def get_comments():
    # Split the training comments by label and record each comment's string length
    for flag, comment in zip(train_labels, train_comments):
        length = len(comment)
        if flag == '1':
            pos_comments_count.append(length)
            pos_train_comments.append(comment)
        else:
            neg_comments_count.append(length)
            neg_train_comments.append(comment)

get_comments()
pos_total_count = len(pos_comments_count)
neg_total_count = len(neg_comments_count)
print(pos_total_count, neg_total_count)

# Compute the fraction of comments that have each string length
def cal_statics(comments_count):
    temp_dict = {}
    total_num = len(comments_count)
    for length in comments_count:
        temp_dict[length] = temp_dict.get(length, 0) + 1
    for key in temp_dict:
        temp_dict[key] = temp_dict[key] / total_num
    return temp_dict

pos_statics = cal_statics(pos_comments_count)
neg_statics = cal_statics(neg_comments_count)
print(len(pos_statics), len(neg_statics))

# Sort by string length
pos_statics = dict(sorted(pos_statics.items(), key=lambda x: x[0]))
neg_statics = dict(sorted(neg_statics.items(), key=lambda x: x[0]))

# Plot the histograms for the positive and negative samples
pos_x = list(pos_statics.keys())
pos_y = list(pos_statics.values())
neg_x = list(neg_statics.keys())
neg_y = list(neg_statics.values())

fig = plt.figure()
plt.bar(pos_x, pos_y, 1, color="red")
plt.xlabel("every comment string length")
plt.ylabel("percentage of this string length")
plt.title("positive comments histogram")

fig = plt.figure()
plt.bar(neg_x, neg_y, 1, color="green")
plt.xlabel("every comment string length")
plt.ylabel("percentage of this string length")
plt.title("negative comments histogram")
plt.show()
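The normalization step behind the histograms can also be written with `collections.Counter`. This is an alternative sketch, not the original implementation. The function name `length_distribution` and the sample lengths are made up for illustration.

```python
from collections import Counter

def length_distribution(lengths):
    """Map each distinct length to the fraction of items having that length."""
    counts = Counter(lengths)
    total = len(lengths)
    # Sorting the keys up front removes the need for a separate sort step
    return {length: counts[length] / total for length in sorted(counts)}

dist = length_distribution([3, 5, 3, 8])
print(dist)
```

The fractions sum to one, so the bar heights of the two classes stay comparable even when the classes differ in size.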
import collections
import jieba

def get_top20_words(comments):
    word_library = []  # every token from every comment
    for comment in comments:
        for i in jieba.cut(comment):
            word_library.append(i)
    word_dic = collections.Counter(word_library).most_common(20)
    top20_list = [i[0] for i in word_dic]
    return top20_list

pos_top20_words = get_top20_words(pos_train_comments)
neg_top20_words = get_top20_words(neg_train_comments)
print('pos_top20_words:' + str(pos_top20_words))
print('neg_top20_words:' + str(neg_top20_words))

# Treat words that appear in both the positive and the negative top-20 lists as stop words
stop_words = []
for word in pos_top20_words:
    if word in neg_top20_words and word.isalnum():
        stop_words.append(word)
print('stop_words:' + str(stop_words))

pos_top20_words:[',', ',', '的', '。', '了', '是', '!', '很', '我', '也', '在', '有', '~', '都', '好', '.', '不错', '就', '买', '这']
neg_top20_words:[',', ',', '的', '。', '了', '!', '是', '我', '不', '买', '就', '也', '都', '很', '有', '在', '?', '没有', '!', '.']
stop_words:['的', '了', '是', '很', '我', '也', '在', '有', '都', '就', '买']
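The stop-word heuristic above keeps only tokens that are frequent in both classes, on the reasoning that such tokens carry little sentiment. A compact sketch of the same idea on already-tokenized input follows. The helper names and the toy token lists are assumptions, and `k` is reduced from 20 to 3 to keep the example small.

```python
from collections import Counter

def top_k(tokens, k):
    """Return the k most frequent tokens."""
    return [word for word, _ in Counter(tokens).most_common(k)]

def shared_stop_words(pos_tokens, neg_tokens, k):
    """Tokens in both top-k lists, excluding punctuation via str.isalnum()."""
    neg_top = set(top_k(neg_tokens, k))
    return [word for word in top_k(pos_tokens, k) if word in neg_top and word.isalnum()]

# Hypothetical token streams for the two classes
pos_tokens = ['的', '的', '的', '好', '好', ',', ',']
neg_tokens = ['的', '的', '差', '差', '差', ',', ',']

stop_words_demo = shared_stop_words(pos_tokens, neg_tokens, k=3)
print(stop_words_demo)
```

The comma is shared by both lists but is filtered out by `isalnum()`, while class-specific words such as '好' and '差' are kept out of the stop list because they appear in only one top-k list.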
Text preprocessing
def text_preprocessing(comments):
    new_comments = []
    for comment in comments:
        new_sentence = ''
        for word in jieba.cut(comment):
            # Drop stop words, punctuation, and numbers
            if word not in stop_words and word.isalnum() and not word.isdigit():
                # Kept words are concatenated without a separator
                new_sentence += word
        new_comments.append(new_sentence)
    return new_comments

train_comments_new = text_preprocessing(train_comments)
test_comments_new = text_preprocessing(test_comments)
print(len(train_comments_new), len(test_comments_new))
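Extracting features from the text
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: the vectorizer is fitted on train_comments (the comma-joined jieba output),
# not on the train_comments_new returned by text_preprocessing
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_comments)  # features of the training data
y_train = np.array(train_labels)  # labels of the training data
X_test = vectorizer.transform(test_comments)  # features of the test data
y_test = np.array(test_labels)  # labels of the test data
print(np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))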
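To show what the TF-IDF features contain, here is a pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: tokens of two or more word characters, raw term counts, smoothed idf $\ln\frac{1+n}{1+\mathrm{df}}+1$, and L2-normalized rows. The function name `tfidf` and the two sample documents are assumptions for illustration, and the sketch returns dictionaries where the library returns a sparse matrix.

```python
import math
import re
from collections import Counter

# Default scikit-learn token pattern: runs of at least two word characters
TOKEN = re.compile(r'(?u)\b\w\w+\b')

def tfidf(docs):
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = {tok: tf[tok] * idf[tok] for tok in tf}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({tok: v / norm for tok, v in weights.items()})
    return vocab, rows

# Hypothetical comma-joined segmented comments, in the same shape as train_comments
docs = ['很,不好,太差,了', '很,不错,了']
vocab, rows = tfidf(docs)
print(vocab)
print(rows[0])
```

Single-character tokens such as '很' and '了' never reach the vocabulary under the default token pattern, which is why a transformed sentence can end up with fewer non-zero features than it has words.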
Training the model and choosing suitable hyperparameters
Training the model with logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

def process_text(text=''):
    # Keep only alphanumeric characters, then segment with jieba
    text = ''.join(e for e in text if e.isalnum())
    return ', '.join(jieba.cut(text))

# Search the inverse regularization strength C over 1e-3 ... 1e3 with 5-fold cross-validation
parameters = {'C': np.logspace(-3, 3, 7)}
lr = LogisticRegression(solver='liblinear')
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
# clf = LogisticRegression(C=1.0).fit(X_train, y_train)
# Print the accuracy on the training data
print("Accuracy on the training data: " + str(clf.score(X_train, y_train)))
# Print the accuracy on the test data
print("Accuracy on the test data: " + str(clf.score(X_test, y_test)))

test_comment1 = '这个宝贝还是比较不错滴'
test_comment2 = '很不好,太差了'
test = []
test.append(process_text(test_comment2))
print(test)
print(vectorizer.transform(test))
print(clf.predict(vectorizer.transform(test)))

{'C': 1.0}
              precision    recall  f1-score   support
           0       0.86      0.54      0.66      1250
           1       0.67      0.91      0.77      1250
   micro avg       0.73      0.73      0.73      2500
   macro avg       0.76      0.73      0.72      2500
weighted avg       0.76      0.73      0.72      2500
Accuracy on the training data: 0.8721636701797892
Accuracy on the test data: 0.7268
['很, 不好, 太差, 了']
(0, 10188) 0.8064523512198745
(0, 3669) 0.591299082708519
['0']
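The per-class columns of the classification report follow the standard definitions of precision, recall, and F1. A minimal sketch of those definitions is shown below. The function name and the four-sample label lists are hypothetical and unrelated to the results reported above.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class metrics, treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels, stored as strings like the labels in this project
y_true_demo = ['1', '1', '0', '0']
y_pred_demo = ['1', '0', '0', '0']

p1, r1, f1_1 = precision_recall_f1(y_true_demo, y_pred_demo, positive='1')
p0, r0, f1_0 = precision_recall_f1(y_true_demo, y_pred_demo, positive='0')
print(p1, r1, f1_1)
print(p0, r0, f1_0)
```

Read this way, the report above says the model rarely misses a positive review (recall 0.91 for class 1) but misses almost half of the negative ones (recall 0.54 for class 0).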
Training the model with SVM
from sklearn import svm

# Grid-search the kernel type and C with 5-fold cross-validation
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'), 'C': np.logspace(-3, 3, 7)}
svc = svm.SVC(gamma='scale')
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))

{'C': 1.0, 'kernel': 'sigmoid'}
              precision    recall  f1-score   support
           0       0.85      0.59      0.70      1250
           1       0.69      0.89      0.78      1250
   micro avg       0.74      0.74      0.74      2500
   macro avg       0.77      0.74      0.74      2500
weighted avg       0.77      0.74      0.74      2500
Still using the SVM model, but here Bayesian Optimization is used to search for the best hyperparameters
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

def svm_cv(C, gamma):
    # C and gamma are passed as base-10 exponents
    model = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(model, X_train, y_train, cv=5).mean()
    return val

# Exponent bounds: C in [1, 10], gamma in [1e2, 1e20]
pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)
svm_bo.maximize()

| iter | target | C | gamma |
-------------------------------------------------
| 1 | 0.6206 | 0.3705 | 6.928 |
| 2 | 0.6206 | 0.9682 | 4.705 |
| 3 | 0.6206 | 0.7015 | 7.333 |
| 4 | 0.6206 | 0.5141 | 12.72 |
| 5 | 0.6206 | 0.6732 | 6.483 |
| 6 | 0.6206 | 0.6284 | 19.99 |
| 7 | 0.6206 | 0.04032 | 19.99 |
| 8 | 0.6208 | 0.8602 | 2.037 |
| 9 | 0.6206 | 0.1939 | 20.0 |
| 10 | 0.6208 | 0.2209 | 2.017 |
| 11 | 0.6206 | 0.8674 | 20.0 |
| 12 | 0.6208 | 0.58 | 2.033 |
| 13 | 0.6208 | 0.859 | 2.009 |
| 14 | 0.6206 | 0.9947 | 19.94 |
| 15 | 0.6208 | 0.06059 | 2.017 |
| 16 | 0.6206 | 0.2054 | 19.95 |
| 17 | 0.6208 | 0.8543 | 2.084 |
| 18 | 0.6208 | 0.103 | 2.021 |
| 19 | 0.6206 | 0.8894 | 19.97 |
| 20 | 0.6208 | 0.3986 | 2.015 |
| 21 | 0.6208 | 0.4768 | 2.015 |
| 22 | 0.6206 | 0.2138 | 19.98 |
| 23 | 0.6208 | 0.3656 | 2.135 |
| 24 | 0.6208 | 0.4324 | 2.012 |
| 25 | 0.6206 | 0.06989 | 20.0 |
| 26 | 0.6208 | 0.5922 | 2.132 |
| 27 | 0.6208 | 0.9851 | 2.101 |
| 28 | 0.6208 | 0.1901 | 2.146 |
| 29 | 0.6206 | 0.2751 | 20.0 |
| 30 | 0.6208 | 0.5422 | 2.036 |
=================================================
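The target in the table above barely moves across the whole search range. One possible explanation, offered here as an assumption and not verified against the author's data, is that the searched `gamma` values ($10^2$ to $10^{20}$) are so large that the RBF kernel $\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$ is close to zero for any two distinct samples. The sketch below illustrates the effect on two hypothetical unit-norm vectors.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical unit-norm feature vectors (TF-IDF rows are L2-normalized by default)
x = [1.0, 0.0]
z = [0.6, 0.8]

for log_gamma in (-1, 0, 2):
    gamma = 10 ** log_gamma
    print(log_gamma, rbf_kernel(x, z, gamma), rbf_kernel(x, x, gamma))

k_small = rbf_kernel(x, z, 0.1)
k_large = rbf_kernel(x, z, 100)
```

With `gamma = 100` the similarity between the two vectors is already negligible, so every sample looks similar only to itself. If that is what happens here, searching exponents around or below zero may be more informative than the range 2 to 20.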
Features: adding n-gram features
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_comments)  # training features with bigrams added
y_train = np.array(train_labels)  # labels of the training data
X_test = vectorizer.transform(test_comments)  # test features with bigrams added
y_test = np.array(test_labels)  # labels of the test data
print(np.shape(X_train), np.shape(X_test), np.shape(y_train), np.shape(y_test))
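Setting `ngram_range=(1, 2)` makes the vectorizer emit every single token plus every pair of adjacent tokens, joined by a space. A small sketch of that expansion follows. The function name `word_ngrams` and the sample tokens are assumptions, and the real vectorizer applies this after its own tokenization.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Expand a token list into all n-grams with n in the given inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Hypothetical tokens from one segmented comment
grams = word_ngrams(['质量', '不好', '太差'])
print(grams)
```

Bigrams let the model see local word order, for example a negation followed by an adjective, at the cost of a much larger vocabulary. The feature indices in the outputs grow accordingly, from about 10,000 to about 55,000.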
Training the model with logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

def process_text(text=''):
    # Keep only alphanumeric characters, then segment with jieba
    text = ''.join(e for e in text if e.isalnum())
    return ', '.join(jieba.cut(text))

# Search the inverse regularization strength C over 1e-3 ... 1e3 with 5-fold cross-validation
parameters = {'C': np.logspace(-3, 3, 7)}
lr = LogisticRegression(solver='liblinear')
clf = GridSearchCV(lr, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
# clf = LogisticRegression(C=1.0).fit(X_train, y_train)
# Print the accuracy on the training data
print("Accuracy on the training data: " + str(clf.score(X_train, y_train)))
# Print the accuracy on the test data
print("Accuracy on the test data: " + str(clf.score(X_test, y_test)))

test_comment1 = '这个宝贝还是比较不错滴'
test_comment2 = '很不好,太差了'
test = []
test.append(process_text(test_comment2))
print(test)
print(vectorizer.transform(test))
print(clf.predict(vectorizer.transform(test)))

{'C': 10.0}
              precision    recall  f1-score   support
           0       0.84      0.61      0.71      1250
           1       0.69      0.89      0.78      1250
   micro avg       0.75      0.75      0.75      2500
   macro avg       0.77      0.75      0.74      2500
weighted avg       0.77      0.75      0.74      2500
Accuracy on the training data: 0.9952882827030378
Accuracy on the test data: 0.7484
['很, 不好, 太差, 了']
(0, 55400) 0.8064523512198745
(0, 15496) 0.591299082708519
['0']
Training the model with SVM
from sklearn import svm

# Grid-search the kernel type and C with 5-fold cross-validation
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'), 'C': np.logspace(-3, 3, 7)}
svc = svm.SVC(gamma='scale')
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))

{'C': 1.0, 'kernel': 'linear'}
              precision    recall  f1-score   support
           0       0.85      0.61      0.71      1250
           1       0.70      0.89      0.78      1250
   micro avg       0.75      0.75      0.75      2500
   macro avg       0.77      0.75      0.75      2500
weighted avg       0.77      0.75      0.75      2500
Still using the SVM model, but here Bayesian Optimization is used to search for the best hyperparameters
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC

def svm_cv(C, gamma):
    # C and gamma are passed as base-10 exponents
    model = SVC(C=10 ** C, gamma=10 ** gamma, random_state=1)
    val = cross_val_score(model, X_train, y_train, cv=5).mean()
    return val

# Exponent bounds: C in [1, 10], gamma in [1e2, 1e20]
pbounds = {'C': (0, 1), 'gamma': (2, 20)}
svm_bo = BayesianOptimization(svm_cv, pbounds=pbounds)
svm_bo.maximize()

| iter | target | C | gamma |
-------------------------------------------------
| 1 | 0.6202 | 0.1987 | 16.93 |
| 2 | 0.6202 | 0.9998 | 8.928 |
| 3 | 0.6202 | 0.381 | 12.99 |
| 4 | 0.6202 | 0.2872 | 15.86 |
| 5 | 0.6202 | 0.845 | 7.817 |
| 6 | 0.6201 | 0.9857 | 2.006 |
| 7 | 0.6202 | 0.2213 | 20.0 |
| 8 | 0.6202 | 0.04001 | 2.014 |
| 9 | 0.6202 | 0.9077 | 20.0 |
| 10 | 0.6201 | 0.6421 | 2.061 |
| 11 | 0.6202 | 0.8399 | 20.0 |
| 12 | 0.6201 | 0.2333 | 2.027 |
| 13 | 0.6202 | 0.5979 | 19.97 |
| 14 | 0.6202 | 0.6701 | 2.109 |
| 15 | 0.6202 | 0.1631 | 19.95 |
| 16 | 0.6201 | 0.4138 | 2.022 |
| 17 | 0.6202 | 0.1256 | 19.98 |
| 18 | 0.6201 | 0.09698 | 2.062 |
| 19 | 0.6202 | 0.3008 | 19.91 |
| 20 | 0.6202 | 0.281 | 19.97 |
| 21 | 0.6202 | 0.6433 | 2.072 |
| 22 | 0.6202 | 0.5776 | 19.98 |
| 23 | 0.6201 | 0.8474 | 2.026 |
| 24 | 0.6202 | 0.8294 | 19.94 |
| 25 | 0.6202 | 0.005122 | 19.99 |
| 26 | 0.6201 | 0.9513 | 2.034 |
| 27 | 0.6202 | 0.03517 | 19.99 |
| 28 | 0.6201 | 0.1467 | 2.003 |
| 29 | 0.6202 | 0.5118 | 19.99 |
| 30 | 0.6201 | 0.7277 | 2.028 |
=================================================