Contents
1. Tasks
- Datasets: one Chinese and one English
  - Chinese dataset: THUCNews
    THUCNews subset: https://pan.baidu.com/s/1hugrfRu (password: qfud)
  - English dataset: IMDB (sentiment analysis)
- Download and explore the IMDB dataset
  References: "影评文本分类 | TensorFlow"; 科赛 - Kesci.com
- Learn the basic concepts: recall, accuracy, ROC curve, AUC, PR curve
2. Download and explore the IMDB dataset
3. Exploring the THUCNews subset
1. Import packages
import jieba
import pandas as pd
import tensorflow as tf
from collections import Counter
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
2. Read the files
path = 'E:/机器学习/Tensorflow学习/cnews/'
train_data = pd.read_csv(path + 'cnews.train.txt', names=['title', 'content'], sep='\t',engine='python',encoding='UTF-8') # (50000, 2)
test_data = pd.read_csv(path + 'cnews.test.txt', names=['title', 'content'], sep='\t',engine='python',encoding='UTF-8') # (10000, 2)
val_data = pd.read_csv(path + 'cnews.val.txt', names=['title', 'content'], sep='\t',engine='python',encoding='UTF-8') # (5000, 2)
# Keep only the first 50 rows of each split to speed up exploration
train_data = train_data.head(50)
test_data = test_data.head(50)
val_data = val_data.head(50)
3. Read the stop words
# Read the stop-word list, one word per line
def read_stopword(filename):
    stopword = []
    with open(filename, 'r', encoding='UTF-8') as fp:
        for line in fp:
            stopword.append(line.rstrip('\n'))
    return stopword

stopword = read_stopword(path + 'stopword.txt')
4. Tokenize the data and remove stop words
def cut_data(data, stopword):
    stopword = set(stopword)  # set lookup avoids the O(n^2) remove loop
    words = []
    for content in data['content']:
        # Tokenize with jieba, then filter out stop words
        word = [w for w in jieba.cut(content) if w not in stopword]
        words.append(word)
    data['content'] = words
    return data

train_data = cut_data(train_data, stopword)
test_data = cut_data(test_data, stopword)
val_data = cut_data(val_data, stopword)
train_data.shape
(50, 2)
5. Collect the word list
def word_list(data):
    all_word = []
    for word in data['content']:
        all_word.extend(word)
    return all_word

word_list(train_data)
6. Feature extraction and vectorization
def feature(train_data, test_data, val_data):
    content = pd.concat([train_data['content'], test_data['content'], val_data['content']], ignore_index=True)
    # CountVectorizer alternative (note: it expects strings, so the token
    # lists would first have to be joined with spaces):
    # count_vec = CountVectorizer(max_features=300, min_df=2)
    # count_vec.fit_transform(content)
    # train_fea = count_vec.transform(train_data['content']).toarray()
    # test_fea = count_vec.transform(test_data['content']).toarray()
    # val_fea = count_vec.transform(val_data['content']).toarray()
    # gensim >= 4.0 renamed size -> vector_size and iter -> epochs
    model = Word2Vec(content, vector_size=100, min_count=1, window=10, epochs=10)
    # model[x] is deprecated; look up word vectors through model.wv
    train_fea = train_data['content'].apply(lambda x: model.wv[x])
    test_fea = test_data['content'].apply(lambda x: model.wv[x])
    val_fea = val_data['content'].apply(lambda x: model.wv[x])
    return train_fea, test_fea, val_fea

train_fea, test_fea, val_fea = feature(train_data, test_data, val_data)
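Note that indexing the word-vector table with a token list yields one vector per word, so each document becomes a variable-length matrix. Most classifiers need a fixed-length input, and a common remedy is mean pooling over the word vectors. A minimal numpy sketch, using a hypothetical two-word embedding table in place of the trained model (the 4-dimensional vectors are made up to keep the example readable):

```python
import numpy as np

# Hypothetical word embeddings standing in for the trained model's vectors
embeddings = {
    "股票": np.array([0.1, 0.2, 0.3, 0.4]),
    "上涨": np.array([0.5, 0.6, 0.7, 0.8]),
}

def doc_vector(words, emb, dim=4):
    """Mean-pool word vectors into one fixed-length document vector."""
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return np.zeros(dim)  # all-zero vector for empty/unknown documents
    return np.mean(vecs, axis=0)

v = doc_vector(["股票", "上涨", "未收录"], embeddings)
print(v)  # mean of the two known word vectors; the unknown word is skipped
```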
all_word = []
all_word.extend(word_list(train_data))
all_word.extend(word_list(test_data))
all_word.extend(word_list(val_data))
all_word = list(set(all_word))
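Counter is imported at the top but never used; one natural use here is ranking all_word by frequency before deduplicating, so the vocabulary keeps only the most common words. A small sketch with toy tokens (the top-k cut-off and the id-0-for-padding convention are assumptions, not part of the notebook):

```python
from collections import Counter

# Toy token list standing in for all_word before deduplication
tokens = ["体育", "比赛", "体育", "财经", "股票", "体育", "财经"]

counts = Counter(tokens)
# Keep the k most frequent words, reserving id 0 for padding/unknown
vocab = {w: i + 1 for i, (w, _) in enumerate(counts.most_common(3))}
print(vocab)  # "体育" (3 occurrences) gets id 1, "财经" (2) gets id 2
```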
4. Basic concepts: recall, accuracy, ROC curve, AUC, PR curve
Before introducing these concepts, let's first define TP, FP, TN, FN:
- TP (True Positives): number of samples predicted positive that are actually positive
- FP (False Positives): number of samples predicted positive that are actually negative
- TN (True Negatives): number of samples predicted negative that are actually negative
- FN (False Negatives): number of samples predicted negative that are actually positive
4.1 Recall
Recall measures coverage: of all actual positive samples, the fraction that the model predicts as positive.
R = \frac{TP}{TP+FN}
Recall API:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_predict)
# For binary labels this returns a single float; pass average=None to get one recall per class
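A quick check on toy labels, showing both the binary float and the per-class variant:

```python
from sklearn.metrics import recall_score

y_test =    [0, 1, 1, 1, 0, 1]
y_predict = [0, 1, 0, 1, 1, 1]

# TP = 3, FN = 1  ->  recall = 3/4
print(recall_score(y_test, y_predict))                # 0.75
# One recall per class (class 0 first, then class 1)
print(recall_score(y_test, y_predict, average=None))  # [0.5  0.75]
```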
4.2 Accuracy
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Accuracy API:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_predict)
4.3 Precision
Precision is the fraction of samples predicted positive that are actually positive.
P = \frac{TP}{TP+FP}
Precision API:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_predict)
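On toy data, precision and recall can disagree, which is exactly the tension the F1 score resolves:

```python
from sklearn.metrics import precision_score, recall_score

y_test =    [0, 1, 1, 1, 0, 0]
y_predict = [0, 1, 0, 1, 1, 1]

# Of 4 predicted positives, 2 are correct -> precision = 0.5
print(precision_score(y_test, y_predict))  # 0.5
# Of 3 actual positives, 2 are found -> recall = 2/3
print(recall_score(y_test, y_predict))
```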
4.4 F1 score
F1 is the harmonic mean of precision and recall. It is closer to the smaller of the two, so F1 is largest when P and R are close. F1 is often used for problems with imbalanced class distributions, recommender systems, etc.
\frac{2}{F1} = \frac{1}{P} + \frac{1}{R}
F1 = \frac{2TP}{2TP+FP+FN}
F1 score API:
from sklearn.metrics import f1_score
f1_score(y_test, y_predict)
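The harmonic-mean identity above can be verified directly on toy labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_test =    [0, 1, 1, 1, 0, 0]
y_predict = [0, 1, 0, 1, 1, 1]

p = precision_score(y_test, y_predict)  # 0.5
r = recall_score(y_test, y_predict)     # 2/3
f1 = f1_score(y_test, y_predict)

# F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN) = 4/7, closer to the smaller of P and R
print(f1)
print(abs(f1 - 2 * p * r / (p + r)) < 1e-9)  # True
```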
4.5 Confusion matrix
Confusion matrix API:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)  # avoid shadowing the imported function name
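For a binary problem, sklearn lays the matrix out with true labels as rows and predicted labels as columns, so the four counts can be unpacked with ravel():

```python
from sklearn.metrics import confusion_matrix

y_test =    [0, 1, 1, 1, 0, 0]
y_predict = [0, 1, 0, 1, 1, 1]

cm = confusion_matrix(y_test, y_predict)
# Row 0 holds TN and FP (true class 0); row 1 holds FN and TP (true class 1)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 2 1 2
```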
4.6 AUC
AUC measures the performance (generalization ability) of a binary classifier.
from sklearn.metrics import roc_auc_score
# roc_auc_score takes the true labels and predicted scores/probabilities;
# sklearn's auc() instead integrates an existing (fpr, tpr) curve from roc_curve
auc_value = roc_auc_score(y_test, y_probs)
4.7 Roc曲线
ROC全称是“受试者工作特征”(Receiver Operating Characteristic)。ROC曲线的面积就是AUC(Area Under the Curve)
ROC曲线越接近左上角,代表模型越好,即ACU接近1
from sklearn.metrics import roc_auc_score, auc
import matplotlib.pyplot as plt
y_predict = model.predict(x_test)
y_probs = model.predict_proba(x_test) #模型的预测得分
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_probs)
roc_auc = auc(fpr, tpr) #auc为Roc曲线下的面积
#开始画ROC曲线
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.xlabel('False Positive Rate') #横坐标是fpr
plt.ylabel('True Positive Rate') #纵坐标是tpr
plt.title('Receiver operating characteristic example')
plt.show()
There is also the notion of a "cut-off" (threshold). After scoring the test samples, the model outputs each sample's probability of belonging to a class. For example, if t1's probability of class P is 0.3 and we treat probabilities below 0.5 as class N, then t1 is assigned to N; that 0.5 is the cut-off. The three key ingredients for computing ROC are TPR, FPR, and the cut-off.
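Moving the cut-off is exactly what traces out the ROC curve: each threshold produces one (FPR, TPR) point. A tiny sketch applying two different cut-offs to the same scores (the probabilities are made up):

```python
import numpy as np

# Made-up positive-class probabilities for four test samples
y_probs = np.array([0.30, 0.55, 0.48, 0.91])

# Each cut-off yields a different labelling, hence a different (FPR, TPR) point
labels_at_050 = (y_probs >= 0.50).astype(int)
labels_at_040 = (y_probs >= 0.40).astype(int)
print(labels_at_050)  # [0 1 0 1]
print(labels_at_040)  # [0 1 1 1]
```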
The y-axis is TPR, the sensitivity (true positive rate): the fraction of all actual positives that are predicted positive.
TPR = \frac{TP}{TP+FN}
The x-axis is FPR (false positive rate): the fraction of all actual negatives that are incorrectly predicted positive (equal to 1 - specificity).
FPR = \frac{FP}{TN+FP}
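Putting the two formulas together: roc_curve returns one (FPR, TPR) pair per cut-off, and auc integrates them. A small end-to-end check on a four-sample example:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_test  = np.array([0, 0, 1, 1])
y_probs = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)
print(fpr)      # FP / (TN + FP) at each cut-off
print(tpr)      # TP / (TP + FN) at each cut-off
print(roc_auc)  # 0.75
```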