Datawhale Data Analysis Study, Day 08

森森_hi

Published 2021-06-23 11:10:55

Article link: https://blog.csdn.net/weixin_46350854/article/details/118102842

Day 08: Model Building and Evaluation (Evaluation)

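Three approaches to model evaluation are covered here:

Cross-validation: sklearn.model_selection

Cross-validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into a training set and a test set. (The data is split multiple times, and a separate model is trained for each split.) The most common form is k-fold cross-validation, where k is a number chosen by the user, usually 5 or 10.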
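To make the "split multiple times" idea concrete, here is a minimal pure-Python sketch of how k-fold splitting can partition sample indices. The helper names `k_fold_indices` and `k_fold_splits` are hypothetical and written only for illustration; this is not scikit-learn's implementation, and scikit-learn's own splitters additionally support shuffling and stratification.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_splits(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once, so every model is scored on data it did not see during training, and the k scores can then be averaged.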

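Precision measures how many of the samples predicted as positive are actually positive.

Recall measures how many of the actual positive samples are predicted as positive.

The f-score is the harmonic mean of precision and recall.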
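The three definitions above can be sketched directly from a pair of label lists. The function name `precision_recall_f1` and the toy labels are made up for illustration; in practice the scikit-learn metrics would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (illustrative only): 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1 value here sits closer to the recall than a plain average would.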

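Code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X_train, y_train: training features and labels prepared in earlier steps
lr = LogisticRegression(C=100)
# 10-fold cross-validation; returns one score per fold
scores = cross_val_score(lr, X_train, y_train, cv=10)
print("Average cross-validation score: {:.2f}".format(scores.mean()))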

Further reading:

https://www.jianshu.com/p/31efbe02e059

Confusion matrix: sklearn.metrics

A confusion matrix measures how accurately a classifier assigns samples to classes.

Accuracy: the proportion (or number) of samples classified correctly
(TP+TN)/Total = (35+50)/100 = 85%

Error rate (Misclassification/Error Rate): the proportion (or number) of samples classified incorrectly
(FP+FN)/Total = (10+5)/100 = 15%

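True Positive Rate: the proportion of actual positive samples that the classifier correctly predicts as positive. It is also called sensitivity or recall, and describes how sensitive the classifier is to the positive class.
TP/actual yes = 35/40 = 87.5%

False Positive Rate: the proportion of actual negative samples that the classifier incorrectly predicts as positive.
FP/actual no = 10/60 ≈ 17%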

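Specificity: the proportion of actual negative samples that the classifier also predicts as negative.
TN/actual no = 50/60 ≈ 83%

Precision: among all samples predicted as positive, the proportion that are truly positive.
TP/predicted yes = 35/45 ≈ 78%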

Prevalence: the proportion of positive samples in the data.
actual yes/Total = 40/100 = 40%
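The worked ratios above can be checked with a few lines of Python. The four counts (TP=35, TN=50, FP=10, FN=5) are inferred from the denominators used in the example (40 actual positives, 60 actual negatives, 45 predicted positives); the variable names are chosen only for this sketch.

```python
# Counts inferred from the worked example in the text.
tp, tn, fp, fn = 35, 50, 10, 5

total = tp + tn + fp + fn   # 100 samples
actual_yes = tp + fn        # 40 actual positives
actual_no = fp + tn         # 60 actual negatives
predicted_yes = tp + fp     # 45 predicted positives

metrics = {
    "accuracy": (tp + tn) / total,
    "error_rate": (fp + fn) / total,
    "true_positive_rate": tp / actual_yes,
    "false_positive_rate": fp / actual_no,
    "specificity": tn / actual_no,
    "precision": tp / predicted_yes,
    "prevalence": actual_yes / total,
}

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that the false positive rate and specificity always sum to 1, since both use the number of actual negatives as the denominator.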

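The confusion matrix takes the true labels and the predicted labels as input.

Precision, recall, and the f-score can be obtained with the classification_report function.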

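Code:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
# Model predictions (made on the training set here)
pred = lr.predict(X_train)
# Confusion matrix: true labels first, predicted labels second
print(confusion_matrix(y_train, pred))
# Precision, recall, and f1-score
print(classification_report(y_train, pred))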

Further reading:

https://blog.csdn.net/weixin_42462804/article/details/100015334?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-3.base&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-3.base

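ROC curve (Receiver Operating Characteristic Curve)

In sklearn, the ROC curve lives in the sklearn.metrics module.

The larger the area under the ROC curve, the better.

The ROC curve is a composite indicator of sensitivity and specificity for a continuous variable, and shows the relationship between the two graphically. Several different cutoff values are set on the continuous variable, and a sensitivity and specificity are computed for each one. The curve is then plotted with sensitivity on the vertical axis and (1 - specificity) on the horizontal axis. The larger the area under the curve, the higher the diagnostic accuracy. On the ROC curve, the point closest to the top-left corner of the plot is the cutoff at which both sensitivity and specificity are high.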
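The construction described above (sweep the cutoff, compute a rate pair at each step, then measure the area) can be sketched in pure Python. The helpers `roc_points` and `auc_trapezoid` and the toy scores are assumptions made for illustration; this is not how sklearn.metrics implements it internally.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per candidate cutoff."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Start at (0, 0): with a cutoff above every score, nothing is predicted positive.
    fpr, tpr = [0.0], [0.0]
    # Lower the cutoff step by step; a sample is predicted positive if score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        tpr.append(tp / n_pos)   # sensitivity
        fpr.append(fp / n_neg)   # 1 - specificity
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Toy data (illustrative only): two negatives and two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```

An area of 0.5 corresponds to ranking positives and negatives at random, and 1.0 to ranking every positive above every negative.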

condition positive (P): the number of real positive cases in the data (all the 1s in the data)

condition negative (N): the number of real negative cases in the data (all the 0s in the data)

true positive (TP): eqv. with hit (1s predicted as 1, i.e. the positives classified correctly)

true negative (TN): eqv. with correct rejection (0s predicted as 0, i.e. the negatives classified correctly)

false positive (FP): eqv. with false alarm, Type I error (0s predicted as 1)

false negative (FN): eqv. with miss, Type II error (1s predicted as 0)

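sensitivity, recall, hit rate, or true positive rate (TPR): TPR = TP / P = TP / (TP + FN)

fall-out or false positive rate (FPR): FPR = FP / N = FP / (FP + TN)

specificity or true negative rate (TNR): TNR = TN / N = TN / (TN + FP) = 1 - FPR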

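Code:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

# lr is the fitted LogisticRegression model; X_test, y_test are assumed to come from an earlier train/test split
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
plt.show()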

Further reading:

https://www.jianshu.com/p/2ca96fce7e81

https://blog.csdn.net/abcjennifer/article/details/7359370

This study is based on the Datawhale study check-in group:

Link: https://github.com/datawhalechina/hands-on-data-analysis
