Contents
1.1.1 Computing cross-validated metrics
1.1.2 Cross-validation iterators (commonly K-fold and StratifiedKFold)
1.2 Grid Search: finding a model's best parameters (includes cross-validation)
2.2 Binary ROC curve and AUC, and multi-class ROC curve and AUC (multi-class ROC comes in macro and micro variants)
Data preparation
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
model_datas = pd.read_csv('adultTest.csv',sep=',',header="infer")
model_datas.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 32561 non-null object
14 class 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
print(model_datas.head())
out:
age workclass fnlwgt ... hours-per-week native-country class
0 39 State-gov 77516 ... 40 United-States <=50K
1 50 Self-emp-not-inc 83311 ... 13 United-States <=50K
2 38 Private 215646 ... 40 United-States <=50K
3 53 Private 234721 ... 40 United-States <=50K
4 28 Private 338409 ... 40 Cuba <=50K
[5 rows x 15 columns]
x_data = model_datas.drop(labels='class',axis=1)
y = model_datas['class']
# Label encoding
Le = LabelEncoder()
y_new = Le.fit_transform(y)
classes = Le.classes_ # 1D array whose index is the encoded class, so classes[y_new] recovers the original labels
# one-hot
x_data = pd.get_dummies(data=x_data,columns=['workclass','education','marital-status','occupation',
'relationship','race','sex','native-country'])
print(x_data.head())
out:
age fnlwgt ... native-country_ Vietnam native-country_ Yugoslavia
0 39 77516 ... 0 0
1 50 83311 ... 0 0
2 38 215646 ... 0 0
3 53 234721 ... 0 0
4 28 338409 ... 0 0
[5 rows x 108 columns]
print(y_new)
out:
[0 0 0 ... 0 0 1]
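The round trip between the original labels and their integer codes can be checked on a small example. This is a minimal sketch; the toy label values below are illustrative stand-ins for the 'class' column, not rows from adultTest.csv.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'class' column (illustrative values only)
y_toy = np.array(['<=50K', '>50K', '<=50K', '>50K', '>50K'])

le = LabelEncoder()
y_codes = le.fit_transform(y_toy)

# classes_ is sorted, and the position of each class is its integer code,
# so indexing classes_ with the codes recovers the original labels
restored_by_index = le.classes_[y_codes]
# inverse_transform does the same thing through the public API
restored_by_api = le.inverse_transform(y_codes)

print(y_codes)
print(restored_by_index)
```

Both routes return the original strings; inverse_transform is probably the safer choice in library code because it validates the codes first.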
I. Parameter selection
1.0 See the official documentation
1.1 Cross-validation: evaluating model performance
1.1.1 Computing cross-validated metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
gbc = GradientBoostingClassifier(learning_rate=0.01,n_estimators=50,max_depth=2)
"""
learning_rate=0.01, # learning rate
n_estimators=50, # number of CART regression trees (boosting stages)
max_depth=2 # maximum depth of each CART regression tree
"""
scores_f1_macro = cross_val_score(estimator=gbc,X=x_data,y=y_new,scoring='f1_macro',cv=5,n_jobs=-1)
"""
estimator, # the estimator to cross-validate
X, # feature matrix
y=None, # target labels
scoring=None, # evaluation metric used for validation, one of ['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']
Tip: according to the sklearn user guide [https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter], 'f1' scores binary targets only; for multi-class problems 'f1_macro' (the unweighted mean of the per-class F1 scores) is a common choice
cv=None, # int, cross-validation generator, or an iterable (e.g. K-fold, Stratified K-fold, ...)
n_jobs=-1 # -1 uses all available processors
"""
print(scores_f1_macro)
out:
[0.46670152 0.47061916 0.4692159 0.46991753 0.47243059]
- Two important parameters in detail: cv and scoring
- The cv parameter: an int, a cross-validation generator, or an iterable; see the cross-validation iterators section below
1. cv: int, cross-validation generator or an iterable
1.1 int: the number of folds
1.2 cv also accepts the CV iterators that ship with sklearn:
1.2.1 K-fold
1.2.2 Stratified k-fold
1.2.3 Label k-fold (named GroupKFold in sklearn.model_selection)
1.2.4 Leave-One-Out (LOO)
1.2.5 Leave-P-Out (LPO)
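What an integer cv turns into can be inspected with the check_cv helper in sklearn.model_selection. The sketch below assumes toy targets (y_cls and y_reg are made-up names and values); it only shows which splitter class is selected, not how cross_val_score uses it internally.

```python
import numpy as np
from sklearn.model_selection import check_cv, KFold, StratifiedKFold

y_cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # toy class labels
y_reg = np.linspace(0.0, 1.0, 8)            # toy continuous target

# For a classifier with binary or multi-class labels, an int becomes StratifiedKFold
cv_for_classifier = check_cv(cv=4, y=y_cls, classifier=True)
# Otherwise an int becomes plain KFold
cv_for_regressor = check_cv(cv=4, y=y_reg, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

Passing an explicit splitter object instead of an int is mainly useful when shuffling, a fixed random_state, or group-aware splitting is needed.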
- The scoring parameter (usable in every sklearn API that takes a scoring argument, not just cross_val_score); see the official documentation
'accuracy',
'adjusted_mutual_info_score',
'adjusted_rand_score',
'average_precision',
'completeness_score',
'explained_variance',
'f1',
'f1_macro', # unweighted mean over classes (multi-class)
'f1_micro',
'f1_samples',
'f1_weighted',
'fowlkes_mallows_score',
'homogeneity_score',
'mutual_info_score',
'neg_log_loss',
'neg_mean_absolute_error',
'neg_mean_squared_error',
'neg_mean_squared_log_error',
'neg_median_absolute_error',
'normalized_mutual_info_score',
'precision',
'precision_macro', # unweighted mean over classes (multi-class)
'precision_micro',
'precision_samples',
'precision_weighted',
'r2',
'recall',
'recall_macro', # unweighted mean over classes (multi-class)
'recall_micro',
'recall_samples',
'recall_weighted',
'roc_auc',
'v_measure_score'
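A scoring string is shorthand for a scorer object, which can be fetched with sklearn.metrics.get_scorer. The sketch below, on synthetic data with a LogisticRegression chosen only for speed (both are assumptions, not part of the original example), suggests that 'f1_macro' matches calling f1_score with average='macro' on the model's predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, get_scorer

# Synthetic 3-class data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# A scorer is called as scorer(estimator, X, y) and predicts internally
scorer = get_scorer('f1_macro')
score_from_scorer = scorer(clf, X_demo, y_demo)

# The same number computed directly from the predictions
score_direct = f1_score(y_demo, clf.predict(X_demo), average='macro')

print(score_from_scorer, score_direct)
```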
1.1.2 Cross-validation iterators (commonly K-fold and StratifiedKFold)
K-fold usage
from sklearn.model_selection import KFold
datas = np.array(list(range(2,10)))
y = np.array([0,0,0,0,1,1,1,1])
print(datas)
out:
[2 3 4 5 6 7 8 9]
kf = KFold(n_splits=4)
for train_index,test_index in kf.split(datas,y):
    print('train indices: %s  test indices: %s' %(train_index,test_index))
    x_train,x_test = datas[train_index],datas[test_index]
    y_train,y_test = y[train_index],y[test_index]
out:
train indices: [2 3 4 5 6 7]  test indices: [0 1]
train indices: [0 1 4 5 6 7]  test indices: [2 3]
train indices: [0 1 2 3 6 7]  test indices: [4 5]
train indices: [0 1 2 3 4 5]  test indices: [6 7]
# Pass the K-fold splitter as the cv argument of cross_val_score
from sklearn.model_selection import KFold,cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
kf = KFold(n_splits=4)
gbc = GradientBoostingClassifier(learning_rate=0.01,n_estimators=50,max_depth=2)
scores_f1_macro = cross_val_score(estimator=gbc,X=x_data,y=y_new,scoring='f1_macro',cv=kf,n_jobs=-1)
print(scores_f1_macro)
out:
[0.4682921 0.47002771 0.47175635 0.46945807]
StratifiedKFold
from sklearn.model_selection import StratifiedKFold
datas = np.array(list(range(2,10)))
y = np.array([0,0,0,0,1,1,1,1])
print(datas)
out:
[2 3 4 5 6 7 8 9]
skf = StratifiedKFold(n_splits=4)
for train_index,test_index in skf.split(datas,y):
    print('train indices: %s  test indices: %s' %(train_index,test_index))
    x_train,x_test = datas[train_index],datas[test_index]
    y_train,y_test = y[train_index],y[test_index]
out:
train indices: [1 2 3 5 6 7]  test indices: [0 4]
train indices: [0 2 3 4 6 7]  test indices: [1 5]
train indices: [0 1 3 4 5 7]  test indices: [2 6]
train indices: [0 1 2 4 5 6]  test indices: [3 7]
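The difference between the two splitters matters most on imbalanced, label-sorted data. The following sketch counts the positives that land in each test fold; the 12-to-4 toy labels and the helper name positives_per_test_fold are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced, label-sorted toy data: 12 negatives followed by 4 positives
y_imb = np.array([0] * 12 + [1] * 4)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

def positives_per_test_fold(splitter):
    """Count the positive samples in each test fold of the splitter."""
    return [int(y_imb[test_idx].sum())
            for _, test_idx in splitter.split(X_imb, y_imb)]

kfold_counts = positives_per_test_fold(KFold(n_splits=4))
stratified_counts = positives_per_test_fold(StratifiedKFold(n_splits=4))

print('KFold:          ', kfold_counts)
print('StratifiedKFold:', stratified_counts)
```

On these toy labels, plain KFold puts all four positives in the last test fold, while StratifiedKFold gives each test fold exactly one. Metrics such as F1 are unstable or undefined on a fold with no positives.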
1.2 Grid Search: finding a model's best parameters (includes cross-validation)
1.2.1 Important Grid Search parameters
A search consists of:
- an estimator (regressor or classifier such as sklearn.svm.SVC());
- a parameter space;
- a method for searching or sampling candidates;
- a cross-validation scheme; and
- a score function.
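class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None,
iid='deprecated',
refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
"""
Parameters:
estimator: the model to tune
param_grid: parameter dictionary (each key is a parameter name to tune, each value is the list of candidate values to try)
scoring: the evaluation metric (classifiers default to accuracy; 'f1', 'roc_auc', etc. can be used instead)
cv: number of cross-validation folds (default 5, typically 5 to 10). A KFold or StratifiedKFold iterator can be passed too; for a classifier, passing an integer already uses StratifiedKFold
n_jobs: number of jobs to run in parallel; the default None means 1 (-1 is recommended, to use all processors)
verbose: how much of the fitting progress to print (e.g. 0, 1, 2 or 3; the larger the number, the more detail. Some models, such as a Perceptron, have no training progress to report)
iid: whether the samples are assumed to be independent and identically distributed; deprecated in scikit-learn 0.22 (hence the 'deprecated' placeholder in the signature above) and removed in 0.24, so it is best left unset
refit: whether to refit the best parameter setting on the whole training set; default True, which lets the GridSearchCV instance be used directly for predict
error_score: the score assigned when fitting a parameter combination raises an error; default nan. Pass 'raise' to raise the error instead
"""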
1.2.2 Grid Search: a worked cross-validation example
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
gbc = GradientBoostingClassifier(learning_rate=0.01,n_estimators=150,max_depth=4)
param_grid = {'learning_rate':[0.1,1,0.01],
'n_estimators':[20,50,100],
'max_depth':[2,3,4],}
gbc_cv = GridSearchCV(estimator=gbc, param_grid=param_grid, scoring='f1_macro', n_jobs=-1, cv=5, verbose=1)  # iid is omitted: it was removed in scikit-learn 0.24
gbc_cv.fit(x_data,y_new)
best_score = gbc_cv.best_score_
print(best_score) # mean cross-validated score of the best parameter combination
out:
0.809569551196101
best_params = gbc_cv.best_params_
print(best_params) # the best parameters
out:
{'learning_rate': 1, 'max_depth': 3, 'n_estimators': 50}
gbc_model = gbc_cv.best_estimator_
print(gbc_model) # the best estimator, refit on the full data
out:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
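Beyond best_score_ and best_params_, the fitted search object also exposes the full results table. The sketch below uses synthetic data and a small DecisionTreeClassifier grid purely to keep it fast; the data, estimator and grid are all assumptions, not the adult-income setup above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 2, 3], 'min_samples_leaf': [1, 5]},
    scoring='f1_macro',
    cv=3,
)
search.fit(X_demo, y_demo)

# One entry per parameter combination: 3 * 2 = 6 candidates
n_candidates = len(search.cv_results_['params'])
# best_score_ is the mean test-fold score of the best candidate
best_mean = search.cv_results_['mean_test_score'][search.best_index_]
# With refit=True (the default) the search object can predict directly
preds = search.predict(X_demo[:5])

print(n_candidates, best_mean, search.best_params_)
```

Because best_score_ is an average over held-out folds, it is a validation score rather than a training score, and it is usually lower than the score of the refit model on the data it was trained on.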
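II. Evaluation metrics
For details, see the scoring metrics listed in the sklearn documentation:
3.3. Metrics and scoring: quantifying the quality of predictions (scikit-learn 1.4.1 documentation)
The scoring metrics available for classification tasks, for example, are listed on that page.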
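2.1 Macro-averaged F1 vs micro-averaged F1 (for single-label multi-class problems, micro-averaged precision == recall == F1 == accuracy): how micro and macro differ in f1_score / precision_score / recall_score
1. macro: use the confusion matrix to compute the score of each class separately (treating all other classes as negatives), then take the unweighted mean (equal-weight mean pooling of the per-class F1 scores).
2. micro: sum TP, FN and FP over all classes from the confusion matrix, then compute F1 from the totals. In single-label multi-class classification with every class included, each misclassified sample counts as exactly one FP and one FN, so the micro average is computed the same way as accuracy and micro precision == micro recall == micro F1 == accuracy.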
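The two averaging rules above can be reproduced by hand from a confusion matrix. The sketch below reuses the 14-sample toy labels from the multi-class example later in this section and compares the manual numbers with sklearn's f1_score; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
y_pred = [1, 1, 1, 3, 3, 2, 2, 3, 3, 3, 4, 3, 4, 3]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp  # predicted as class k, but actually another class
fn = cm.sum(axis=1) - tp  # actually class k, but predicted as another class

# macro: per-class F1 first, then an unweighted mean
per_class_f1 = 2 * tp / (2 * tp + fp + fn)
macro_f1 = per_class_f1.mean()

# micro: pool TP / FP / FN over all classes first, then a single F1
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())

print(macro_f1, f1_score(y_true, y_pred, average='macro'))
print(micro_f1, f1_score(y_true, y_pred, average='micro'),
      accuracy_score(y_true, y_pred))
```

Since the pooled FP and FN totals are both equal to the number of misclassified samples, the micro F1 reduces to correct / total, which is why it coincides with accuracy here.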
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
gbc = GradientBoostingClassifier(learning_rate=1,n_estimators=50,max_depth=3)
gbc.fit(x_data,y_new)
y_pre = gbc.predict(x_data)
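f1_score_1 = f1_score(y_true=y_new,y_pred=y_pre,average='macro')  # unweighted mean (average pooling) of the per-class F1 scores
"""
y_true,
y_pred,
labels=None,
pos_label=1, # which class counts as the positive one; only used with average='binary', where its score is the one returned
average='binary', # 'binary' for binary problems; 'macro', 'micro', ... for multi-class problems
sample_weight=None, # optional per-sample weights; see the official documentation
zero_division="warn"
"""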
print(f1_score_1)
out:
0.8322981046559539
f1_score_2 = f1_score(y_true=y_new,y_pred=y_pre,average='micro')
print(f1_score_2)
out:
0.8835416602684194
f1_score_3 = f1_score(y_true=y_new,y_pred=y_pre,average=None) # returns the F1 score of every class
print(f1_score_3)
out:
[0.925 0.73959621]
f1_score_4 = f1_score(y_true=y_new,y_pred=y_pre,average="binary") # average defaults to 'binary', so by default f1_score only handles binary problems; it returns the F1 of the positive class, which is chosen by pos_label (default: class 1)
print(f1_score_4)
out:
0.7395962093119078
from sklearn import metrics
y_test = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
y_predict = [1, 1, 1, 3, 3, 2, 2, 3, 3, 3, 4, 3, 4, 3]
print('Accuracy:', metrics.accuracy_score(y_test, y_predict))
print('Macro precision:', metrics.precision_score(y_test, y_predict, average='macro'))
print('Micro precision:', metrics.precision_score(y_test, y_predict, average='micro'))
print('Weighted precision:', metrics.precision_score(y_test, y_predict, average='weighted'))
print('Macro recall:', metrics.recall_score(y_test, y_predict, average='macro'))
print('Micro recall:', metrics.recall_score(y_test, y_predict, average='micro'))
print('Weighted recall:', metrics.recall_score(y_test, y_predict, average='weighted'))  # was average='micro'; weighted recall also equals accuracy, so the value is unchanged
print('Macro F1-score:', metrics.f1_score(y_test, y_predict, labels=[1,2,3,4], average='macro'))
print('Micro F1-score:', metrics.f1_score(y_test, y_predict, labels=[1,2,3,4], average='micro'))
print('Weighted F1-score:', metrics.f1_score(y_test, y_predict, labels=[1,2,3,4], average='weighted'))
print('Confusion matrix:\n', metrics.confusion_matrix(y_test, y_predict, labels=[1,2,3,4]))
print('Classification report:\n', metrics.classification_report(y_test, y_predict, labels=[1,2,3,4]))
Output:
Accuracy: 0.571428571429
Macro precision: 0.696428571429
Micro precision: 0.571428571429
Weighted precision: 0.775510204082
Macro recall: 0.566666666667
Micro recall: 0.571428571429
Weighted recall: 0.571428571429
Macro F1-score: 0.579166666667
Micro F1-score: 0.571428571429
Weighted F1-score: 0.615476190476
Confusion matrix:
[[3 0 2 0]
[0 2 2 0]
[0 0 2 1]
[0 0 1 1]]
Classification report:
precision recall f1-score support
1 1.00 0.60 0.75 5
2 1.00 0.50 0.67 4
3 0.29 0.67 0.40 3
4 0.50 0.50 0.50 2
avg / total 0.78 0.57 0.62 14
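2.2 Binary ROC curve and AUC, and multi-class ROC curve and AUC (multi-class ROC comes in macro and micro variants): an introduction to ROC and Python implementations for the binary and multi-class cases
- ROC curve and AUC for a binary problem; by default class 1 is treated as the positive class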
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc  # compute the ROC curve and AUC
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Reduce to a binary problem (keep classes 0 and 1)
X, y = X[y != 2], y[y != 2]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Fit a linear SVM (named clf so it does not shadow the imported svm module)
clf = svm.SVC(kernel='linear', probability=True, random_state=random_state)
# The scores returned by decision_function() are what roc_curve() expects as y_score
y_score = clf.fit(X_train, y_train).decision_function(X_test)
# Compute the ROC curve and the area under it
fpr, tpr, threshold = roc_curve(y_test, y_score)  # false positive rate and true positive rate
roc_auc = auc(fpr, tpr)  # AUC value
lw = 2
plt.figure(figsize=(10,10))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # x-axis: false positive rate, y-axis: true positive rate
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
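The AUC from the roc_curve plus auc pair can be cross-checked against roc_auc_score, and the role of the positive class can be made explicit. This is a sketch on synthetic data with a LogisticRegression, both chosen only to keep it small and fast (they are assumptions, not the SVM setup above).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)  # larger score = more likely class 1

# Route 1: build the curve, then integrate it
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the area in one call
auc_direct = roc_auc_score(y_te, scores)

# Treating class 0 as positive requires scores that are large for class 0,
# so the sign is flipped; the area stays the same
fpr0, tpr0, _ = roc_curve(y_te, -scores, pos_label=0)
auc_class0 = auc(fpr0, tpr0)

print(auc_from_curve, auc_direct, auc_class0)
```

If pos_label is changed without flipping the scores, the result is 1 minus the AUC, which is a common source of suspiciously low AUC values.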
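- ROC curve and AUC for multi-class problems: micro ROC pools all classes into a single averaged ROC curve directly, whereas computing macro ROC also yields a per-class ROC curve and AUC along the way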
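# # Plot the ROC curves # #
# y_test_cls, y_logits_cls, file_path and config are assumed to be defined earlier in the author's project
import os
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
y_test_one_hot = label_binarize(y_test_cls, classes=np.arange(3))  # binarize the labels (one-hot)
y_predict_one_hot = y_logits_cls  # score matrix, e.g. the confidence matrix from .decision_function(X_test)
plt.figure()
# Plot settings: a font with CJK glyphs, and a proper minus sign
mpl.rcParams['font.sans-serif'] = u'SimHei'
mpl.rcParams['axes.unicode_minus'] = False
# FPR is the x-axis, TPR is the y-axis
# Compute the ROC
fpr_dict, tpr_dict, roc_auc = dict(), dict(), dict()
for i in range(3):  # false positive rate (fpr) and true positive rate (tpr) for each label
    fpr_dict[i], tpr_dict[i], _ = roc_curve(y_test_one_hot[:, i], y_predict_one_hot[:, i])
    roc_auc[i] = auc(fpr_dict[i], tpr_dict[i])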
# Two ways to draw the averaged curve:
# Method 1 (micro): flatten the binarized labels, e.g. [[0,0,1],[0,1,0]] -> [0,0,1,0,1,0], and solve it as one binary problem
fpr_dict["micro"], tpr_dict["micro"], _ = roc_curve(y_test_one_hot.ravel(),
                                                    y_predict_one_hot.ravel())
roc_auc["micro"] = auc(fpr_dict["micro"], tpr_dict["micro"])
# Method 2 (macro): interpolate each label's tpr on a common fpr grid, sum, and divide by the number of classes to get the averaged ROC curve
n_classes = 3
# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr_dict[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at these points (np.interp replaces the deprecated scipy.interp alias)
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr_dict[i], tpr_dict[i])
# Finally average it and compute AUC
mean_tpr /= n_classes
fpr_dict["macro"] = all_fpr
tpr_dict["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr_dict["macro"], tpr_dict["macro"])
print(roc_auc)
# Draw the micro-averaged curve on the current figure and save it to disk
lw = 2
# plt.plot(fpr_dict[2], tpr_dict[2], color='darkorange',  # ROC curve for a single class (class 2)
#          lw=lw, label='ROC curve (area = %0.3f)' % roc_auc["micro"])
plt.plot(fpr_dict["micro"], tpr_dict["micro"], color='darkorange',
         lw=lw, label='ROC curve (area = %0.3f)' % roc_auc["micro"])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.title(u'text_rnn ROC and AUC', fontsize=17)  # the earlier duplicate plt.title call was overridden by this one, so it is dropped
path = os.path.join(file_path, "img")
os.makedirs(path, exist_ok=True)  # create the output directory if it does not exist
plt.savefig(os.path.join(path, "{}ROC_and_AUC.png".format("model_" + str(config.model_num) + "_")))
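When only the AUC numbers are needed, not the curves, roc_auc_score can produce the per-class, micro and macro values without any manual interpolation. The sketch below is self-contained on the iris data with a LogisticRegression; both choices are assumptions made so it runs on its own, and it is not the text_rnn model used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0, stratify=y)

# Class probabilities, one column per class
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)
y_te_one_hot = label_binarize(y_te, classes=[0, 1, 2])

# One-vs-rest AUC per class, then the unweighted mean (macro)
per_class_auc = [roc_auc_score(y_te_one_hot[:, k], proba[:, k]) for k in range(3)]
macro_auc = float(np.mean(per_class_auc))

# Micro: flatten the indicator matrix and the scores into one binary problem
micro_auc = roc_auc_score(y_te_one_hot.ravel(), proba.ravel())

# Built-in macro one-vs-rest AUC (needs scores that sum to 1 per row)
macro_auc_builtin = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')

print(per_class_auc, macro_auc, micro_auc)
```

The macro value here is the mean of the per-class areas; the area under an interpolated, averaged macro curve may differ slightly from it, so the two should not be expected to match to the last digit.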