20210329_Cohort 23_Ensemble Learning (Part 1)_Task06_Evaluating Classification Models and Hyperparameter Tuning

余柳成荫

Original source: https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning

6. Evaluating Classification Models and Hyperparameter Tuning

1 Example: Performance Evaluation and Tuning

1.1 Grid Search

Dataset

from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature = iris.feature_names
data = pd.DataFrame(X,columns=feature)
data['target'] = y

# Hyperparameter tuning with grid search:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import time
from sklearn.pipeline import make_pipeline  # a pipeline chains preprocessing and the model
from sklearn.preprocessing import StandardScaler    # SVC works best on standardized features

start_time = time.time()  # start time
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]  # candidate values
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},  # linear kernel
              {'svc__C':param_range,
               'svc__gamma':param_range,'svc__kernel':['rbf']}]  # RBF kernel

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',       # classification metric: accuracy
                  cv=10,n_jobs=-1)
gs = gs.fit(X,y)
end_time = time.time()  # end time
print("Grid search elapsed time: %.3f s" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)

Grid search elapsed time: 6.229 s
0.9800000000000001
{'svc__C': 1.0, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
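To get a feel for how much work the grid search above does, the candidates can be enumerated without fitting anything. The sketch below rebuilds the same grid and counts it with `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example above
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [
    {'svc__C': param_range, 'svc__kernel': ['linear']},
    {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']},
]

# ParameterGrid expands each dict into its Cartesian product and concatenates them
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)   # 8 linear + 8 * 8 RBF combinations
n_fits = n_candidates * 10       # each candidate is fitted once per fold (cv=10)

print(n_candidates, n_fits)      # 72 720
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the full data, so the total cost grows multiplicatively with every parameter added to the grid.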

1.2 Randomized Search

# Approach 2: randomized search with RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import time

start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{'svc__C':param_range,'svc__kernel':['linear']},
              {'svc__C':param_range,'svc__gamma':param_range,
               'svc__kernel':['rbf']}]
'''
# param_grid = [{'svc__C':param_range,          # an alternative way to write it (gamma is ignored by the linear kernel)
               'svc__kernel':['linear','rbf'],
               'svc__gamma':param_range}]
'''
# n_iter defaults to 10, so only 10 sampled candidates are evaluated
gs = RandomizedSearchCV(estimator=pipe_svc, 
                        param_distributions=param_grid,
                        scoring='accuracy',cv=10,n_jobs=-1)
gs = gs.fit(X,y)
end_time = time.time()
print("Randomized search elapsed time: %.3f s" % float(end_time-start_time))
print(gs.best_score_)
print(gs.best_params_)

Randomized search elapsed time: 0.558 s
0.9800000000000001
{'svc__kernel': 'rbf', 'svc__gamma': 0.1, 'svc__C': 1.0}
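The speed-up comes from evaluating only a sample of the grid. As a rough sketch of what gets sampled, `ParameterSampler` can draw candidates from the same search space without fitting any model. The `random_state=0` seed is an assumption added here for reproducibility; the run above does not set one, so its sampled candidates (and possibly its best parameters) can differ between runs.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [
    {'svc__C': param_range, 'svc__kernel': ['linear']},
    {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']},
]

# Draw 10 candidates, matching the default n_iter of RandomizedSearchCV
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
full_grid = list(ParameterGrid(param_grid))

for params in sampled:
    print(params)

# 10 candidates * 10 folds, compared with len(full_grid) * 10 for the exhaustive search
print(len(sampled) * 10, "fits instead of", len(full_grid) * 10)
```

Randomized search trades coverage for speed: it is not guaranteed to visit the best grid point, so matching the grid search result should be read as a favorable draw rather than a general property.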

1.3 Confusion Matrix

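Label encoding with LabelEncoder
Purpose: LabelEncoder() converts labels into consecutive integer codes, i.e., it assigns a number to each distinct value among non-consecutive numbers or text strings.
For example, given [dog,cat,dog,mouse,cat], we might encode it as [1,2,1,3,2]. This produces an odd side effect: the average of dog and mouse is cat. Because the integer codes imply an order and spacing that the categories do not have, label encoding is generally avoided for nominal input features, although it is well suited to encoding target labels.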
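A minimal sketch of the animal example with scikit-learn's `LabelEncoder`. Note that the encoder numbers the classes from 0 in sorted order, so the actual codes differ from the illustrative 1-based numbering in the text, but the artificial ordering problem is the same.

```python
from sklearn.preprocessing import LabelEncoder

animals = ['dog', 'cat', 'dog', 'mouse', 'cat']

le = LabelEncoder()
codes = le.fit_transform(animals)

# Classes are sorted alphabetically and numbered from 0
print(list(le.classes_))   # ['cat', 'dog', 'mouse']
print(codes.tolist())      # [1, 0, 1, 2, 0]

# The mapping is reversible
decoded = le.inverse_transform(codes)
print(list(decoded))       # ['dog', 'cat', 'dog', 'mouse', 'cat']
```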
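Confusion matrix with confusion_matrix():
sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
Parameters:
y_true: the true class labels of the samples
y_pred: the predicted class labels of the samples
labels: the list of class labels used to index the matrix; it can be used to select a subset of classes or to reorder them
sample_weight: sample weights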
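The parameters above can be tried on a small hand-made example. The label arrays below are toy data invented for illustration; rows of the result correspond to true labels and columns to predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]

# Default: classes in sorted order (0, then 1)
cm_default = confusion_matrix(y_true, y_pred)
print(cm_default)
# [[2 0]
#  [1 3]]

# labels reorders the rows and columns (here: 1 first, then 0)
cm_reordered = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_reordered)
# [[3 1]
#  [0 2]]
```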

# Confusion matrix:
# Load the data
%matplotlib inline
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
'''
Breast cancer dataset: 569 samples of malignant and benign tumor cells; M = malignant, B = benign
'''
# Basic preprocessing
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

X = df.iloc[:,2:].values
y = df.iloc[:,1].values
le = LabelEncoder()         # encode the strings M/B as the integers 1/0
y = le.fit_transform(y)
le.transform(['M','B'])     # array([1, 0]): M -> 1, B -> 0
# Split the data 80:20
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(
                                                 X,y,test_size=0.2,
                                                 stratify=y,random_state=1)
from sklearn.svm import SVC
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
from sklearn.metrics import confusion_matrix

pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test,
                        y_pred=y_pred)
# Draw the matrix and write each count into its cell
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat, cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j,y=i,s=confmat[i,j],
                va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

[Figure: confusion matrix of the SVC predictions on the test set]
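The four cells of a binary confusion matrix are enough to derive the usual classification metrics. The sketch below uses toy labels (invented for illustration, not the breast cancer results) and checks the hand-computed values against scikit-learn's own metric functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels, invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels 0/1, ravel() yields the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives are found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)                        # 3 1 1 3
print(accuracy, precision, recall, f1)       # 0.75 0.75 0.75 0.75
```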

1.4 ROC

# Plot the ROC curve:
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import make_scorer,f1_score
scorer = make_scorer(f1_score,pos_label=0)
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
# Use continuous decision scores rather than hard class predictions
y_pred = gs.fit(X_train,y_train).decision_function(X_test)
#y_pred = gs.predict(X_test)
fpr,tpr,threshold = roc_curve(y_test, y_pred)  # compute the false positive rate and true positive rate
roc_auc = auc(fpr,tpr)  # compute the AUC
lw = 2
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')  # chance-level diagonal
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve]
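Each point on a ROC curve is the (FPR, TPR) pair obtained at one decision threshold. The sketch below uses four toy samples (invented for illustration) to compute one such point by hand and to compare it with what `roc_curve` returns.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and classifier scores, invented for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)                      # area under the piecewise-linear curve
print(roc_auc, roc_auc_score(y_true, scores))

# One ROC point by hand: predict positive when score >= 0.4
pred = (scores >= 0.4).astype(int)
tp = int(np.sum((pred == 1) & (y_true == 1)))
fn = int(np.sum((pred == 0) & (y_true == 1)))
fp = int(np.sum((pred == 1) & (y_true == 0)))
tn = int(np.sum((pred == 0) & (y_true == 0)))

manual_tpr = tp / (tp + fn)                  # true positive rate (recall)
manual_fpr = fp / (fp + tn)                  # false positive rate
print(manual_fpr, manual_tpr)                # 0.5 0.5
```

Sweeping the threshold from high to low traces the whole curve, which is why the ROC code passes continuous scores from `decision_function` rather than hard class predictions.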

1.5 Exercise

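Notes

First download the dataset and place it in the C:\Users\Name\scikit_learn_data\lfw_home\ folder.
The cross_validation module no longer exists; use from sklearn.model_selection import train_test_split instead.
Likewise, change from sklearn.grid_search import GridSearchCV to
from sklearn.model_selection import GridSearchCV
What make_pipeline does:
make_pipeline chains several steps together, combining multiple estimators into a single estimator; for example, it can bundle feature extraction, normalization, and classification into a typical machine learning workflow.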
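A short sketch of how `make_pipeline` names its steps, which is where parameter names such as `svc__C` in the search grids come from: each step is named after its lowercased class name, and `<step>__<parameter>` addresses a parameter of that step.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC(random_state=1))

# Step names are generated automatically from the class names
step_names = [name for name, _ in pipe.steps]
print(step_names)                          # ['standardscaler', 'svc']

# '<step>__<parameter>' reaches into a step, exactly as in param_grid
pipe.set_params(svc__C=10.0)
print(pipe.named_steps['svc'].C)           # 10.0
print('svc__gamma' in pipe.get_params())   # True
```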

1. Load the dataset

from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

2. Inspect the image data

import matplotlib.pyplot as plt
import seaborn as sns;sns.set()
fig, ax = plt.subplots(3,3)
fig.subplots_adjust(left=0.0625, right=1.2, wspace=1)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])

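[Figure: sample face images from the dataset]

3. PCA dimensionality reduction, SVC classification, pipeline packaging, and grid search optimization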

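# Use preprocessing to extract more meaningful features. Here principal component analysis (PCA) extracts 150 components, which are then fed to a support vector machine classifier.
# Package the preprocessing step and the classifier into a pipeline

from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline   # simplifies building the pipeline
pca = PCA(n_components=150, whiten=True, random_state=42)   # PCA dimensionality reduction
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# To test the classifier, split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)

# Use grid search with cross-validation to find the best parameter combination. Vary C (the penalty on margin violations) and gamma (which controls the width of the RBF kernel) to select the best model

from sklearn.model_selection import GridSearchCV
param_grid = {'svc__C': [1,5,10,50], 
              'svc__gamma':[0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)

grid.fit(x_train, y_train)
print(grid.best_params_)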

{'svc__C': 10, 'svc__gamma': 0.001}
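One sanity check worth doing after a grid search is whether the best values sit on the boundary of the grid, since that suggests the true optimum may lie outside the range tried. The helper below, `params_on_grid_edge`, is a hypothetical function written for this illustration and is not part of scikit-learn.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return names of parameters whose best value is the smallest or largest one tried."""
    on_edge = []
    for name, values in param_grid.items():
        ordered = sorted(values)
        if best_params[name] in (ordered[0], ordered[-1]):
            on_edge.append(name)
    return on_edge


# Grid and result copied from the face recognition example above
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
best_params = {'svc__C': 10, 'svc__gamma': 0.001}

edge_params = params_on_grid_edge(best_params, param_grid)
print(edge_params)   # [] : both best values are interior grid points
```

Here both values fall inside the grid, so there is no immediate sign that the search range needs widening.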

4. Prediction

# Predict on the test set
%matplotlib inline
model = grid.best_estimator_
y_fit = model.predict(x_test)
# Compare the predictions with the true labels
fig, ax = plt.subplots(6, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(x_test[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    # Show the predicted surname; red marks a wrong prediction
    axi.set_ylabel(faces.target_names[y_fit[i]].split()[-1],
                  color='black' if y_fit[i] == y_test[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14)
plt.tight_layout()

[Figure: predicted names, with incorrect labels in red]

5. Classification report

# Print the classification report; it lists per-label statistics, giving a fuller picture of the estimator's performance
from sklearn.metrics import classification_report
print(classification_report(y_test, y_fit, target_names=faces.target_names))

                   precision    recall  f1-score   support

     Ariel Sharon       0.65      0.73      0.69        15
     Colin Powell       0.80      0.87      0.83        68
  Donald Rumsfeld       0.74      0.84      0.79        31
    George W Bush       0.92      0.83      0.88       126
Gerhard Schroeder       0.86      0.83      0.84        23
      Hugo Chavez       0.93      0.70      0.80        20
Junichiro Koizumi       0.92      1.00      0.96        12
       Tony Blair       0.85      0.95      0.90        42

         accuracy                           0.85       337
        macro avg       0.83      0.84      0.84       337
     weighted avg       0.86      0.85      0.85       337
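The two summary rows of the report can be reproduced from the per-class rows. The sketch below copies the recall and support columns from the report above: the macro average is the plain mean over classes, while the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, the results only match the report approximately.

```python
# Recall and support columns copied from the classification report above
recall = [0.73, 0.87, 0.84, 0.83, 0.83, 0.70, 1.00, 0.95]
support = [15, 68, 31, 126, 23, 20, 12, 42]

total = sum(support)

# Macro average: every class counts equally
macro_recall = sum(recall) / len(recall)

# Weighted average: every class counts in proportion to its number of samples
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(total)                        # 337
print(round(macro_recall, 2))       # 0.84
print(round(weighted_recall, 2))    # 0.85
```

The support-weighted recall equals the overall accuracy, which is why both appear as 0.85 in the report.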

6. Confusion matrix

# Plot the confusion matrix; it shows clearly which labels the classifier tends to misclassify
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_fit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')

[Figure: confusion matrix heatmap for the face classifier]

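Entries on the diagonal count correct predictions; off-diagonal entries show which pairs of people (the row and column labels of that cell) the classifier tends to confuse.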
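Reading off the most confused pair can also be done programmatically. The sketch below uses a small made-up 3-class matrix with hypothetical class names, in the native orientation of `confusion_matrix` (rows are true labels, columns are predicted labels), not the transposed layout used for the heatmap.

```python
import numpy as np

# Hypothetical class names and counts, invented for illustration
names = ['A', 'B', 'C']
mat = np.array([[10,  2,  0],
                [ 1, 20,  5],
                [ 0,  3, 15]])

# Blank out the diagonal so that only the errors remain
errors = mat.copy()
np.fill_diagonal(errors, 0)

# Position of the largest off-diagonal count
i, j = np.unravel_index(np.argmax(errors), errors.shape)
print(f"true {names[i]} most often predicted as {names[j]}: {errors[i, j]} times")

# Per-class recall: correct predictions divided by the row total
per_class_recall = np.diag(mat) / mat.sum(axis=1)
print(per_class_recall.round(2))
```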
