Datawhale Ensemble Learning Task06: Evaluating Classification Models and Hyperparameter Tuning

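Hyperparameter tuning mainly comes down to GridSearchCV and RandomizedSearchCV. The previous task had little code, so I wrote it up together with the earlier ones. To recap: GridSearchCV and RandomizedSearchCV shared the same param_range and param_grid, and the rest is very close to the regression case: call fit first, then read best_score_ and best_params_.
This section is mainly two hands-on exercises: one on https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data, and one on sklearn's face recognition dataset, fetch_lfw_people.
As always, code first. Copy each of the two listings below into Notepad, save the file with a .py extension, and run it with the Python interpreter.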
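The recap above (one parameter grid, fit, then best_score_ and best_params_) can be sketched with RandomizedSearchCV, which the listings in this post do not show. This is only a minimal sketch under stated assumptions: it uses sklearn's built-in load_breast_cancer copy of the data so that it runs offline, and n_iter=20, cv=5 and scoring="accuracy" are illustrative choices, not values from the original task.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the built-in dataset stands in for the wdbc.data download.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# The same kind of grid that GridSearchCV would take; RandomizedSearchCV
# samples n_iter combinations from it instead of trying all of them.
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = {
    "svc__C": param_range,
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": param_range,
}

rs = RandomizedSearchCV(
    estimator=pipe_svc,
    param_distributions=param_grid,
    n_iter=20,            # illustrative: 20 of the 128 combinations
    scoring="accuracy",
    cv=5,
    random_state=1,
)
rs.fit(X_train, y_train)

print("best_score_:", rs.best_score_)
print("best_params_:", rs.best_params_)

# The search refits the best pipeline on the whole training set by default,
# so it can be scored directly on the held-out test set.
test_accuracy = rs.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Swapping RandomizedSearchCV for GridSearchCV (and param_distributions for param_grid) would try all 128 combinations instead of a random subset, at a higher cost.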

breast-cancer-wisconsin:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix,roc_curve,auc,make_scorer,f1_score

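# Load WDBC: column 0 is the sample ID, column 1 the diagnosis (M/B), columns 2+ the features
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
X = df.iloc[:,2:]
y = df.iloc[:,1]
# Encode the labels as integers: B -> 0, M -> 1
y = LabelEncoder().fit_transform(y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)
# Standardize the features, fit an SVM, and predict on the test set
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
y_pred = pipe_svc.fit(X_train,y_train).predict(X_test)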
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat,cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
	for j in range(confmat.shape[1]):
		ax.text(x=j,y=i,s=confmat[i,j],va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.title('confusion matrix')

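# Grid search over C, kernel and gamma, scored by F1 with class 0 (benign) as the positive label
scorer = make_scorer(f1_score,pos_label=0)
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{"svc__C":param_range,"svc__kernel":["linear","rbf"],"svc__gamma":param_range}]
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
gs = gs.fit(X_train,y_train)
print("best_score_:",gs.best_score_)
print("best_params_:",gs.best_params_)
# Draw the ROC curve from the best model's decision scores on the test set
y_score = gs.decision_function(X_test)
fpr,tpr,threshold = roc_curve(y_test,y_score)
roc_auc = auc(fpr,tpr)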
plt.figure(figsize=(7,5))
plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0,1],[0,1],color='navy',lw=2,linestyle='--')
plt.xlim([-0.05,1.0])
plt.ylim([-0.05,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')
plt.show()

Results: running the script displays two figures, the confusion matrix and the ROC curve.
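To make the link between the confusion matrix and the F1 score used for scoring more concrete, here is a small sketch that derives precision, recall and F1 by hand from the four cells of a binary confusion matrix. The labels are made up for illustration and are not results from the breast cancer run.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels, purely illustrative (1 is treated as the positive class).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are true labels, columns are predicted labels, so for labels {0, 1}
# ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
print("f1:", f1, "sklearn:", f1_score(y_true, y_pred))
```

Passing pos_label=0 to these sklearn functions, as the make_scorer call in the listing does for f1_score, computes the same quantities with class 0 treated as the positive class.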

fetch_lfw_people:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Keep only people with at least 70 images; resize=0.4 shrinks each image
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
n_samples, h, w = lfw_people.images.shape
X = lfw_people.data
n_features = X.shape[1]
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Fit PCA on the training set only, then project both sets onto 150 components (the eigenfaces)
n_components=150
pca = PCA(n_components=n_components,svd_solver='randomized',whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Grid search over C and gamma for an RBF-kernel SVM on the PCA features
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("clf.best_estimator_:",clf.best_estimator_)

# Evaluate the best model on the test set; sep="\n" keeps the printed tables aligned
y_pred = clf.predict(X_test_pca)
print("classification_report:",classification_report(y_test, y_pred, target_names=target_names),sep="\n")
print("confusion_matrix:",confusion_matrix(y_test, y_pred, labels=range(n_classes)),sep="\n")

import matplotlib.pyplot as plt
eigenfaces = pca.components_.reshape((n_components, h, w))
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
	plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
	plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
	for i in range(n_row * n_col):
		plt.subplot(n_row, n_col, i + 1)
		plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
		plt.title(titles[i], size=12)
		plt.xticks(())
		plt.yticks(())
def title(y_pred, y_test, target_names, i):
	pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
	true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
	return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)
prediction_titles = [title(y_pred, y_test, target_names, i) for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()

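Results: the script prints the best estimator, the classification report and the confusion matrix, then displays two galleries, one of test-set predictions and one of eigenfaces.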
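I was surprised to find that, as far as the code goes, a seemingly arcane machine learning method such as SVM takes only a line or two, while the matplotlib code takes far more. It goes to show that a picture is worth a thousand words, haha. When I have time I will study Datawhale's fantastic_matplotlib properly.
The last dataset, fetch_lfw_people, is the after-class exercise. I was eager to try it myself, but once I followed the link to the sklearn website and its official code, I ended up reading that code through with reverence. The official example uses PCA for dimensionality reduction, since an image has too many pixels. CNNs are of course the more popular choice for image recognition, but I think SVC is more lightweight and works well too.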
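The PCA-then-SVC idea can also be written as a single Pipeline, so that the grid search tunes the number of components together with C and gamma and refits PCA inside each cross-validation fold. This is a variation on the listing above, not the official example's code, and it is only a sketch under assumptions: the small built-in load_digits images stand in for the LFW faces so that it runs quickly and offline, and the grid values and cv=3 are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: 8x8 digit images (64 pixels each) as a stand-in for face images.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# PCA and SVC chained in one estimator; step names prefix the parameter names.
pipe = Pipeline([
    ("pca", PCA(whiten=True, random_state=42)),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

# Illustrative grid: the PCA dimension is tuned along with the SVM parameters.
param_grid = {
    "pca__n_components": [16, 32],
    "svc__C": [1, 10],
    "svc__gamma": [0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print("best_params_:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Because PCA sits inside the pipeline, each fold's validation data never influences the projection, which keeps the cross-validation scores honest.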
