Datawhale Ensemble Learning Task06: Evaluating Classification Models and Tuning Hyperparameters

Andrew_zjc

Article link: https://blog.csdn.net/Andrew_zjc/article/details/115303394

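Hyperparameter tuning here mainly means GridSearchCV and RandomizedSearchCV. The previous task involved little code, so I have folded it into this write-up. As a recap: the grid search and the randomized search share the same param_range and param_grid (RandomizedSearchCV receives the grid through its param_distributions argument), and the rest closely mirrors the regression case: call fit first, then read best_score_ and best_params_.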
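The scripts in this post only call GridSearchCV, so here is a minimal sketch of the randomized counterpart. It assumes scikit-learn's bundled load_breast_cancer copy of the Wisconsin diagnostic data as a stand-in for the download, and the values n_iter=10 and cv=5 are illustrative choices, not the author's settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the wdbc.data download.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))

# Same candidate values as the grid search; with plain lists,
# RandomizedSearchCV samples combinations from this grid without replacement.
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_distributions = {
    "svc__C": param_range,
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": param_range,
}

rs = RandomizedSearchCV(
    estimator=pipe_svc,
    param_distributions=param_distributions,
    n_iter=10,            # try 10 of the 128 combinations (illustrative)
    scoring="accuracy",
    cv=5,
    random_state=1,
)
rs.fit(X_train, y_train)

print("best CV score:", rs.best_score_)
print("best params:", rs.best_params_)
```

The only differences from the grid search are the param_distributions argument and n_iter, which caps how many combinations are evaluated.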
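This section consists of two hands-on exercises: one on the breast cancer dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data, and one on sklearn's face recognition dataset, fetch_lfw_people.
As usual, code comes first. Copy each of the two scripts below into a text editor, save it with a .py extension, and run it with the Python interpreter.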

breast-cancer-wisconsin:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix,roc_curve,auc,make_scorer,f1_score

# Load the WDBC data: column 0 is the ID, column 1 the diagnosis (M/B), columns 2+ the features
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
X = df.iloc[:,2:]
y = df.iloc[:,1]
# Encode the labels as integers: B -> 0, M -> 1
y = LabelEncoder().fit_transform(y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)
# Standardize the features, then fit an SVC with default hyperparameters
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
y_pred = pipe_svc.fit(X_train,y_train).predict(X_test)
# Plot the confusion matrix (rows: true labels, columns: predicted labels)
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat,cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j,y=i,s=confmat[i,j],va='center',ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.title('confusion matrix')

# Grid search over C, kernel and gamma, scored by F1 with class 0 (benign) as the positive class
scorer = make_scorer(f1_score,pos_label=0)
param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{"svc__C":param_range,"svc__kernel":["linear","rbf"],"svc__gamma":param_range}]
gs = GridSearchCV(estimator=pipe_svc,param_grid=param_grid,scoring=scorer,cv=10)
# The ROC curve needs continuous scores, so use decision_function rather than predict.
# Note that roc_curve treats class 1 (malignant) as the positive class here.
y_score = gs.fit(X_train,y_train).decision_function(X_test)
fpr,tpr,threshold = roc_curve(y_test,y_score)
roc_auc = auc(fpr,tpr)
plt.figure(figsize=(7,5))
plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (area = %0.2f)' % roc_auc)
# The diagonal is the ROC curve of a random classifier
plt.plot([0,1],[0,1],color='navy',lw=2,linestyle='--')
plt.xlim([-0.05,1.0])
plt.ylim([-0.05,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc='lower right')
plt.show()

Results:
[Figure: confusion matrix]

[Figure: ROC curve]
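To connect the confusion matrix to the F1 scorer used in the grid search, the sketch below derives precision, recall and F1 for a chosen positive class directly from the matrix. LabelEncoder sorts the labels, so B becomes 0 and M becomes 1, which means pos_label=0 scores the benign class. The function name prf_from_confusion and the counts are hypothetical, chosen only for illustration; they are not the results of the script above.

```python
def prf_from_confusion(confmat, pos_label):
    """Return (precision, recall, F1) for one class of a square confusion matrix.

    Rows are true labels and columns are predicted labels, which is the
    layout returned by sklearn.metrics.confusion_matrix.
    """
    n = len(confmat)
    tp = confmat[pos_label][pos_label]
    # Column sum minus the diagonal: predicted positive, actually another class.
    fp = sum(confmat[i][pos_label] for i in range(n)) - tp
    # Row sum minus the diagonal: actually positive, predicted as another class.
    fn = sum(confmat[pos_label]) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only.
confmat = [[70, 2],
           [3, 39]]

for label in (0, 1):
    p, r, f1 = prf_from_confusion(confmat, pos_label=label)
    print("pos_label=%d  precision=%.4f  recall=%.4f  f1=%.4f" % (label, p, r, f1))
```

The two classes give different F1 values from the same matrix, which is why the choice of pos_label in make_scorer matters.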

fetch_lfw_people:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Keep only people with at least 70 images and downscale each image by a factor of 0.4
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
n_samples, h, w = lfw_people.images.shape
# The flattened pixel values are the features; the person's index is the label
X = lfw_people.data
n_features = X.shape[1]
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Fit PCA on the training set only, then project both sets onto the first 150 components
n_components=150
pca = PCA(n_components=n_components,svd_solver='randomized',whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Grid search over C and gamma for an RBF-kernel SVC, using GridSearchCV's default cross-validation
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("clf.best_estimator_:",clf.best_estimator_)

# Evaluate the refitted best model on the held-out test set
y_pred = clf.predict(X_test_pca)
print("classification_report:",classification_report(y_test, y_pred, target_names=target_names))
print("confusion_matrix:",confusion_matrix(y_test, y_pred, labels=range(n_classes)))

import matplotlib.pyplot as plt
# Each principal component, reshaped back to image size, is an "eigenface"
eigenfaces = pca.components_.reshape((n_components, h, w))
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Show the first n_row * n_col images in a grid, each with its title."""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
def title(y_pred, y_test, target_names, i):
    """Build a 'predicted / true' caption from the surnames of sample i."""
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)
# Gallery 1: test images with predicted and true names
prediction_titles = [title(y_pred, y_test, target_names, i) for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# Gallery 2: the leading eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
plt.show()

Results:
[Figures: output of the script above, including the prediction gallery and the eigenface gallery]
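I was surprised to find that, in terms of code alone, a seemingly arcane machine learning method such as SVM takes only a line or two, while the matplotlib code takes far more. It goes to show that a picture is worth a thousand words, haha. When I have time, I should work through Datawhale's fantastic_matplotlib properly.
The last dataset, fetch_lfw_people, is the homework exercise. I was eager to try it myself, but the link led to the sklearn website and its official example code, so I read that code through with admiration. The official example uses PCA for dimensionality reduction, since an image has far too many pixels to use directly. CNNs are of course the more popular choice for image recognition, but I find SVC more lightweight, and it works well too.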
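As a variation on the PCA-then-SVC approach, the sketch below wraps both steps in one pipeline so that GridSearchCV refits PCA inside every cross-validation fold instead of once up front. This is not the official example's code. It assumes the small bundled load_digits dataset as a stand-in for the face images (no download needed), and n_components=30 and the grid values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumption: 8x8 digit images (64 pixel features) stand in for the LFW faces.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# PCA and SVC in one estimator: PCA is fitted only on the training part
# of each fold, so the validation part never influences the projection.
pipe = make_pipeline(
    PCA(n_components=30, svd_solver="randomized", whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced"),
)

# Step names are the lowercased class names, hence the "svc__" prefix.
param_grid = {
    "svc__C": [1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1],
}

clf = GridSearchCV(pipe, param_grid, cv=5)
clf.fit(X_train, y_train)

print("best params:", clf.best_params_)
print("test accuracy:", clf.score(X_test, y_test))
```

With the pipeline, predict and score accept raw pixel features, so there is no separate X_test_pca to keep in sync.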
