Evaluating a classification model's performance and tuning its hyperparameters:
For more detail, see the author's Zhihu post: https://zhuanlan.zhihu.com/p/1400407
# Import the basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
import seaborn as sns
# Load the iris dataset from sklearn
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature = iris.feature_names
data = pd.DataFrame(X, columns=feature)
data['target'] = y
data.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
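Before tuning anything, it can help to record a baseline score for the untuned pipeline. The following is a minimal sketch of that step (the cross_val_score call is an illustrative addition, not part of the original notebook):

# Baseline sketch (illustrative addition): score the untuned SVC pipeline with 10-fold CV.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

baseline = make_pipeline(StandardScaler(), SVC(random_state=1))
scores = cross_val_score(baseline, X, y, cv=10, scoring='accuracy')
print('baseline accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))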
# Hyperparameter tuning with grid search:
# Approach 1: grid search with GridSearchCV()
# Build a standardize-then-SVC pipeline and tune it over a parameter grid:
from sklearn.pipeline import make_pipeline        # pipeline to simplify the workflow
from sklearn.preprocessing import StandardScaler  # SVMs are distance-based, so standardize the features
from sklearn.model_selection import GridSearchCV  # grid search for hyperparameter tuning
from sklearn.svm import SVC
import time
start_time = time.time()  # start timing
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X, y)
end_time = time.time()  # stop timing
print("Grid search elapsed time: %.3f s" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)
Grid search elapsed time: 3.185 s
0.98
{'svc__C': 1.0, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
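Once the search has finished (with the default refit=True), the best configuration is refit on the full data and exposed as gs.best_estimator_. A small sketch of reusing it follows; note that scoring on the same data used for tuning is only a sanity check, not an unbiased estimate:

# Reuse the refit best model (refit=True is the GridSearchCV default).
best_clf = gs.best_estimator_
print(best_clf.predict(X[:5]))  # predicted classes for the first five samples
print(gs.score(X, y))           # sanity-check accuracy on the tuning data itself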
# Approach 2: randomized search with RandomizedSearchCV()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import time
start_time = time.time()
pipe_svc = make_pipeline(StandardScaler(), SVC(random_state=1))
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear']},
              {'svc__C': param_range, 'svc__gamma': param_range, 'svc__kernel': ['rbf']}]
# param_grid = [{'svc__C': param_range, 'svc__kernel': ['linear', 'rbf'], 'svc__gamma': param_range}]
gs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid, scoring='accuracy', cv=10, n_jobs=-1)
gs = gs.fit(X, y)
end_time = time.time()
print("Randomized search elapsed time: %.3f s" % float(end_time - start_time))
print(gs.best_score_)
print(gs.best_params_)
import sklearn
print(sklearn.__version__)
0.19.0
On this machine, the RandomizedSearchCV() call above raises "'list' object has no attribute 'values'".
After inspecting the fit method, no combination of arguments makes it run, while the exact same setup works with GridSearchCV.
Looking at the implementations, both classes go through the same fit machinery, but they consume the parameter specification differently:
GridSearchCV enumerates every combination in the supplied dict(s), so it does not matter whether the grid is given as one dict or a list of dicts.
RandomizedSearchCV treats what it receives as a single dict of distributions and calls .values() on it to sample candidates, so a list of dicts fails on older versions.
A related explanation from another user: https://blog.csdn.net/a790209714/article/details/56834186
I could not get it to work either; applying the suggested fix still raises the same error...
Answer from team captain Huang: a reasonable guess is that this is a bug in sklearn's own code that was fixed in some release after 0.19. (It does look like a version issue: none of the other team members hit the error, only me, and my installed version is indeed rather old.)
To top it off, my attempt to upgrade sklearn failed and broke pyplot with "No module named 'kiwisolver'", a real tragedy; I ended up uninstalling and reinstalling everything (the various pip commands suggested by the error messages did not help).
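For reference, here is a workaround sketch that avoids handing RandomizedSearchCV a list of dicts by using a single dict of distributions instead. The loguniform ranges and n_iter below are illustrative choices, not from the original notebook, and scipy.stats.loguniform requires a reasonably recent scipy:

# Workaround sketch: a single dict of distributions instead of a list of dicts.
from scipy.stats import loguniform  # assumption: scipy version new enough to provide loguniform
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'svc__C': loguniform(1e-4, 1e3),
              'svc__gamma': loguniform(1e-4, 1e3),
              'svc__kernel': ['linear', 'rbf']}
rs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_dist,
                        n_iter=20, scoring='accuracy', cv=10, n_jobs=-1, random_state=1)
rs = rs.fit(X, y)
print(rs.best_score_)
print(rs.best_params_)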
# Confusion matrix:
# Load the data
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",header=None)
'''
Breast cancer dataset: 569 samples of malignant and benign tumor cells; M = malignant, B = benign
'''
# Basic preprocessing
from sklearn.preprocessing import LabelEncoder
X = df.iloc[:,2:].values
y = df.iloc[:,1].values
le = LabelEncoder()  # encode the string labels M/B as integers 0 and 1
y = le.fit_transform(y)
le.transform(['M','B'])
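LabelEncoder assigns integer codes in sorted label order, so here 'B' maps to 0 and 'M' to 1. A quick sanity check:

# Sanity check of the label encoding (classes are sorted, so B -> 0, M -> 1).
print(le.classes_)               # ['B' 'M']
print(le.transform(['M', 'B']))  # [1 0]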
# Split the data 8:2 into train and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=1)
from sklearn.svm import SVC
pipe_svc = make_pipeline(StandardScaler(),SVC(random_state=1))
from sklearn.metrics import confusion_matrix
pipe_svc.fit(X_train,y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test,y_pred=y_pred)
fig,ax = plt.subplots(figsize=(2.5,2.5))
ax.matshow(confmat, cmap=plt.cm.Blues,alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
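The same predictions can be summarized as precision, recall, and F1. Below is a minimal sketch using the y_test and y_pred computed above (by default these metrics treat label 1, i.e. 'M' / malignant, as the positive class):

# Summarize the confusion matrix as precision, recall and F1 (label 1 = 'M' is positive by default).
from sklearn.metrics import precision_score, recall_score, f1_score

print('precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('recall:    %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:        %.3f' % f1_score(y_true=y_test, y_pred=y_pred))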
# Plot the ROC curve:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import make_scorer, f1_score
scorer = make_scorer(f1_score, pos_label=0)
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring=scorer, cv=10)
y_pred = gs.fit(X_train, y_train).decision_function(X_test)
# y_pred = gs.predict(X_test)
fpr, tpr, threshold = roc_curve(y_test, y_pred)  # compute the false positive and true positive rates
roc_auc = auc(fpr, tpr)                          # compute the AUC
lw = 2
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic ')
plt.legend(loc="lower right")
plt.show()
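As a quick numeric cross-check, the same AUC can be computed directly from the decision-function scores with roc_auc_score:

# Cross-check: compute AUC directly from the decision-function scores used above.
from sklearn.metrics import roc_auc_score
print('AUC: %.3f' % roc_auc_score(y_test, y_pred))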
To consolidate what this chapter covers, here is a small task: take sklearn's fetch_lfw_people dataset and carry out a full exercise yourself. fetch_lfw_people is an image dataset; for details see:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html
The task is to recognize and classify the face images; a starter sketch follows below.
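As a starting point only (not a full solution), here is a minimal sketch that loads the dataset and reuses the pipeline-plus-grid-search pattern from this section. The min_faces_per_person value, the number of PCA components, and the parameter grid are all illustrative assumptions:

# Starter sketch for the suggested exercise (illustrative parameter choices).
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

faces = fetch_lfw_people(min_faces_per_person=60)  # keep people with at least 60 images (assumption)
X_faces, y_faces = faces.data, faces.target
Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    X_faces, y_faces, test_size=0.2, stratify=y_faces, random_state=1)

# PCA reduces the raw pixel features before the SVC; 150 components is an assumption.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=150, whiten=True, random_state=1),
                     SVC(kernel='rbf', class_weight='balanced', random_state=1))
grid = {'svc__C': [1, 5, 10, 50], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1)
search.fit(Xf_train, yf_train)
print(search.best_params_)
print('test accuracy: %.3f' % search.score(Xf_test, yf_test))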
References:
https://blog.csdn.net/cwlseu/article/details/52356665
https://blog.csdn.net/jasonzhoujx/article/details/81905923