Grid Search
When choosing hyperparameters we have two routes: (1) rely on experience; (2) plug parameter values of different sizes into the model and pick whichever performs best. Tuning by hand along route 2 costs far too much attention to be worthwhile, and a for loop (or anything like one) is rigidly nested, not concise or flexible, and error-prone. GridSearchCV, grid search with cross-validation, iterates over every combination of the parameter values passed in and, via cross-validation, returns the evaluation score for each combination.
GridSearchCV sounds sophisticated, but it is really just brute-force search. Note that it is useful on small datasets; once the dataset grows large it is no longer practical.
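The "every combination" behavior can be seen directly with sklearn's ParameterGrid, which GridSearchCV uses internally to enumerate candidates. A minimal sketch, using the same parameter dictionary as the example below:

```python
from sklearn.model_selection import ParameterGrid

# same parameter dictionary as in the example below
parameters = {'n_estimators': [20, 50, 100], 'max_depth': [1, 2, 3]}
candidates = list(ParameterGrid(parameters))
print(len(candidates))  # 3 * 3 = 9 combinations; with cv=3 that means 27 fits
print(candidates[0])    # one combination, as a plain dict
```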
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_iris # built-in sample dataset
iris = load_iris()
X = iris.data # 150 samples, 4 features
y = iris.target # 150 class labels
# Using a random forest to demonstrate the basic usage
# Exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split # data splitting
# Split the data: 80% training, 20% validation
train_data, test_data, train_target, test_target = train_test_split(
X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier()
parameters = {'n_estimators': [20, 50, 100], 'max_depth': [1, 2, 3]}
clf = GridSearchCV(model, parameters, cv=3, verbose=2)
clf.fit(train_data, train_target)
print("Best parameters:")
print(clf.best_params_)
print("Best score:")
print(clf.best_score_)
sorted(clf.cv_results_.keys())
score_test = roc_auc_score(test_target, clf.predict_proba(test_data), multi_class='ovr')
print("RandomForestClassifier GridSearchCV test AUC: ", score_test)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] max_depth=1, n_estimators=20 ....................................
[CV] ..................... max_depth=1, n_estimators=20, total= 0.0s
[CV] max_depth=1, n_estimators=20 ....................................
[CV] ..................... max_depth=1, n_estimators=20, total= 0.0s
...
[CV] max_depth=3, n_estimators=100 ...................................
[CV] .................... max_depth=3, n_estimators=100, total= 0.1s
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.7s finished
Best parameters:
{'max_depth': 2, 'n_estimators': 50}
Best score:
0.9583333333333334
RandomForestClassifier GridSearchCV test AUC:  1.0
Process finished with exit code 0
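Besides best_params_ and best_score_, the score of every combination is kept in cv_results_, the dict whose keys are listed by the sorted(clf.cv_results_.keys()) call above. A self-contained sketch (refitting a deliberately smaller grid on the full iris data, just to keep it short) that loads it into a pandas DataFrame for inspection:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# a 2x2 grid instead of the 3x3 one above, so the run stays quick
clf = GridSearchCV(RandomForestClassifier(random_state=0),
                   {'n_estimators': [20, 50], 'max_depth': [1, 2]},
                   cv=3)
clf.fit(X, y)
# cv_results_ is a dict of arrays, one row per parameter combination
results = pd.DataFrame(clf.cv_results_)
print(results[['param_max_depth', 'param_n_estimators',
               'mean_test_score', 'rank_test_score']])
```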
Random Search
When searching over hyperparameters, grid search, an exhaustive method, works fine if there are only a few of them (three or four, or fewer).
With more hyperparameters, however, the time grid search needs grows exponentially. Hence random search was proposed: sample a few dozen or a few hundred points at random in the hyperparameter space, and some of them are likely to score well. This is faster than thinning out the grid, and experiments show that random search tends to do slightly better than a sparse grid.
RandomizedSearchCV is used much like GridSearchCV, but instead of trying all possible combinations it evaluates a fixed number of random combinations, drawing one random value per hyperparameter each time. This has two advantages: you can evaluate far fewer combinations than the full parameter space contains, while the search, left to run, still explores different values of each hyperparameter; and you can control the computational budget simply by setting the number of search iterations, so adding parameter values does not hurt efficiency. Unlike grid search, for a continuous parameter RandomizedSearchCV can treat it as a distribution and sample from it, something grid search cannot do; its search capacity is governed by the n_iter parameter.
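The distribution-sampling behavior can be sketched as follows; scipy.stats.randint stands in here for any distribution object with an rvs method, and the parameter ranges are illustrative, not taken from the example below:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)
# distributions instead of fixed lists: each candidate draws fresh values
param_dist = {'n_estimators': randint(10, 100), 'max_depth': randint(1, 4)}
clf = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                         n_iter=8, cv=3, random_state=0)  # n_iter caps the budget
clf.fit(X, y)
print(clf.best_params_)  # values sampled from the distributions above
```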
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_iris # built-in sample dataset
iris = load_iris()
X = iris.data # 150 samples, 4 features
y = iris.target # 150 class labels
# Using a random forest to demonstrate the basic usage
# Randomized parameter optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split # data splitting
# Split the data: 80% training, 20% validation
train_data, test_data, train_target, test_target = train_test_split(
X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30, 50], 'max_depth': [1, 2, 3]}
clf = RandomizedSearchCV(model, parameters, cv=3, verbose=2)
clf.fit(train_data, train_target)
score_test = roc_auc_score(test_target, clf.predict_proba(test_data), multi_class='ovr')
print("RandomForestClassifier RandomizedSearchCV test AUC: ", score_test)
print("Best parameters:")
print(clf.best_params_)
sorted(clf.cv_results_.keys())
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] n_estimators=10, max_depth=3 ....................................
[CV] ..................... n_estimators=10, max_depth=3, total= 0.0s
[CV] n_estimators=10, max_depth=3 ....................................
[CV] ..................... n_estimators=10, max_depth=3, total= 0.0s
...
[CV] n_estimators=50, max_depth=1 ....................................
[CV] ..................... n_estimators=50, max_depth=1, total= 0.1s
[Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.9s finished
RandomForestClassifier RandomizedSearchCV test AUC:  1.0
Best parameters:
{'n_estimators': 30, 'max_depth': 3}
Process finished with exit code 0