4_ParamOptimization.py: parameter optimization

Article link: https://blog.csdn.net/hellosmile123456/article/details/104145088

import pickle
import xgboost as xgb
import numpy as np
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, mean_squared_error
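# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script needs an older scikit-learn release to run as written.
from sklearn.datasets import load_iris, load_digits, load_boston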

rng = np.random.RandomState(31337)

print("Boston Housing: regression")
boston = load_boston()

print("Parameter optimization")
y = boston['target']
X = boston['data']
xgb_model = xgb.XGBRegressor()
clf = GridSearchCV(
    xgb_model,
    {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]},
    verbose=1,
)
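"""
Parameters:
estimator: the estimator (or Pipeline) whose parameters are tuned.
param_grid: a dict, or a list of dicts, mapping parameter names to the
candidate values to search over.
scoring: the evaluation metric, default None. It can be a string naming a
metric, e.g. scoring='roc_auc' (which metrics make sense depends on the model),
or a callable with the signature scorer(estimator, X, y).
If None, the estimator's own score method is used.
n_jobs: number of parallel jobs. The default None means 1 unless a joblib
parallel_backend context is active; -1 uses all CPU cores.
pre_dispatch: controls how many jobs are dispatched during parallel execution.
When n_jobs > 1 the data is copied for every dispatched job, which can cause
out-of-memory errors; limiting pre_dispatch caps the number of jobs dispatched
up front, so the data is copied at most pre_dispatch times.
iid: if True, the folds are assumed to be identically distributed and the score
is weighted by the number of samples in each fold instead of being a plain mean
over folds. This parameter was deprecated in scikit-learn 0.22 and removed in 0.24.
cv: cross-validation strategy. The default None uses 5-fold cross-validation
(3-fold before scikit-learn 0.22). An int sets the number of folds; it can also
be an iterable that yields (train, test) index splits.
refit: default True. After the search finishes, the estimator is fitted once
more on the whole dataset passed to fit, using the best parameters found, and
that refitted model is the one used by predict and the other methods.
verbose: log verbosity as an int. 0: no output during training, 1: occasional
output, >1: output for every sub-model.
"""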
clf.fit(X,y)
print("clf.best_score_=",clf.best_score_)
print("clf.best_params_=",clf.best_params_)
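"""
Attributes:
best_estimator_: the estimator that gave the best score, refitted on the full
data (available only when refit=True).
best_score_: mean cross-validated score of the best candidate. Since scoring is
None here, this is the R^2 returned by XGBRegressor.score.
best_params_: the parameter combination that gave the best result.
best_index_: index into the cv_results_ arrays of the best candidate.
Methods:
decision_function(X): calls decision_function on the best estimator found.
fit(X, y=None, groups=None, **fit_params): runs the search.
get_params(deep=True): returns the parameters of this estimator.
predict(X): calls predict on the best estimator found (for a classifier, the
predicted class of each sample; for the regressor used here, the predicted value).
predict_log_proba(X): calls predict_log_proba on the best estimator found
(the log of each sample's probability for every class).
predict_proba(X): calls predict_proba on the best estimator found
(each sample's probability for every class).
score(X, y=None): returns the score of the best estimator on the given data.
transform(X): calls transform on the best estimator found.
Apart from fit and get_params, these methods delegate to best_estimator_, so
they require refit=True and an underlying estimator that implements them.
"""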
# The sklearn API models are picklable
print("Pickling sklearn API models")
# must open in binary format to pickle
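with open("best_boston.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("best_boston.pkl", "rb") as f:
    clf2 = pickle.load(f)
print("np.allclose(clf.predict(X),clf2.predict(X))=",
      np.allclose(clf.predict(X), clf2.predict(X)))
# np.allclose returns True when every pair of elements agrees within a tolerance:
# |a - b| <= atol + rtol * |b|, with defaults rtol=1e-05 and atol=1e-08.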
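
The tolerance check behind np.allclose can be sketched in plain Python to show what "close" means for the pickled and reloaded predictions. The helper name `allclose_like` and the sample numbers are made up for illustration; only the formula and the default tolerances follow the NumPy documentation.

```python
def allclose_like(a, b, rtol=1e-05, atol=1e-08):
    """Pure-Python stand-in for the documented np.allclose test on 1-D sequences.

    Note the test is asymmetric: the relative tolerance scales with |b| only.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Hypothetical predictions, not taken from the Boston run.
original = [24.0, 21.6, 34.7]
reloaded_same = list(original)           # identical values, as after a pickle round trip
reloaded_small = [24.0, 21.6, 34.7001]   # drift of 1e-4, inside the tolerance for ~34.7
reloaded_large = [24.0, 21.6, 34.7004]   # drift of 4e-4, outside the tolerance for ~34.7

same_ok = allclose_like(original, reloaded_same)
small_ok = allclose_like(original, reloaded_small)
large_ok = allclose_like(original, reloaded_large)
```

For values around 34.7 the allowed difference is roughly 1e-08 + 1e-05 * 34.7, about 3.5e-04, so the check is relative to the magnitude of the predictions and not a fixed 1e-05 window.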

Output of 4_ParamOptimization.py:
Boston Housing: regression
Parameter optimization
Fitting 5 folds for each of 9 candidates, totalling 45 fits
clf.best_score_= 0.6697440508380478
clf.best_params_= {'max_depth': 2, 'n_estimators': 100}
np.allclose(clf.predict(X),clf2.predict(X))= True
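
The line "Fitting 5 folds for each of 9 candidates, totalling 45 fits" follows from the grid size (3 x 3 combinations) and the 5 cross-validation folds. Below is a minimal sketch of that bookkeeping in plain Python, not scikit-learn's actual implementation. The functions `expand_grid`, `kfold_slices`, `grid_search` and the scorer `toy_score` are hypothetical names introduced here; `toy_score` ignores the data and only stands in for "fit on the training fold, score on the test fold". The sample count 506 is the size of the Boston dataset.

```python
from itertools import product
from statistics import mean


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a dict."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


def kfold_slices(n_samples, n_splits):
    """Yield (start, stop) test ranges for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        yield start, stop
        start = stop


def grid_search(param_grid, n_samples, n_splits, fit_and_score):
    """Score each candidate on each fold; return best params, best mean score, fit count."""
    results = []
    n_fits = 0
    for params in expand_grid(param_grid):
        fold_scores = []
        for start, stop in kfold_slices(n_samples, n_splits):
            test_idx = range(start, stop)
            train_idx = [i for i in range(n_samples) if not start <= i < stop]
            fold_scores.append(fit_and_score(params, train_idx, test_idx))
            n_fits += 1
        results.append((mean(fold_scores), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, n_fits


def toy_score(params, train_idx, test_idx):
    # Hypothetical scorer: ignores the data and simply prefers
    # shallow trees with about 100 estimators.
    return -abs(params["max_depth"] - 2) - abs(params["n_estimators"] - 100) / 100


param_grid = {"max_depth": [2, 4, 6], "n_estimators": [50, 100, 200]}
candidates = expand_grid(param_grid)
fold_sizes = [stop - start for start, stop in kfold_slices(506, 5)]
best_params, best_score, n_fits = grid_search(
    param_grid, n_samples=506, n_splits=5, fit_and_score=toy_score)
print(len(candidates), "candidates,", n_fits, "fits, best:", best_params)
```

The best parameters printed by this sketch are decided entirely by the made-up scorer, so they say nothing about the real model. The point is only the counting: every candidate is fitted once per fold, and the candidate with the highest mean fold score wins.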