Parameters of several functions in the sklearn.model_selection module of the Python machine learning library

Original article, last modified 2023-05-13 16:09:39


Licensed under CC 4.0 BY-SA

Article tags:

#machine-learning #sklearn #python

First published 2018-01-16 19:52:28


Problems that the powerful sklearn library can solve:
[Figure: image not included]

### train_test_split returns the split train/test datasets

train_test_split(*arrays,test_size=0.25,train_size=None,random_state=None,shuffle=True,stratify=None)

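*arrays: the data to split (list/np.array/pd.DataFrame/scipy sparse matrices); all inputs must have the same length
test_size and train_size: a float between 0 and 1 is a proportion of the dataset, an int is an absolute number of samples; if only one of the two is given, the other is set to its complement
shuffle: whether to shuffle the data before splitting
stratify: array of class labels used for stratified splitting, or None for no stratification (If shuffle=False then stratify must be None.)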
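A minimal sketch of a stratified split may help here. The toy data below (40 samples with an 80/20 class imbalance) is made up for illustration and is not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 40 samples, 2 features,
# 32 samples of class 0 and 8 samples of class 1.
X = np.arange(80).reshape(40, 2)
y = np.array([0] * 32 + [1] * 8)

# Hold out 25% of the samples. Passing the labels as `stratify`
# asks for the class proportions to be preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, shuffle=True, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With these sizes the 80/20 class ratio divides evenly, so both the training and the test split should keep it. Without `stratify`, a small test set could easily end up with very few minority-class samples.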

### cross_validate returns the per-fold scores on the train/test datasets

cross_validate(estimator,X,y=None,groups=None,scoring=None,cv=None,n_jobs=1,
verbose=0,fit_params=None,pre_dispatch='2*n_jobs',return_train_score='warn')

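estimator: the estimator (learner) to fit
X: the feature data
y: the target labels (may be omitted for unsupervised learning)
groups: group labels for the samples, used by group-aware splitters when dividing the data into train/test sets
scoring: the metric(s) evaluated on the test set; a single metric name or callable, or several metrics given as a list/dict
cv: the cross-validation strategy; default=3 folds in the version described here (the default became 5 folds in scikit-learn 0.22), or another int, or a cv generator
n_jobs: number of CPUs used for the computation; n_jobs=-1 uses all CPUs
verbose: verbosity level of the messages printed during evaluation
fit_params: dict of parameters passed to the estimator's fit method
pre_dispatch: number of jobs dispatched in advance during parallel execution (limiting it protects memory)
return_train_score: whether to also return the scores on the training set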
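As a rough illustration of the return value, the sketch below runs cross_validate with two metrics. The iris dataset and the LogisticRegression estimator are illustrative choices, not taken from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Example data and estimator (assumptions for illustration only).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Evaluate two metrics over 5 folds and also keep the training scores.
results = cross_validate(
    clf, X, y,
    cv=5,
    scoring=["accuracy", "f1_macro"],
    return_train_score=True,
)

# The result is a dict of arrays with one entry per fold.
for key in sorted(results):
    print(key, results[key].round(3))
```

The returned dict contains `fit_time` and `score_time`, plus one `test_<metric>` array per metric, and the matching `train_<metric>` arrays when return_train_score=True.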

### GridSearchCV returns the best parameter combination/score

GridSearchCV(estimator,param_grid,scoring=None,n_jobs=1,iid=True,refit=True,cv=None,
verbose=0,pre_dispatch='2*n_jobs',error_score='raise',return_train_score='warn')

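estimator: the estimator (learner) to tune
param_grid: dict mapping parameter names to the lists of values to try (or a list of such dicts)
scoring: the metric(s) evaluated on the test set; a single metric name or callable, or several metrics given as a list/dict
n_jobs: number of CPUs used for the computation; n_jobs=-1 uses all CPUs
iid: if True, the samples are assumed to be identically distributed across folds, and the score is the total loss per sample rather than the mean over folds (this parameter was deprecated in scikit-learn 0.22 and removed in 0.24)
cv: the cross-validation strategy; default=3 folds in the version described here (the default became 5 folds in scikit-learn 0.22), or another int, or a cv generator
verbose: verbosity level of the messages printed during evaluation
pre_dispatch: number of jobs dispatched in advance during parallel execution (limiting it protects memory)
return_train_score: whether to also return the scores on the training set
error_score: the value assigned to the score when fitting the estimator fails; 'raise' raises the error instead. It does not affect the refit step, which always raises the error
refit: whether to refit the estimator on the whole dataset using the best parameter combination found. From the documentation:
(refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
For multiple metric evaluation, this needs to be a string denoting the scorer that is used to find the best parameters for refitting the estimator at the end.
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer.
See scoring parameter to know more about multiple metric evaluation.)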
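The effect of refit can be seen in a small sketch. The SVC estimator, the iris dataset, and the grid values below are assumptions chosen for illustration; they do not come from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Example data and a hypothetical parameter grid (6 combinations).
X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# With refit=True, the best combination is refitted on the whole dataset
# after the search, so the search object itself can be used to predict.
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=5, refit=True)
search.fit(X, y)

print(search.best_params_)           # best parameter combination
print(round(search.best_score_, 3))  # mean cross-validated score of that combination
print(search.predict(X[:3]))         # delegates to search.best_estimator_
```

With refit=False the search still reports `best_params_` and `best_score_` for a single metric, but `best_estimator_` and `predict` are not available.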

### learning_curve returns the cross-validated train/test scores for each of the specified training set sizes

learning_curve(estimator,X,y,groups=None,train_sizes=array([0.1,0.33,0.55,0.78,1.]),cv=None,
scoring=None,exploit_incremental_learning=False,n_jobs=1,
pre_dispatch='all',verbose=0,shuffle=False,random_state=None)

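estimator: the estimator (learner) to fit
X: the feature data
y: the target labels
groups: group labels for the samples, used by group-aware splitters when dividing the data into train/test sets
train_sizes: the training set sizes to evaluate, as fractions of the maximum training set size (floats) or absolute sample counts (ints)
cv: the cross-validation strategy; default=3 folds in the version described here (the default became 5 folds in scikit-learn 0.22), or another int, or a cv generator
scoring: the metric evaluated on the test set
n_jobs: number of CPUs used for the computation; n_jobs=-1 uses all CPUs
pre_dispatch: number of jobs dispatched in advance during parallel execution (limiting it protects memory)
verbose: verbosity level of the messages printed during evaluation
shuffle: whether to shuffle the training data before taking the subsets defined by train_sizes
random_state: random seed, used when shuffle=True
exploit_incremental_learning: if the estimator supports incremental learning, use it to speed up fitting for the different training set sizes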
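A short sketch shows the shape of what learning_curve returns. The GaussianNB estimator and the iris dataset are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Example data and estimator (assumptions for illustration only).
X, y = load_iris(return_X_y=True)

# Five training set sizes, from 10% to 100% of each training fold.
# shuffle=True matters here because iris is ordered by class, so an
# unshuffled small subset would contain mostly one class.
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
)

# Each score array has one row per training size and one column per fold.
print(train_sizes)
print(train_scores.mean(axis=1).round(3))
print(test_scores.mean(axis=1).round(3))
```

Averaging over the fold axis gives one training score and one test score per training size, which are the two curves usually plotted against `train_sizes`.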
