sklearn.model_selection.RandomizedSearchCV

iks325

Last modified 2024-09-09 10:35:28

Article tags: sklearn, artificial intelligence, python

First published 2023-06-08 13:43:11

Article link: https://blog.csdn.net/weixin_67482129/article/details/131104159

class sklearn.model_selection.RandomizedSearchCV(
estimator, 
param_distributions, 
*, 
n_iter=10, 
scoring=None,
n_jobs=None, 
refit=True, 
cv=None, 
verbose=0, 
pre_dispatch='2*n_jobs', 
random_state=None, 
error_score=nan, 
return_train_score=False
)

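Randomized search on hyperparameters.

RandomizedSearchCV implements a "fit" and a "score" method. If the estimator passed to it implements "score_samples", "predict", "predict_proba", "decision_function", "transform" and "inverse_transform", the search object exposes those methods as well.

The parameters of the estimator used to apply these methods (the estimator passed to RandomizedSearchCV) are optimized by a cross-validated search over parameter settings.

In contrast to GridSearchCV, which tries every parameter combination, randomized search samples a fixed number of parameter settings from the specified distributions. The number of settings that are sampled is given by n_iter.

If all parameters are given as lists, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. scikit-learn strongly recommends using continuous distributions for continuous parameters.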
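
The difference between the two sampling modes can be seen without fitting any model. The sketch below uses sklearn.model_selection.ParameterSampler, the helper class that draws parameter settings from such a specification; the search space itself is made up for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# All parameters given as lists: the 2 x 2 grid has 4 combinations and
# they are drawn without replacement, so no setting appears twice.
list_space = {"kernel": ["linear", "rbf"], "C": [1, 10]}
list_samples = list(ParameterSampler(list_space, n_iter=4, random_state=0))

# At least one parameter given as a distribution: settings are drawn
# with replacement, and C takes continuous values between 1 and 10.
dist_space = {"kernel": ["linear", "rbf"], "C": uniform(loc=1, scale=9)}
dist_samples = list(ParameterSampler(dist_space, n_iter=4, random_state=0))

for params in list_samples + dist_samples:
    print(params)
```

Each element is a plain dict of parameter values, which is the same form that later appears in cv_results_['params'].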

Parameters:

1. estimator: estimator object

An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

In other words, the first argument is the estimator itself, for example an SVM or a linear regression model. A "grid point" is one specific parameter setting: picture a grid spanned by the candidate parameter values, where each node corresponds to one combination (compare grid search).

2. param_distributions: dict or list of dicts

Dictionary with parameter names (str) as keys and distributions or lists of parameters to try. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly. If a list of dicts is given, first a dict is sampled uniformly, and then a parameter is sampled using that dict as above.

*This is a very important and frequently used parameter. The following shows how to set it, using SVC as the estimator:

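from scipy.stats import uniform
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV

# scipy.stats.uniform takes loc and scale, which correspond to a and b - a
# for a uniform distribution on [a, b].

# Two SVC parameters are searched here, the kernel and the penalty parameter C,
# one given as a list and the other as a continuous distribution:
# 1. The list ['linear', 'rbf'] means kernel is either 'linear' or 'rbf'.
# 2. uniform(loc=1, scale=9) is the uniform distribution on [1, 10].
#    Other scipy.stats distributions are passed the same way, e.g.
#    scipy.stats.norm(loc=0, scale=1) for a standard normal (not suitable
#    for C itself, which must be strictly positive).

svc = svm.SVC()

distributions = {'kernel': ['linear', 'rbf'], 'C': uniform(loc=1, scale=9)}

# An integer cv must be at least 2; 5 matches the library default.
clf = RandomizedSearchCV(estimator=svc,
                         param_distributions=distributions,
                         n_iter=4,
                         scoring='accuracy',
                         cv=5,
                         n_jobs=-1,
                         random_state=2023)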
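
The list-of-dicts form of param_distributions is useful when some parameters only make sense together. As a rough sketch (the sub-spaces below are made up for illustration), gamma is only searched when the rbf kernel is picked; ParameterSampler is used so the sampled settings can be inspected directly.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Two sub-spaces: first one dict is picked uniformly, then the
# parameters are sampled from that dict only.
space = [
    {"kernel": ["linear"], "C": uniform(loc=1, scale=9)},
    {"kernel": ["rbf"], "C": uniform(loc=1, scale=9), "gamma": [0.01, 0.1, 1.0]},
]

samples = list(ParameterSampler(space, n_iter=6, random_state=0))
for params in samples:
    print(params)
```

The same list can be passed unchanged as param_distributions to RandomizedSearchCV.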

3. n_iter: int, default=10

Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.

4. scoring: str, callable, list, tuple or dict, default=None (with None, the estimator's score method is used)

Strategy to evaluate the performance of the cross-validated model on the test set.

If scoring represents a single score, one can use:

a single string (see The scoring parameter: defining model evaluation rules);

*For reference, the list of predefined scoring strings is reproduced here:

Scoring	Function	Comment
Classification
‘accuracy’	metrics.accuracy_score
‘balanced_accuracy’	metrics.balanced_accuracy_score
‘top_k_accuracy’	metrics.top_k_accuracy_score
‘average_precision’	metrics.average_precision_score
‘neg_brier_score’	metrics.brier_score_loss
‘f1’	metrics.f1_score	for binary targets
‘f1_micro’	metrics.f1_score	micro-averaged
‘f1_macro’	metrics.f1_score	macro-averaged
‘f1_weighted’	metrics.f1_score	weighted average
‘f1_samples’	metrics.f1_score	by multilabel sample
‘neg_log_loss’	metrics.log_loss	requires `predict_proba` support
‘precision’ etc.	metrics.precision_score	suffixes apply as with ‘f1’
‘recall’ etc.	metrics.recall_score	suffixes apply as with ‘f1’
‘jaccard’ etc.	metrics.jaccard_score	suffixes apply as with ‘f1’
‘roc_auc’	metrics.roc_auc_score
‘roc_auc_ovr’	metrics.roc_auc_score
‘roc_auc_ovo’	metrics.roc_auc_score
‘roc_auc_ovr_weighted’	metrics.roc_auc_score
‘roc_auc_ovo_weighted’	metrics.roc_auc_score
Clustering
‘adjusted_mutual_info_score’	metrics.adjusted_mutual_info_score
‘adjusted_rand_score’	metrics.adjusted_rand_score
‘completeness_score’	metrics.completeness_score
‘fowlkes_mallows_score’	metrics.fowlkes_mallows_score
‘homogeneity_score’	metrics.homogeneity_score
‘mutual_info_score’	metrics.mutual_info_score
‘normalized_mutual_info_score’	metrics.normalized_mutual_info_score
‘rand_score’	metrics.rand_score
‘v_measure_score’	metrics.v_measure_score
Regression
‘explained_variance’	metrics.explained_variance_score
‘max_error’	metrics.max_error
‘neg_mean_absolute_error’	metrics.mean_absolute_error
‘neg_mean_squared_error’	metrics.mean_squared_error
‘neg_root_mean_squared_error’	metrics.mean_squared_error
‘neg_mean_squared_log_error’	metrics.mean_squared_log_error
‘neg_median_absolute_error’	metrics.median_absolute_error
‘r2’	metrics.r2_score
‘neg_mean_poisson_deviance’	metrics.mean_poisson_deviance
‘neg_mean_gamma_deviance’	metrics.mean_gamma_deviance
‘neg_mean_absolute_percentage_error’	metrics.mean_absolute_percentage_error
‘d2_absolute_error_score’	metrics.d2_absolute_error_score
‘d2_pinball_score’	metrics.d2_pinball_score
‘d2_tweedie_score’	metrics.d2_tweedie_score

a callable (see Defining your scoring strategy from metric functions) that returns a single value.

If scoring represents multiple scores, one can use:

a list or tuple of unique strings;
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
a dictionary with metric names as keys and callables as values.

See Specifying multiple metrics for evaluation for an example.

If None, the estimator's score method is used.
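
As an illustration of the multi-score case, the sketch below passes a list of two scoring strings. The data set, search space and metric choice are arbitrary; with several metrics, refit has to name the one that selects the best candidate.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]},
    n_iter=4,
    scoring=["accuracy", "f1_macro"],  # two metrics per candidate
    refit="accuracy",                  # metric used to pick the best candidate
    cv=3,
    random_state=0,
)
search.fit(X, y)

# One result column per metric, named after the scorer.
print(search.cv_results_["mean_test_accuracy"])
print(search.cv_results_["mean_test_f1_macro"])
print(search.best_params_)
```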

5. n_jobs：int, default=None

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Changed in version v0.20: n_jobs default changed from 1 to None

In short: the number of CPU cores used in parallel; -1 means all of them.

6. refit: bool, str, or callable, default=True

Refit an estimator using the best found parameters on the whole dataset.

For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end.

Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given the cv_results. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available.

The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this RandomizedSearchCV instance.

Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer.
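
A callable refit might look like the sketch below. The selection rule (fastest candidate among those within a small tolerance of the best score) and the function name are hypothetical, chosen only to show the mechanics: the function receives cv_results_ and returns an integer index.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC


def fastest_within_tolerance(cv_results, tol=0.02):
    """Hypothetical rule: among candidates whose mean test score is within
    `tol` of the best one, return the index of the fastest to fit."""
    scores = np.asarray(cv_results["mean_test_score"])
    fit_times = np.asarray(cv_results["mean_fit_time"])
    good_enough = np.flatnonzero(scores >= scores.max() - tol)
    return int(good_enough[np.argmin(fit_times[good_enough])])


X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]},
    n_iter=5,
    cv=3,
    refit=fastest_within_tolerance,  # called with the cv_results_ dict
    random_state=0,
)
search.fit(X, y)

print(search.best_index_, search.best_params_)
print(hasattr(search, "best_score_"))  # best_score_ is not set in this mode
```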

7. cv: int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

None, to use the default 5-fold cross validation (note: if cv is not specified, 5-fold cross-validation is what you get!),
integer, to specify the number of folds in a (Stratified)KFold (the integer must be at least 2; to evaluate on a single train/validation split instead, pass a splitter such as ShuffleSplit(n_splits=1) or an explicit iterable of splits),
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
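
An integer cv has to be at least 2, because k-fold splitting needs at least two folds. If a single train/validation split is wanted instead of full cross-validation, one option is to pass a splitter object. The sketch below uses ShuffleSplit with n_splits=1; the 75/25 split ratio and the search space are arbitrary.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One random 75% / 25% train/validation split instead of k folds.
single_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9)},
    n_iter=4,
    cv=single_split,
    random_state=0,
)
search.fit(X, y)

print(search.n_splits_)
print(search.cv_results_["split0_test_score"])
```

With only one split, each candidate's score rests on a single validation set, so the ranking is noisier than with k-fold cross-validation.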

8. verbose: int

Controls the verbosity: the higher, the more messages. (Note: this decides how much progress information, such as timings and scores, is printed to the screen while the search runs. Most libraries of this kind have such a parameter, and 0 means nothing is printed.)

>1 : the computation time for each fold and parameter candidate is displayed;
>2 : the score is also displayed;
>3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.

9. pre_dispatch: int, or str, default='2*n_jobs'

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process.

This parameter can be:

None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

An int, giving the exact number of total jobs that are spawned.

A str, giving an expression as a function of n_jobs, as in '2*n_jobs'.

10. random_state: int, RandomState instance or None, default=None

Pseudo random number generator state used for random uniform sampling from lists of possible values instead of scipy.stats distributions. Pass an int for reproducible output across multiple function calls. See Glossary.
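
A quick way to check reproducibility is to draw the candidate settings twice with the same integer seed. The sketch below does this with ParameterSampler; the seed value and search space are arbitrary.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

space = {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]}

# The same integer seed yields the same sequence of candidate settings.
first = list(ParameterSampler(space, n_iter=5, random_state=2023))
second = list(ParameterSampler(space, n_iter=5, random_state=2023))

print(first == second)
```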

11. error_score: 'raise' or numeric, default=np.nan

Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

12. return_train_score: bool, default=False

If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

For that reason, leaving it at False is a sensible default.

Attributes:

1. cv_results_: dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

For instance the below given table:

param_kernel	param_gamma	split0_test_score	…	rank_test_score
‘rbf’	0.1	0.80	…	1
‘rbf’	0.2	0.84	…	3
‘rbf’	0.3	0.70	…	2

will be represented by a cv_results_ dict of:

{
'param_kernel' : masked_array(data = ['rbf', 'rbf', 'rbf'],
                              mask = False),
'param_gamma'  : masked_array(data = [0.1 0.2 0.3], mask = False),
'split0_test_score'  : [0.80, 0.84, 0.70],
'split1_test_score'  : [0.82, 0.50, 0.70],
'mean_test_score'    : [0.81, 0.67, 0.70],
'std_test_score'     : [0.01, 0.24, 0.00],
'rank_test_score'    : [1, 3, 2],
'split0_train_score' : [0.80, 0.92, 0.70],
'split1_train_score' : [0.82, 0.55, 0.70],
'mean_train_score'   : [0.81, 0.74, 0.70],
'std_train_score'    : [0.01, 0.19, 0.00],
'mean_fit_time'      : [0.73, 0.63, 0.43],
'std_fit_time'       : [0.01, 0.02, 0.01],
'mean_score_time'    : [0.01, 0.06, 0.04],
'std_score_time'     : [0.00, 0.00, 0.00],
'params'             : [{'kernel' : 'rbf', 'gamma' : 0.1}, ...],
}

NOTE

The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

For multi-metric evaluation, the scores for all the scorers are available in the cv_results_ dict at the keys ending with that scorer's name ('_<scorer_name>') instead of '_score' shown above. ('split0_test_precision', 'mean_train_precision' etc.)
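
Loading cv_results_ into a DataFrame makes the candidates easy to compare. The sketch below assumes pandas is installed; the search space is arbitrary, and return_train_score=True is set so that the training-score columns are present.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]},
    n_iter=4,
    cv=3,
    return_train_score=True,  # adds the *_train_score columns
    random_state=0,
)
search.fit(X, y)

# One row per sampled candidate, one column per cv_results_ key.
results = pd.DataFrame(search.cv_results_)
columns = ["param_kernel", "param_C", "mean_train_score",
           "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```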

2. best_estimator_: estimator

Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.

For multi-metric evaluation, this attribute is present only if refit is specified.

3. best_score_: float

Mean cross-validated score of the best_estimator.

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

This attribute is not available if refit is a function.

4. best_params_: dict

Parameter setting that gave the best results on the hold out data.

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.

5. best_index_: int

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, that gives the highest mean score (search.best_score_).

For multi-metric evaluation, this is not available if refit is False. See refit parameter for more information.
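
The relationship between best_index_, best_params_ and best_score_ can be checked directly on a fitted search. This is a small sketch with an arbitrary search space, using the default refit=True and a single metric.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)

best = search.best_index_
# best_index_ points at the same row in every cv_results_ array.
print(search.cv_results_["params"][best] == search.best_params_)
print(search.cv_results_["mean_test_score"][best] == search.best_score_)
print(search.cv_results_["rank_test_score"][best])
```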

6. scorer_: function or a dict

Scorer function used on the held out data to choose the best parameters for the model.

For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.

7. n_splits_: int

The number of cross-validation splits (folds/iterations).

8. refit_time_: float

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.

9. multimetric_: bool

Whether or not the scorers compute several metrics.

10. classes_: ndarray of shape (n_classes,)

Class labels.

11. n_features_in_: int

Number of features seen during fit.

12. feature_names_in_: ndarray of shape (n_features_in_,)

Names of features seen during fit. Only defined if best_estimator_ is defined (see the documentation for the refit parameter for more details) and that best_estimator_ exposes feature_names_in_ when fit.

Methods:

decision_function(X)	Call decision_function on the estimator with the best found parameters.
fit(X[, y, groups])	Run fit with all sets of parameters.
get_params([deep])	Get parameters for this estimator.
inverse_transform(Xt)	Call inverse_transform on the estimator with the best found params.
predict(X)	Call predict on the estimator with the best found parameters.
predict_log_proba(X)	Call predict_log_proba on the estimator with the best found parameters.
predict_proba(X)	Call predict_proba on the estimator with the best found parameters.
score(X[, y])	Return the score on the given data, if the estimator has been refit.
score_samples(X)	Call score_samples on the estimator with the best found parameters.
set_params(**params)	Set the parameters of this estimator.
transform(X)	Call transform on the estimator with the best found parameters.

property classes_

Class labels.

Only available when refit=True and the estimator is a classifier.

1. decision_function(X)

Call decision_function on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports decision_function.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

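X: indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_score: ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)

Result of the decision function for X based on the estimator with the best found parameters.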

---------------------------------------------------------------------------------------------------------------------------------

2. fit(X, y=None, *, groups=None, **fit_params)

Run fit with all sets of parameters.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

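X: array-like of shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like of shape (n_samples, n_output) or (n_samples,), default=None

Target relative to X for classification or regression; None for unsupervised learning.

groups: array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold).

**fit_params: dict of str -> object

Parameters passed to the fit method of the estimator.

If a fit parameter is an array-like whose length is equal to n_samples, then it will be split across CV groups along with X and y. For example, the sample_weight parameter is split because len(sample_weights) = len(X).

Returns:

self: object

Instance of fitted estimator.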
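
The groups argument only has an effect together with a group-aware splitter. In the sketch below the group labels are made up (the 150 iris samples are assigned to 10 pretend subjects) purely to show how groups is passed to fit.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grouping: pretend the samples came from 10 subjects.
groups = np.arange(len(X)) % 10

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9)},
    n_iter=3,
    cv=GroupKFold(n_splits=5),  # keeps each group within a single fold
    random_state=0,
)
search.fit(X, y, groups=groups)

print(search.n_splits_)
print(search.best_params_)
```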

---------------------------------------------------------------------------------------------------------------------------------

3. get_params(deep=True)

Get parameters for this estimator.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

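deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict

Parameter names mapped to their values.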

---------------------------------------------------------------------------------------------------------------------------------

4. inverse_transform(Xt)

Call inverse_transform on the estimator with the best found parameters.

Only available if the underlying estimator implements inverse_transform and refit=True.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

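Xt: indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

X: {ndarray, sparse matrix} of shape (n_samples, n_features)

Result of the inverse_transform function for Xt based on the estimator with the best found parameters.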

---------------------------------------------------------------------------------------------------------------------------------

5. predict(X)

Call predict on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

X: indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred: ndarray of shape (n_samples,)

The predicted labels or values for X based on the estimator with the best found parameters.

---------------------------------------------------------------------------------------------------------------------------------

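6. predict_log_proba(X)

Call predict_log_proba on the estimator with the best found parameters.

7. predict_proba(X)

Call predict_proba on the estimator with the best found parameters.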

8. score(X, y=None)

Return the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the best_estimator_.score method otherwise.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

X: array-like of shape (n_samples, n_features)

Input data, where n_samples is the number of samples and n_features is the number of features.

y: array-like of shape (n_samples, n_output) or (n_samples,), default=None

Target relative to X for classification or regression; None for unsupervised learning.

Returns:

score: float

The score defined by scoring if provided, and the best_estimator_.score method otherwise.
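
To illustrate that score follows the scoring argument rather than the estimator's default metric, the sketch below sets scoring='f1_macro' and compares search.score with the same metric computed by hand. The train/test split and search space are arbitrary.

```python
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

search = RandomizedSearchCV(
    SVC(),
    {"C": uniform(loc=1, scale=9), "kernel": ["linear", "rbf"]},
    n_iter=4,
    scoring="f1_macro",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# score() applies the macro-F1 scorer to the refitted best estimator.
via_search = search.score(X_test, y_test)
by_hand = f1_score(y_test, search.predict(X_test), average="macro")
print(via_search, by_hand)
```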

---------------------------------------------------------------------------------------------------------------------------------

9. set_params(**params)

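Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.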

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

**params: dict

Estimator parameters.

Returns:

self: estimator instance

Estimator instance.
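
The <component>__<parameter> naming shows up in two places when the estimator is a Pipeline. In the sketch below (step names "scale" and "svc" are chosen for this example), param_distributions addresses the pipeline step directly, while set_params on the search object goes through the estimator prefix.

```python
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Keys in param_distributions are relative to the pipeline: step "svc", parameter C.
search = RandomizedSearchCV(pipe, {"svc__C": uniform(loc=1, scale=9)},
                            n_iter=4, random_state=0)

# On the search object, the wrapped pipeline is reachable as "estimator".
search.set_params(n_iter=8, estimator__svc__kernel="linear")

params = search.get_params(deep=True)
print(params["n_iter"], params["estimator__svc__kernel"])
```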

---------------------------------------------------------------------------------------------------------------------------------

10. transform(X)

Call transform on the estimator with the best found parameters.

Only available if the underlying estimator supports transform and refit=True.

---------------------------------------------------------------------------------------------------------------------------------

Parameters:

X: indexable, length n_samples

Must fulfill the input assumptions of the underlying estimator.

Returns:

Xt: {ndarray, sparse matrix} of shape (n_samples, n_features)

X transformed in the new space based on the estimator with the best found parameters.

---------------------------------------------------------------------------------------------------------------------------------

Example:

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from scipy.stats import uniform
>>> iris = load_iris()
>>> logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
...                               random_state=0)
>>> distributions = dict(C=uniform(loc=0, scale=4),
...                      penalty=['l2', 'l1'])
>>> clf = RandomizedSearchCV(logistic, distributions, random_state=0)
>>> search = clf.fit(iris.data, iris.target)
>>> search.best_params_
{'C': 2..., 'penalty': 'l1'}