Machine Learning (4): Hyperparameter Tuning

Article link: https://blog.csdn.net/qq_35319215/article/details/96921169

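I. Common Tuning Methods

(1) Grid Search

Grid search evaluates every point on a grid over the search range and keeps the best one. Start with a wide range and a large step size, then narrow the range and shrink the step around the best point found.
When the objective is non-convex, this coarse-to-fine strategy can easily miss the global optimum.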
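The coarse-to-fine procedure described above can be sketched in plain Python. This is a minimal illustration, not a library API: `grid_search`, `coarse_to_fine`, and the one-dimensional `toy_loss` objective are hypothetical names introduced here, and the loss is chosen to be convex so the refinement is guaranteed to home in on the minimum.

```python
def grid_search(objective, low, high, steps):
    """Evaluate `objective` on an evenly spaced grid; return the best point and the step."""
    step = (high - low) / (steps - 1)
    points = [low + i * step for i in range(steps)]
    best = min(points, key=objective)
    return best, step


def coarse_to_fine(objective, low, high, steps=11, rounds=3):
    """Repeat grid search, each round zooming in around the previous best point."""
    best = None
    for _ in range(rounds):
        best, step = grid_search(objective, low, high, steps)
        # Narrow the range to one grid cell on either side of the best point.
        low, high = best - step, best + step
    return best


def toy_loss(x):
    # Hypothetical objective with its minimum at x = 3.7.
    return (x - 3.7) ** 2


best_x = coarse_to_fine(toy_loss, 0.0, 10.0)
print(best_x)
```

On a non-convex objective the first coarse grid may land in the wrong basin, and every later round then refines a local optimum only, which is the failure mode noted above.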

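(2) Random Search

Random search samples points at random within the search range.
It does not guarantee finding the global optimum.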
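A minimal sketch of random search, under the same assumptions as a toy setting: `random_search` and `toy_loss` are hypothetical names, the search space is a single interval, and points are drawn uniformly with a fixed seed so runs are repeatable.

```python
import random


def random_search(objective, low, high, n_trials, seed=0):
    """Sample `n_trials` points uniformly in [low, high] and keep the best one."""
    rng = random.Random(seed)
    best_x, best_loss = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        loss = objective(x)
        if loss < best_loss:
            best_x, best_loss = x, loss
    return best_x, best_loss


def toy_loss(x):
    # Hypothetical objective with its minimum at x = 3.7.
    return (x - 3.7) ** 2


best_x, best_loss = random_search(toy_loss, 0.0, 10.0, n_trials=200)
print(best_x, best_loss)
```

More trials can only improve the best value found for a given seed, but no finite number of trials guarantees hitting the global optimum.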

(3) Bayesian Optimization

from sklearn.model_selection import GridSearchCV

SVM

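C: the penalty coefficient on the slack variables. The larger C is, the more heavily misclassified samples are penalized and the more easily the model overfits. Typical values range from 0.0001 to 10000.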
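Because the suggested range for C spans eight orders of magnitude, candidates are usually spaced on a log scale. The following is a sketch only: the iris dataset, the RBF kernel, the nine-point grid, and 5-fold cross-validation are all assumptions made for illustration, not choices taken from the text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate C values: 1e-4, 1e-3, ..., 1e4 (evenly spaced on a log scale).
param_grid = {"C": [10.0 ** k for k in range(-4, 5)]}

# 5-fold cross-validated grid search over C for an RBF-kernel SVM.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Once the best order of magnitude is known, a second, finer grid around it follows the coarse-to-fine idea from the grid search section.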

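Logistic Regression

Decision Tree

max_depth: the maximum depth of the tree

min_samples_split: the minimum number of samples required to split an internal node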
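To see what these two parameters do in practice, one can compare an unrestricted tree with a restricted one. This is a small sketch; the iris dataset and the values `max_depth=2` and `min_samples_split=20` are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limited depth, and nodes with fewer than 20 samples are not split further.
restricted = DecisionTreeClassifier(
    max_depth=2, min_samples_split=20, random_state=0
).fit(X, y)

print("unrestricted:", unrestricted.get_depth(), unrestricted.get_n_leaves())
print("restricted:  ", restricted.get_depth(), restricted.get_n_leaves())
```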

Parameters
criterion : string, optional (default="gini")
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "entropy" for the information gain.

splitter : string, optional (default="best")
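"best" or "random". "best" finds the optimal split among all candidate split points of the features; "random" picks the locally best split among a random subset of candidate split points.
"best" suits datasets that are not large; when the dataset is very large, "random" is recommended.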

max_features : int, float, string or None, optional (default=None)
The maximum number of features considered when splitting.

max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.

min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:

- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.
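The int/float rule quoted above comes down to a small piece of arithmetic. The helper below is a hypothetical illustration of that documented formula only; it is not scikit-learn's internal code, which may apply additional bounds.

```python
import math


def effective_min_samples(value, n_samples):
    """Turn an int-or-float setting into an absolute sample count."""
    if isinstance(value, int):
        # An int is used directly as the minimum number of samples.
        return value
    # A float is a fraction of the training set, rounded up.
    return math.ceil(value * n_samples)


print(effective_min_samples(10, 150))    # int: used as-is
print(effective_min_samples(0.25, 150))  # float: ceil(0.25 * 150) = ceil(37.5)
```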

min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:

- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.

min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.

max_leaf_nodes : int or None, optional (default=None)
Grow a tree with `max_leaf_nodes` in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
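A quick way to check the effect of `max_leaf_nodes` is to fit a tree and count its leaves. This is a sketch; the iris dataset and the limit of 4 leaves are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the tree best-first, stopping once it has 4 leaves.
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

print(tree.get_n_leaves())
```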

class_weight : dict, list of dicts, "balanced" or None, optional (default=None)
Weights associated with classes in the form `{class_label: weight}`.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.

The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as `n_samples / (n_classes * np.bincount(y))`

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
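The "balanced" formula `n_samples / (n_classes * np.bincount(y))` can be worked through by hand. The sketch below uses a hypothetical helper, `balanced_class_weights`, built on `collections.Counter` instead of NumPy, and assumes every class appears at least once in `y`.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency in `y`."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Three samples of class 0 and one of class 1.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)  # class 0 -> 4 / (2 * 3), class 1 -> 4 / (2 * 1)
```

The rare class receives the larger weight, so each class contributes the same total weight to training.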

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.

min_impurity_split : float, optional (default=1e-7)
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.

presort : bool, optional (default=False)
Whether to presort the data to speed up the finding of best splits in
fitting. For the default settings of a decision tree on large
datasets, setting this to true may slow down the training process.
When using either a smaller dataset or a restricted depth, this may
speed up the training.

Random Forest

Parameters
n_estimators : integer, optional (default=10)
The number of trees in the forest.

criterion : string, optional (default="mse")
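The criterion the CART trees use to evaluate a split.
For a classification RF, the CART classification trees default to gini; the alternative criterion is information gain (entropy).
For a regression RF, the CART regression trees default to mean squared error (mse); the alternative criterion is mean absolute error (mae).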

max_features : int, float, string or None, optional (default="auto")
The maximum number of features the RF considers at each split.
- If int, then consider `max_features` features at each split.
- If float, then `max_features` is a percentage and
`int(max_features * n_features)` features are considered at each
split.
- If "auto", then `max_features=n_features`.
- If "sqrt", then `max_features=sqrt(n_features)`.
- If "log2", then `max_features=log2(n_features)`.
- If None, then `max_features=n_features`.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than `max_features` features.
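The rules listed above map each accepted `max_features` value to a feature count. The helper below is a hypothetical restatement of that list, not scikit-learn's own code; in particular, the "auto" branch follows the text quoted here, and its meaning may differ between estimators and library versions.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return how many features are considered at each split."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    if isinstance(max_features, int):
        return max_features
    # A float is read as a fraction of the available features.
    return int(max_features * n_features)


for setting in (None, "auto", "sqrt", "log2", 5, 0.5):
    print(setting, resolve_max_features(setting, 16))
```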

max_depth : integer or None, optional (default=None)
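The maximum depth of each decision tree. It can be left unset, in which case the depth of the subtrees is not limited while the tree is built.
With little data or few features this value can be left alone. With many samples and many features, limiting the maximum depth is recommended; commonly used values lie between 10 and 100.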

min_samples_split : int, float, optional (default=2)
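Limits when a subtree may keep splitting: if a node has fewer than min_samples_split samples, no further attempt is made to choose the best feature to split it on.
If the sample size is not large, this value can be left alone. If the sample size is very large, increasing it is recommended.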
- If int, then consider `min_samples_split` as the minimum number.
- If float, then `min_samples_split` is a percentage and
`ceil(min_samples_split * n_samples)` are the minimum
number of samples for each split.

min_samples_leaf : int, float, optional (default=1)
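Limits the minimum number of samples in a leaf node: if a leaf has fewer samples than this value, it is pruned together with its sibling.
Accepts either an integer minimum sample count or the minimum as a fraction of the total number of samples.
If the sample size is not large, this value can be left alone. If the sample size is very large, increasing it is recommended.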
- If int, then consider `min_samples_leaf` as the minimum number.
- If float, then `min_samples_leaf` is a percentage and
`ceil(min_samples_leaf * n_samples)` are the minimum
number of samples for each node.

min_weight_fraction_leaf : float, optional (default=0.)
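Limits the minimum sum of sample weights in a leaf node; a leaf below this value is pruned together with its sibling.
In general, if many samples have missing values, or the class distribution of a classification tree is heavily skewed, sample weights get introduced, and this value then needs attention.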

max_leaf_nodes : int or None, optional (default=None)
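Limiting the maximum number of leaf nodes helps prevent overfitting.
With few features this value can be ignored, but with very many features it is worth restricting; a concrete value can be found by cross-validation.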

min_impurity_split : float, optional (default=1e-7)
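Limits the growth of the tree: if a node's impurity (based on Gini impurity or mean squared error) falls below this threshold, the node produces no children
and becomes a leaf. Changing the default of 1e-7 is generally not recommended.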

bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.

oob_score : bool, optional (default=False)
Whether to use out-of-bag samples to estimate
the R^2 on unseen data.
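A short sketch of the out-of-bag estimate in use. The synthetic regression data from `make_regression` and the choice of 100 trees are assumptions made for illustration; `oob_score=True` relies on `bootstrap=True`, since only bootstrap sampling leaves some samples out of each tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X, y)

# R^2 computed on the samples each tree did not see during training.
print(forest.oob_score_)
```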

n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both `fit` and `predict`.
If -1, then the number of jobs is set to the number of cores.

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.

verbose : int, optional (default=0)
Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble;
otherwise, just fit a whole new forest.
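The effect of `warm_start` is easiest to see by growing a forest in two steps. This is a sketch on synthetic data; the dataset and the tree counts of 10 and 20 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
first_trees = list(forest.estimators_)

# Raise the target size and fit again: only the 10 missing trees are trained.
forest.set_params(n_estimators=20)
forest.fit(X, y)

print(len(forest.estimators_))
```

With `warm_start=False`, the second call to `fit` would discard the first 10 trees and train 20 new ones.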