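I. Common hyperparameter tuning methods
(1) Grid search
- Evaluates every point on a grid over the search range and keeps the best one. Start with a wide range and a large step, then narrow the range and shrink the step around the best point found.
- If the objective is non-convex, this coarse-to-fine procedure can easily miss the global optimum.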
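The coarse-to-fine procedure can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration rather than a recommended recipe: the iris dataset, the RBF `SVC`, and the specific grids are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Stage 1: wide range, large step (one candidate every two decades of C).
coarse = GridSearchCV(SVC(kernel="rbf"), {"C": np.logspace(-4, 4, 5)}, cv=5)
coarse.fit(X, y)
best_c = coarse.best_params_["C"]

# Stage 2: narrower range (one decade either side of the coarse winner), smaller step.
fine_grid = np.logspace(np.log10(best_c) - 1, np.log10(best_c) + 1, 9)
fine = GridSearchCV(SVC(kernel="rbf"), {"C": fine_grid}, cv=5)
fine.fit(X, y)

print("coarse best C:", best_c, "score:", coarse.best_score_)
print("fine best C:", fine.best_params_["C"], "score:", fine.best_score_)
```

Because stage 2 only looks near the stage 1 winner, a good region that the coarse grid stepped over is never revisited, which is how the search can miss the global optimum.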
(2) Random search
- Evaluates points sampled at random from the search range.
- Does not guarantee finding the global optimum.
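A minimal sketch of random search using scikit-learn's `RandomizedSearchCV`. The dataset, the estimator, the `gamma` range, and the budget of 20 samples are assumptions made for illustration; the range for `C` mirrors the 0.0001 to 10000 range used in these notes.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    # Sample both hyperparameters log-uniformly instead of walking a fixed grid.
    param_distributions={
        "C": loguniform(1e-4, 1e4),
        "gamma": loguniform(1e-4, 1e1),  # assumed range, not from the notes
    },
    n_iter=20,       # number of random points to evaluate
    cv=5,
    random_state=0,  # fixed seed so the sampled points are reproducible
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The cost is fixed by `n_iter` regardless of how many hyperparameters are searched, but the result is only the best of the sampled points.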
(3) Bayesian optimization
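Bayesian optimization fits a surrogate model to the scores observed so far and uses it to choose the next point to evaluate. The following is a hand-rolled sketch of that loop, not the API of any dedicated library: the Gaussian-process surrogate, the expected-improvement acquisition, the candidate grid over log10(C), and the budget of 10 evaluations are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(log_c):
    """Mean 5-fold CV accuracy of an RBF SVM with C = 10**log_c."""
    return cross_val_score(SVC(C=10.0 ** log_c), X, y, cv=5).mean()

candidates = np.linspace(-4, 4, 81).reshape(-1, 1)  # search space for log10(C)
rng = np.random.default_rng(0)
tried = list(rng.choice(candidates.ravel(), size=3, replace=False))  # random warm-up
scores = [objective(c) for c in tried]

for _ in range(7):
    # Fit the surrogate on every (hyperparameter, score) pair seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(np.array(tried).reshape(-1, 1), np.array(scores))
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero at observed points
    # Expected improvement over the best score observed so far (maximization).
    gain = mu - max(scores)
    z = gain / sigma
    ei = gain * norm.cdf(z) + sigma * norm.pdf(z)
    next_c = float(candidates[np.argmax(ei), 0])
    tried.append(next_c)
    scores.append(objective(next_c))

best = int(np.argmax(scores))
print("best log10(C):", tried[best], "CV accuracy:", scores[best])
```

Unlike grid or random search, each new evaluation here depends on all previous ones, so the evaluations cannot be run fully in parallel.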
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
SVM
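- C: the penalty coefficient on the slack variables. The larger C is, the more heavily misclassified samples are penalized and the more prone the model is to overfitting. Typical values range from 0.0001 to 10000.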
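One way to see the effect of C is to sweep it over the range above on a log scale and compare training and validation accuracy. This is a sketch under assumptions: the breast cancer dataset, the standardization step, and the RBF kernel are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
c_values = np.logspace(-4, 4, 9)  # 0.0001 ... 10000, one value per decade

# Scale features first; "svc__C" addresses the C parameter of the SVC step.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
train_scores, val_scores = validation_curve(
    model, X, y, param_name="svc__C", param_range=c_values, cv=5
)

for c, tr, va in zip(c_values, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={c:<8g} train={tr:.3f} validation={va:.3f}")
```

Training accuracy that keeps rising with C while validation accuracy stalls or drops is the overfitting pattern described above.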
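Logistic regression
Decision tree
max_depth: the maximum depth of the tree
min_samples_split: the minimum number of samples required to split an internal node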
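A small sketch of how these two parameters restrict tree growth. The dataset and the particular values (`max_depth=3`, `min_samples_split=20`) are assumptions picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default settings: the tree grows until its leaves are pure.
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)

# Cap the depth and refuse to split nodes holding fewer than 20 samples.
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, random_state=0
).fit(X, y)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Both limits act as pre-pruning: they stop splits before they happen rather than removing branches afterwards.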
Parameters |
---|
criterion : string, optional (default="gini") |
The function to measure the quality of a split. Supported criteria are |
"gini" for the Gini impurity and "entropy" for the information gain. |
splitter : string, optional (default="best") |
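"best" or "random". "best" finds the optimal split among all candidate split points of the features; "random" finds a locally optimal split among a random subset of the split points. |
"best" suits datasets that are not large; for very large datasets, "random" is recommended. |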
max_features : int, float, string or None, optional (default=None) |
The maximum number of features to consider when looking for a split. |
max_depth : int or None, optional (default=None) |
The maximum depth of the tree. If None, then nodes are expanded until |
all leaves are pure or until all leaves contain less than |
min_samples_split samples. |
min_samples_split : int, float, optional (default=2) |
The minimum number of samples required to split an internal node: |
- If int, then consider min_samples_split as the minimum number. |
- If float, then min_samples_split is a percentage and |
ceil(min_samples_split * n_samples) are the minimum |
number of samples for each split. |
min_samples_leaf : int, float, optional (default=1) |
The minimum number of samples required to be at a leaf node: |
- If int, then consider min_samples_leaf as the minimum number. |
- If float, then min_samples_leaf is a percentage and |
ceil(min_samples_leaf * n_samples) are the minimum |
number of samples for each node. |
min_weight_fraction_leaf : float, optional (default=0.) |
The minimum weighted fraction of the sum total of weights (of all |
the input samples) required to be at a leaf node. Samples have |
equal weight when sample_weight is not provided. |
max_leaf_nodes : int or None, optional (default=None) |
Grow a tree with max_leaf_nodes in best-first fashion. |
Best nodes are defined as relative reduction in impurity. |
If None then unlimited number of leaf nodes. |
class_weight : dict, list of dicts, "balanced" or None, optional (default=None) |
Weights associated with classes in the form {class_label: weight} . |
If not given, all classes are supposed to have weight one. For |
multi-output problems, a list of dicts can be provided in the same |
order as the columns of y. |
The "balanced" mode uses the values of y to automatically adjust |
weights inversely proportional to class frequencies in the input data |
as n_samples / (n_classes * np.bincount(y)) |
For multi-output, the weights of each column of y will be multiplied. |
Note that these weights will be multiplied with sample_weight (passed |
through the fit method) if sample_weight is specified. |
random_state : int, RandomState instance or None, optional (default=None) |
If int, random_state is the seed used by the random number generator; |
If RandomState instance, random_state is the random number generator; |
If None, the random number generator is the RandomState instance used |
by np.random . |
min_impurity_split : float, optional (default=1e-7) |
Threshold for early stopping in tree growth. A node will split |
if its impurity is above the threshold, otherwise it is a leaf. |
presort : bool, optional (default=False) |
Whether to presort the data to speed up the finding of best splits in |
fitting. For the default settings of a decision tree on large |
datasets, setting this to true may slow down the training process. |
When using either a smaller dataset or a restricted depth, this may |
speed up the training. |
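Two of the formulas in the table above can be checked numerically: the "balanced" class weights and the conversion of a float `min_samples_split` into a sample count. The toy label vector and the 0.25 fraction are assumptions made for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])  # toy labels: 6, 3 and 1 samples per class
n_samples, n_classes = len(y), len(np.unique(y))

# "balanced" weights as given in the docstring: n_samples / (n_classes * np.bincount(y))
manual = n_samples / (n_classes * np.bincount(y))
builtin = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(manual)
print(builtin)

# A float min_samples_split is turned into an absolute count with ceil().
min_samples_split = 0.25
print(int(np.ceil(min_samples_split * n_samples)))
```

The rarest class receives the largest weight, so its samples count for more when impurity is computed.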
Random forest
Parameters |
---|
n_estimators : integer, optional (default=10) |
The number of trees in the forest. |
criterion : string, optional (default="mse") |
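The criterion the CART trees use to evaluate a split. |
For a classification forest, the CART classification trees default to gini; the alternative is information gain (entropy). |
For a regression forest, the CART regression trees default to mean squared error (mse); the alternative is mean absolute error (mae). |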
max_features : int, float, string or None, optional (default="auto") |
The maximum number of features the forest considers when looking for a split. |
- If int, then consider max_features features at each split. |
- If float, then max_features is a percentage and |
int(max_features * n_features) features are considered at each |
split. |
- If "auto", then max_features=n_features . |
- If "sqrt", then max_features=sqrt(n_features) . |
- If "log2", then max_features=log2(n_features) . |
- If None, then max_features=n_features . |
Note: the search for a split does not stop until at least one |
valid partition of the node samples is found, even if it requires to |
effectively inspect more than max_features features. |
max_depth : integer or None, optional (default=None) |
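The maximum depth of each decision tree. It can be left unset, in which case the depth of the subtrees is not limited while the tree is built. |
With little data or few features this value can be ignored. With many samples and many features, limiting the depth is recommended; commonly used values lie between 10 and 100. |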
min_samples_split : int, float, optional (default=2) |
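Limits when a subtree may keep splitting: if a node holds fewer than min_samples_split samples, no further attempt is made to choose a best feature and split it. |
If the sample size is small, this value can be left alone. If the sample size is very large, increasing it is recommended. |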
- If int, then consider min_samples_split as the minimum number. |
- If float, then min_samples_split is a percentage and |
ceil(min_samples_split * n_samples) are the minimum |
number of samples for each split. |
min_samples_leaf : int, float, optional (default=1) |
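Limits the minimum number of samples at a leaf: if a leaf would hold fewer samples than this value, it is pruned together with its sibling. |
The value can be an integer minimum number of samples, or a fraction of the total number of samples. |
If the sample size is small, this value can be left alone. If the sample size is very large, increasing it is recommended. |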
- If int, then consider min_samples_leaf as the minimum number. |
- If float, then min_samples_leaf is a percentage and |
ceil(min_samples_leaf * n_samples) are the minimum |
number of samples for each node. |
min_weight_fraction_leaf : float, optional (default=0.) |
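Limits the minimum sum of sample weights at a leaf: if the sum falls below this value, the leaf is pruned together with its sibling. |
In general, sample weights are introduced when many samples have missing values or when the class distribution of a classification tree is heavily skewed; in those cases this value needs attention. |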
max_leaf_nodes : int or None, optional (default=None) |
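Limiting the maximum number of leaf nodes helps prevent overfitting. |
With few features this value can be ignored; with many features it is worth limiting, and a concrete value can be chosen by cross-validation. |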
min_impurity_split : float, optional (default=1e-7) |
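Limits tree growth: if a node's impurity (Gini impurity for classification, mean squared error for regression) is below this threshold, the node produces no children |
and becomes a leaf. Changing the default of 1e-7 is generally not recommended. |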
bootstrap : boolean, optional (default=True) |
Whether bootstrap samples are used when building trees. |
oob_score : bool, optional (default=False) |
whether to use out-of-bag samples to estimate |
the R^2 on unseen data. |
n_jobs : integer, optional (default=1) |
The number of jobs to run in parallel for both fit and predict . |
If -1, then the number of jobs is set to the number of cores. |
random_state : int, RandomState instance or None, optional (default=None) |
If int, random_state is the seed used by the random number generator; |
If RandomState instance, random_state is the random number generator; |
If None, the random number generator is the RandomState instance used |
by np.random . |
verbose : int, optional (default=0) |
Controls the verbosity of the tree building process. |
warm_start : bool, optional (default=False) |
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble; |
otherwise, just fit a whole new forest. |
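Several of the parameters in this table can be seen working together in a short sketch. The dataset and all parameter values below are assumptions chosen for illustration, and the string options accepted by `max_features` vary between scikit-learn versions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

forest = RandomForestRegressor(
    n_estimators=50,
    max_features="sqrt",  # consider sqrt(n_features) features at each split
    min_samples_leaf=5,
    bootstrap=True,       # required for out-of-bag scoring
    oob_score=True,       # estimate R^2 on the samples each tree did not see
    warm_start=True,
    random_state=0,
)
forest.fit(X, y)
first_batch = list(forest.estimators_)
print("trees:", len(forest.estimators_), "OOB R^2:", forest.oob_score_)

# With warm_start=True, raising n_estimators and refitting keeps the existing
# 50 trees and only trains the 50 additional ones.
forest.set_params(n_estimators=100)
forest.fit(X, y)
print("trees:", len(forest.estimators_), "OOB R^2:", forest.oob_score_)
```

The out-of-bag score gives a validation estimate without a separate hold-out set, which makes it convenient for checking whether adding trees still helps.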