Machine Learning (4): Hyperparameter Tuning

1. Common tuning methods

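(1) Grid search
  • Evaluates every point on a grid over the search range to find the best value. Start with a wide range and a large step, then narrow the range and shrink the step around the best point found.
  • If the objective is non-convex, the coarse pass can easily step over the global optimum.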
(2) Random search
  • Samples points at random within the search range.
  • Does not guarantee finding the global optimum.
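
A minimal sketch of random search with scikit-learn's RandomizedSearchCV. The dataset (iris), the estimator (an RBF-kernel SVC), the sampling ranges, and n_iter=20 are all illustrative assumptions, not values taken from these notes.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sample C and gamma on a log scale instead of enumerating a grid.
# The ranges below are assumptions chosen for illustration.
param_distributions = {
    "C": loguniform(1e-4, 1e4),
    "gamma": loguniform(1e-4, 1e1),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=20,        # number of random points to evaluate
    cv=5,             # 5-fold cross-validation per point
    random_state=0,   # makes the sampled points reproducible
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

The best point found is only the best of the 20 sampled points, which is why the method offers no guarantee of reaching the global optimum.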
(3) Bayesian optimization

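# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV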
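
The coarse-to-fine grid search described above could be sketched as follows. The dataset (iris), the estimator (SVC), and the specific grids are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pass 1: wide range, large (one decade) step.
coarse_grid = {"C": np.logspace(-4, 4, 9)}  # 1e-4, 1e-3, ..., 1e4
coarse = GridSearchCV(SVC(kernel="rbf"), coarse_grid, cv=5)
coarse.fit(X, y)
best_c = coarse.best_params_["C"]

# Pass 2: one decade either side of the coarse optimum, smaller step.
low, high = np.log10(best_c) - 1, np.log10(best_c) + 1
fine_grid = {"C": np.logspace(low, high, 11)}
fine = GridSearchCV(SVC(kernel="rbf"), fine_grid, cv=5)
fine.fit(X, y)

print("coarse best C:", best_c, "CV accuracy:", coarse.best_score_)
print("fine best C:", fine.best_params_["C"], "CV accuracy:", fine.best_score_)
```

The second pass only refines the neighbourhood of the first pass's winner, so a non-convex score surface can still hide a better region that the coarse grid skipped.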


SVM

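  • C: the penalty coefficient on the slack variables. The larger C is, the more heavily misclassifications are penalized and the more easily the model overfits. Values are typically chosen from 0.0001 to 10000.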
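
One way to see the effect of C is to compare training and validation accuracy across the 0.0001 to 10000 range. The sketch below uses scikit-learn's validation_curve; the synthetic noisy dataset and the RBF kernel are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic data with 10% label noise (an assumption, to make overfitting visible).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

C_range = np.logspace(-4, 4, 9)  # 1e-4 ... 1e4
train_scores, val_scores = validation_curve(
    SVC(kernel="rbf", gamma="scale"),
    X,
    y,
    param_name="C",
    param_range=C_range,
    cv=5,
)

# One row per C value: mean accuracy over the 5 folds.
for C, tr, va in zip(C_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={C:g}  train={tr:.3f}  validation={va:.3f}")
```

A training score that keeps rising with C while the validation score flattens or falls is the overfitting pattern the note above describes.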

Logistic regression


Decision tree

max_depth: the maximum depth of the tree

min_samples_split: the minimum number of samples required to split an internal node
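
A minimal sketch of tuning these two parameters jointly with GridSearchCV. The dataset (breast cancer) and the candidate values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate values are assumptions; None means the depth is not limited.
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "min_samples_split": [2, 10, 20, 50],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed for reproducible trees
    param_grid,
    cv=5,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```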

Parameters
criterion : string, optional (default="gini")
The function to measure the quality of a split. Supported criteria are
"gini" for the Gini impurity and "entropy" for the information gain.
splitter : string, optional (default="best")
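"best" or "random". "best" finds the optimal split among all candidate split points of the features; "random" picks a locally optimal split among a random subset of split points.
"best" suits small datasets; for very large datasets "random" is recommended.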
max_features : int, float, string or None, optional (default=None)
The maximum number of features to consider when looking for a split.
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until
all leaves are pure or until all leaves contain less than
min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a percentage and
ceil(min_samples_split * n_samples) are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node:
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a percentage and
ceil(min_samples_leaf * n_samples) are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
The minimum weighted fraction of the sum total of weights (of all
the input samples) required to be at a leaf node. Samples have
equal weight when sample_weight is not provided.
max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion.
Best nodes are defined as relative reduction in impurity.
If None then unlimited number of leaf nodes.
class_weight : dict, list of dicts, "balanced" or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}.
If not given, all classes are supposed to have weight one. For
multi-output problems, a list of dicts can be provided in the same
order as the columns of y.
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y))
For multi-output, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed
through the fit method) if sample_weight is specified.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.
min_impurity_split : float, optional (default=1e-7)
Threshold for early stopping in tree growth. A node will split
if its impurity is above the threshold, otherwise it is a leaf.
presort : bool, optional (default=False)
Whether to presort the data to speed up the finding of best splits in
fitting. For the default settings of a decision tree on large
datasets, setting this to true may slow down the training process.
When using either a smaller dataset or a restricted depth, this may
speed up the training.
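
The "balanced" formula quoted in the class_weight entry above, n_samples / (n_classes * np.bincount(y)), can be checked by hand against scikit-learn's compute_class_weight helper. The label vector below is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six samples of class 0, three of class 1, one of class 2.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
classes = np.unique(y)

# Formula from the docstring: n_samples / (n_classes * np.bincount(y))
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn.
auto = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual:", manual)
print("sklearn:", auto)
```

The rarest class receives the largest weight, here 10 / (3 * 1), and the most frequent class the smallest, 10 / (3 * 6).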

Random forest

Parameters
n_estimators : integer, optional (default=10)
The number of trees in the forest.
criterion : string, optional (default="mse")
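The criterion the CART trees use to evaluate a split.
For a classification RF, the CART classification trees default to "gini"; the alternative is information gain ("entropy").
For a regression RF, the CART regression trees default to mean squared error ("mse"); the alternative is mean absolute error ("mae").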
max_features : int, float, string or None, optional (default="auto")
The maximum number of features the RF considers when looking for a split.
- If int, then consider max_features features at each split.
- If float, then max_features is a percentage and
int(max_features * n_features) features are considered at each
split.
- If "auto", then max_features=n_features.
- If "sqrt", then max_features=sqrt(n_features).
- If "log2", then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: the search for a split does not stop until at least one
valid partition of the node samples is found, even if it requires to
effectively inspect more than max_features features.
max_depth : integer or None, optional (default=None)
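The maximum depth of each decision tree. It may be left unset, in which case the depth of the subtrees is not limited while the tree is built.
With little data or few features this value can be ignored. With many samples and many features, limiting the depth is recommended; values between 10 and 100 are common.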
min_samples_split : int, float, optional (default=2)
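Limits when a subtree may keep splitting: if a node holds fewer than min_samples_split samples, no further attempt is made to choose a best feature and split it.
With a small dataset this value can be ignored. With a very large dataset, increasing it is recommended.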
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a percentage and
ceil(min_samples_split * n_samples) are the minimum
number of samples for each split.
min_samples_leaf : int, float, optional (default=1)
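Sets the minimum number of samples in a leaf. If a leaf would hold fewer samples than this value, it is pruned together with its sibling.
Accepts either an integer minimum count or a fraction of the total number of samples.
With a small dataset this value can be ignored. With a very large dataset, increasing it is recommended.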
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a percentage and
ceil(min_samples_leaf * n_samples) are the minimum
number of samples for each node.
min_weight_fraction_leaf : float, optional (default=0.)
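Sets the minimum total sample weight in a leaf. If a leaf falls below this value, it is pruned together with its sibling.
Sample weights are usually introduced when many samples have missing values or when the class distribution of a classification tree is heavily skewed; in those cases this value deserves attention.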
max_leaf_nodes : int or None, optional (default=None)
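Limiting the maximum number of leaf nodes helps prevent overfitting.
With few features this value can be ignored. With many features it is worth limiting, and a concrete value can be chosen by cross-validation.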
min_impurity_split : float, optional (default=1e-7)
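Limits tree growth: if a node's impurity (Gini impurity or mean squared error) is below this threshold, the node produces no children
and becomes a leaf. Changing the default of 1e-7 is generally not recommended.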
bootstrap : boolean, optional (default=True)
Whether bootstrap samples are used when building trees.
oob_score : bool, optional (default=False)
Whether to use out-of-bag samples to estimate
the R^2 on unseen data.
n_jobs : integer, optional (default=1)
The number of jobs to run in parallel for both fit and predict.
If -1, then the number of jobs is set to the number of cores.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.
verbose : int, optional (default=0)
Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble;
otherwise, just fit a whole new forest.
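
A minimal sketch of how oob_score and warm_start behave in practice. It assumes a RandomForestClassifier on the breast cancer dataset, so oob_score_ here is an out-of-bag accuracy (for a regressor it would be R^2); the tree counts are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",
    oob_score=True,    # score each sample using only trees that did not see it
    warm_start=True,   # keep already-fitted trees on later calls to fit
    random_state=0,
)
forest.fit(X, y)
first_tree = forest.estimators_[0]
print("trees:", len(forest.estimators_), "OOB accuracy:", round(forest.oob_score_, 3))

# Raising n_estimators and refitting grows 50 new trees and keeps the first 50.
forest.set_params(n_estimators=100)
forest.fit(X, y)
print("trees:", len(forest.estimators_), "OOB accuracy:", round(forest.oob_score_, 3))
print("first tree reused:", forest.estimators_[0] is first_tree)
```

The out-of-bag estimate needs no separate validation split, which makes it a cheap signal when deciding how many trees are enough.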

GBDT


XGBoost


LightGBM
