Bagging ensemble
Based on bootstrap sampling (sampling with replacement).
Each round, m samples are drawn from the training set.
A base learner is trained on each sampled set,
and these base learners are then combined.
Bagging mainly focuses on reducing variance.
from sklearn.ensemble import BaggingClassifier
(base_estimator=None, n_estimators=10,
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False,
oob_score=False, warm_start=False, n_jobs=None,
random_state=None, verbose=0)
from sklearn.ensemble import BaggingRegressor
(base_estimator=None, n_estimators=10,
max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False,
oob_score=False, warm_start=False, n_jobs=None,
random_state=None, verbose=0)
- base_estimator
- the base estimator to fit on random subsets of the dataset; a decision tree by default
- n_estimators
- the number of base estimators
- 10, default
- int
- max_samples
- the number of samples drawn to train each base estimator
- int, absolute number
- float, proportion of the dataset
- max_features
- the number of features drawn to train each base estimator
- int, absolute number
- float, proportion of the features
- bootstrap
- whether samples are drawn with replacement
- bootstrap_features
- whether features are drawn with replacement
- oob_score
- whether to use out-of-bag samples to estimate the generalization error
- warm_start
- n_jobs
- whether to run fit/predict in parallel
- -1, use all processors
- random_state
- same as for decision trees.
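A minimal runnable sketch of the parameters above (the dataset and parameter values are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 decision trees, each trained on 80% of the samples drawn with
# replacement; out-of-bag samples estimate the generalization accuracy
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        max_samples=0.8, bootstrap=True,
                        oob_score=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.oob_score_)            # out-of-bag accuracy estimate
print(clf.score(X_test, y_test))  # held-out accuracy
```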
Random forest
Builds a bagging ensemble with decision trees as the base learners.
In RF, at each node of a base decision tree, a random subset of k attributes is selected, and the best attribute within that subset is then chosen for the split.
The performance of each individual learner is often somewhat lower.
However, as the number of individual learners grows, a random forest usually converges to a lower generalization error.
Training efficiency is usually better than plain bagging, since each split only examines a subset of the attributes.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
(n_estimators='warn', criterion='gini', max_depth=None,
min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features='auto',
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None, bootstrap=True,
oob_score=False, n_jobs=None,
random_state=None, verbose=0,
warm_start=False, class_weight=None)
- n_estimators
- 10, default; will change to 100 in a future release
- int
- criterion
- 'gini', default (classifier) / 'mse', default (regressor)
- 'entropy' / 'mae'
- max_depth
- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf
- max_features
- max_leaf_nodes
- min_impurity_decrease
- bootstrap
- True,default,bootstrp samples(有放回抽样)
- False, 使用所有的数据来构建每一颗树
- n_jobs
- random_state
- warm_start
- False, default, fit a whole new forest
- True, reuse the solution of the previous call and add more estimators to the ensemble
- class_weight
- The default values for the parameters controlling the size of the trees (e.g. max_depth,
min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on
some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by
setting those parameter values.
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with
the same training data, max_features=n_features and bootstrap=False, if the improvement of the
criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic
behaviour during fitting, random_state has to be fixed.
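The notes above can be sketched in a short runnable example (the dataset and parameter values are only illustrative; max_depth is set to control tree size, and random_state is fixed for deterministic fitting):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; max_depth limits tree size (memory), random_state
# fixes the random feature permutation at each split
rf = RandomForestClassifier(n_estimators=100, max_depth=4,
                            bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # held-out accuracy
print(rf.feature_importances_)    # impurity-based feature importances
```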
Voting classifier
class sklearn.ensemble.VotingClassifier(estimators, voting='hard',
weights=None, n_jobs=None, flatten_transform=True)
Parameters
- estimators
- list of (string, estimator) tuples
- voting
- 'hard', default, majority rule
- 'soft', based on the averaged predicted probabilities; recommended for an ensemble of well-calibrated classifiers
- weights
- weights for the classifiers (hard) or for the predicted probabilities (soft)
- None, default
- shape (n_classifiers,)
- n_jobs
- flatten_transform
- affects the output shape of transform when voting='soft'
- True, default
METHODS
- fit(self, X, y[, sample_weight]) Fit the estimators.
- fit_transform(self, X[, y]) Fit to data, then transform it.
- get_params(self[, deep]) Get the parameters of the ensemble estimator
- predict(self, X) Predict class labels for X.
- score(self, X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
- set_params(self, **params) Setting the parameters for the ensemble estimator
- transform(self, X) Return class labels or probabilities for X for each estimator.
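A minimal soft-voting sketch over three heterogeneous classifiers (the base models and weights are only illustrative; the weights emphasize the logistic regression over the other two):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# soft voting averages each classifier's predict_proba output,
# weighted by the weights list, and picks the argmax class
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(random_state=0)),
                ('nb', GaussianNB())],
    voting='soft', weights=[2, 1, 1])
vote.fit(X, y)
print(vote.predict(X[:5]))  # class labels for the first five samples
```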
Voting regressor
Balances out the weaknesses of several regressors with similar performance; a clearly weaker model may drag the ensemble down.
from sklearn.ensemble import VotingRegressor
(estimators, weights=None, n_jobs=None)
- among the METHODS, score returns the R^2 coefficient.
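A minimal sketch averaging two regressors with a VotingRegressor (the dataset and base models are only illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# the ensemble prediction is the average of the base predictions
reg = VotingRegressor(
    estimators=[('lr', LinearRegression()),
                ('rf', RandomForestRegressor(n_estimators=50,
                                             random_state=0))])
reg.fit(X, y)
print(reg.score(X, y))  # score returns the R^2 coefficient
```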