scikit-learn (models used relatively often in engineering): 1.11. Ensemble methods

Reference: http://scikit-learn.org/stable/modules/ensemble.html


In real projects we rarely rely on the simple models such as LR, kNN, or NB on their own; classic as they are, they are seldom competitive in engineering practice.

Today we focus on the ensemble methods, which see much heavier use in engineering work.


Ensemble methods combine the predictions of several estimators, via weighted or unweighted voting, to produce the final result. There are two main families:

  • In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

    Examples: Bagging methods, Forests of randomized trees, ...

  • By contrast, in boosting methods, base estimators are built sequentially, each one trying to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

    Examples: AdaBoost, Gradient Tree Boosting, ...


The rest of this post covers:

1. Bagging meta-estimator

Note the difference between bagging and boosting: bagging methods work best with strong and complex models (e.g., fully developed decision trees), whereas boosting methods usually work best with weak models (e.g., shallow decision trees).

Bagging variants differ in how they draw the random subsets: some draw random subsets of the samples, some draw random subsets of the features, some draw random subsets of both, and some sample with replacement (so individual samples or features may appear more than once).


scikit-learn offers a unified BaggingClassifier meta-estimator (resp. BaggingRegressor). The parameters max_samples and max_features control the size of the subsets, while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. A small example:

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)
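Continuing the snippet above, here is a minimal usage sketch that actually fits and scores the bagging ensemble. The dataset (make_classification with these settings) is purely illustrative and not from the original docs:

```python
# Fit the kNN bagging ensemble on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5,
                            random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))  # training accuracy
```

Each of the base kNN models is trained on a random half of the samples and a random half of the features, and their votes are averaged at predict time.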


2. Forests of randomized trees

Two algorithms: the RandomForest algorithm and the Extra-Trees method. In both, the final result is the average prediction of the individual classifiers. A simple example:

>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)

Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
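A small sketch of the multi-output case mentioned above, where Y has shape [n_samples, n_outputs] and a single forest predicts both output columns at once (the toy data here is my own, for illustration):

```python
# Multi-output classification: each row of Y holds two target columns.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
Y = [[0, 0], [1, 1], [0, 1], [1, 0]]  # two outputs per sample
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)
print(clf.predict([[0, 0]]).shape)  # one row, two predicted outputs
```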


RandomForest algorithm

Two classes handle classification and regression respectively: RandomForestClassifier and RandomForestRegressor. Samples are drawn with replacement (a bootstrap sample), and each split is chosen over a random subset of the features rather than all of them. Unlike the voting scheme of the original paper, scikit-learn produces the final result by averaging the probabilistic predictions of the individual classifiers.
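The probability-averaging behavior can be checked directly: averaging predict_proba over the fitted trees in estimators_ should reproduce the forest's own predict_proba. A small sketch on toy data of my own choosing:

```python
# Verify that the forest averages its trees' probabilistic predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Mean of the individual trees' class probabilities.
manual = np.mean([t.predict_proba(X) for t in clf.estimators_], axis=0)
print(np.allclose(manual, clf.predict_proba(X)))
```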

Extremely Randomized Trees 

Again two classes, for classification and regression: ExtraTreesClassifier and ExtraTreesRegressor. By default all samples are used (no bootstrap), but splits are still chosen over a random subset of the features, with the split thresholds themselves drawn at random as well.


A comparison example:
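The comparison figure from the original docs is not reproduced here; the sketch below is a hedged stand-in that compares cross-validated accuracy of a single decision tree against the two forest variants on a synthetic dataset (dataset and settings are illustrative, not the docs' exact setup):

```python
# Compare a single tree vs. RandomForest vs. ExtraTrees (illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

for name, model in [
    ("DecisionTree", DecisionTreeClassifier(random_state=0)),
    ("RandomForest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ExtraTrees", ExtraTreesClassifier(n_estimators=50, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```

Typically the two averaged ensembles score higher than the single tree, illustrating the variance-reduction argument from the start of the post.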
