Random Forests: the flagship of the bagging family

Random forests
For a deeper understanding of random forests, see this blog post.
Section 8.3 of Zhou Zhihua's Machine Learning also gives a good summary of bagging and random forests.

Key points about random forests:
Bagging draws n different training sets by sampling with replacement, then trains n models of the base classifier on them in parallel; the purpose is to lower the variance of the model.
On top of bagging, a random forest uses (unpruned) decision trees as base classifiers and additionally draws a random subset of candidate features at each split, which further improves the model's resistance to overfitting.
As the number of base classifiers grows, the generalization error of the random forest gradually drops below that of plain bagging; a minimal sketch contrasting the two follows this list.
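As that sketch (my own illustration, not from the original post), the snippet below fits plain bagging over unpruned decision trees and a random forest on the same synthetic data; the dataset sizes and hyperparameters are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; the sizes here are arbitrary, for illustration only
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

# Plain bagging: bootstrap resampling only; the default base estimator is an
# unpruned decision tree, and every split considers all 20 features
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: bootstrap resampling PLUS a random subset of the features
# ("sqrt" of them) considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)

print("bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())

With enough trees the forest usually matches or beats plain bagging on data like this, which is the tendency the last point above describes.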

Below is a comparison of the two big families of ensemble learning, boosting and bagging (the English passages are quoted from the scikit-learn documentation).

Two families of ensemble methods are usually distinguished:

In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, ...
    This is the biggest difference between the bagging and boosting algorithms!
    By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

    Examples: AdaBoost, Gradient Tree Boosting, ...

As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).
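As a rough illustration of that pairing (again my own sketch, not part of the quoted documentation), bagging is usually combined with fully grown trees while AdaBoost is usually combined with decision stumps; the numbers printed are only meant to show the two configurations side by side, not to be a benchmark.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging over fully grown trees: each tree has low bias but high variance,
# and averaging many of them brings the variance down
bag = BaggingClassifier(n_estimators=100, random_state=0)

# AdaBoost over decision stumps (its default base estimator is a depth-1
# tree): each stump has high bias but low variance, and the sequential
# reweighting drives the bias down
boost = AdaBoostClassifier(n_estimators=100, random_state=0)

print("bagging + deep trees:", cross_val_score(bag, X, y, cv=5).mean())
print("adaboost + stumps   :", cross_val_score(boost, X, y, cv=5).mean())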

The different flavours of bagging algorithms!! (The sketch after this list shows how each one maps onto BaggingClassifier settings.)
When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [B1999].
When samples are drawn with replacement, then the method is known as Bagging [B1996].
When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [H1998].
Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [LG2012].
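These four variants map quite directly onto the max_samples, max_features and bootstrap parameters of scikit-learn's BaggingClassifier. The sketch below is my own reading of those parameters, with the 0.5 subset sizes chosen arbitrarily.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

variants = {
    # Pasting: random subsets of the samples, drawn without replacement
    "pasting": BaggingClassifier(max_samples=0.5, bootstrap=False,
                                 random_state=0),
    # Bagging: samples drawn with replacement (the default)
    "bagging": BaggingClassifier(max_samples=0.5, bootstrap=True,
                                 random_state=0),
    # Random Subspaces: random subsets of the features only
    "random subspaces": BaggingClassifier(max_features=0.5, bootstrap=False,
                                          random_state=0),
    # Random Patches: random subsets of both samples and features
    "random patches": BaggingClassifier(max_samples=0.5, max_features=0.5,
                                        random_state=0),
}

for name, clf in variants.items():
    print(name, clf.fit(X, y).score(X, y))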

Advantages of random forests:
1. It is an out-of-the-box algorithm: it does not depend heavily on its hyperparameters and can basically be used as-is.
That is why, when I first got into data mining, random forests felt wonderful: throw the data straight into the model, skip the preprocessing, do only minimal tuning (really just trying a few values), and you already get a reasonably good result.

2. Its other, and more important, use is feature selection: ranking features by their importance.

An example of using a random forest for feature selection (plotting the importance of each feature).
The example comes from scikit-learn; the link is:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

Note that sklearn ships two kinds of random forests: the ordinary random forest (RandomForestClassifier) and the extremely randomized forest (ExtraTreesClassifier); the extremely randomized forest tends to perform a little better.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 10 informative features
X, y = make_classification(n_samples=100000,
                           n_features=20,
                           n_informative=10,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

The resulting feature-importance plot:
[Figure: bar chart of the 20 feature importances, with the standard deviation across trees shown as error bars]
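As a small follow-up (not part of the original sklearn example), the fitted forest and X from the code above can be handed to sklearn's SelectFromModel to actually drop the low-importance features; the threshold="mean" setting is an arbitrary choice here.

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_selected = selector.transform(X)
print("kept %d of %d features" % (X_selected.shape[1], X.shape[1]))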
