Random Forests: the flagship of the bagging family

Random forests
For a deeper understanding of random forests, see this blog post.
Section 8.3 of Zhou Zhihua's Machine Learning also gives a good summary of bagging and random forests.

Key points about random forests:
Bagging draws n different training sets by sampling with replacement, then trains n models of the base classifier on them in parallel; the purpose is to lower the variance of the model.
On top of bagging, a random forest uses (unpruned) decision trees as base classifiers and additionally draws a random subset of candidate features at each split, which further improves the model's resistance to overfitting.
As the number of base classifiers grows, the generalization error of the random forest gradually drops below that of plain bagging; a minimal sketch contrasting the two follows this list.
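As that sketch (my own illustration, not from the original post), the snippet below fits plain bagging over unpruned decision trees and a random forest on the same synthetic data; the dataset sizes and hyperparameters are arbitrary assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; the sizes here are arbitrary, for illustration only
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

# Plain bagging: bootstrap resampling only; the default base estimator is an
# unpruned decision tree, and every split considers all 20 features
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: bootstrap resampling PLUS a random subset of the features
# ("sqrt" of them) considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)

print("bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())

With enough trees the forest usually matches or beats plain bagging on data like this, which is the tendency the last point above describes.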

Below is a comparison of the two big families of ensemble learning, boosting and bagging (the English passages are quoted from the scikit-learn documentation).

Two families of ensemble methods are usually distinguished:

In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, ...
    This is the biggest difference between the bagging and boosting algorithms!
    By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

    Examples: AdaBoost, Gradient Tree Boosting, ...

As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).
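As a rough illustration of that pairing (again my own sketch, not part of the quoted documentation), bagging is usually combined with fully grown trees while AdaBoost is usually combined with decision stumps; the numbers printed are only meant to show the two configurations side by side, not to be a benchmark.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Bagging over fully grown trees: each tree has low bias but high variance,
# and averaging many of them brings the variance down
bag = BaggingClassifier(n_estimators=100, random_state=0)

# AdaBoost over decision stumps (its default base estimator is a depth-1
# tree): each stump has high bias but low variance, and the sequential
# reweighting drives the bias down
boost = AdaBoostClassifier(n_estimators=100, random_state=0)

print("bagging + deep trees:", cross_val_score(bag, X, y, cv=5).mean())
print("adaboost + stumps   :", cross_val_score(boost, X, y, cv=5).mean())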

The different flavours of bagging algorithms!! (The sketch after this list shows how each one maps onto BaggingClassifier settings.)
When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [B1999].
When samples are drawn with replacement, then the method is known as Bagging [B1996].
When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [H1998].
Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [LG2012].
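These four variants map quite directly onto the max_samples, max_features and bootstrap parameters of scikit-learn's BaggingClassifier. The sketch below is my own reading of those parameters, with the 0.5 subset sizes chosen arbitrarily.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

variants = {
    # Pasting: random subsets of the samples, drawn without replacement
    "pasting": BaggingClassifier(max_samples=0.5, bootstrap=False,
                                 random_state=0),
    # Bagging: samples drawn with replacement (the default)
    "bagging": BaggingClassifier(max_samples=0.5, bootstrap=True,
                                 random_state=0),
    # Random Subspaces: random subsets of the features only
    "random subspaces": BaggingClassifier(max_features=0.5, bootstrap=False,
                                          random_state=0),
    # Random Patches: random subsets of both samples and features
    "random patches": BaggingClassifier(max_samples=0.5, max_features=0.5,
                                        random_state=0),
}

for name, clf in variants.items():
    print(name, clf.fit(X, y).score(X, y))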

Advantages of random forests:
1. It is an out-of-the-box algorithm: it does not depend heavily on its hyperparameters and can basically be used as-is.
That is why, when I first got into data mining, random forests felt wonderful: throw the data straight into the model, skip the preprocessing, do only minimal tuning (really just trying a few values), and you already get a reasonably good result.

2. Its other, and more important, use is feature selection: ranking features by their importance.

An example of using a random forest for feature selection (plotting the importance of each feature).
The example comes from scikit-learn; the link is:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

Note that sklearn ships two kinds of random forests: the ordinary random forest (RandomForestClassifier) and the extremely randomized forest (ExtraTreesClassifier); the extremely randomized forest tends to perform a little better.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 10 informative features
X, y = make_classification(n_samples=100000,
                           n_features=20,
                           n_informative=10,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)

forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

The resulting feature-importance plot:
[Figure: bar chart of the 20 feature importances, with the standard deviation across trees shown as error bars]
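As a small follow-up (not part of the original sklearn example), the fitted forest and X from the code above can be handed to sklearn's SelectFromModel to actually drop the low-importance features; the threshold="mean" setting is an arbitrary choice here.

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the mean importance
selector = SelectFromModel(forest, prefit=True, threshold="mean")
X_selected = selector.transform(X)
print("kept %d of %d features" % (X_selected.shape[1], X.shape[1]))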
