Machine Learning Methods (6): Random Forest, bagging

Latest recommended article published 2023-11-18 19:25:37

weixin_30698527

Original link: http://www.cnblogs.com/yihaha/p/7265313.html

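Reposting is welcome; please credit the source: this article comes from Bin's column, blog.csdn.net/xbinworld.
Technical discussion QQ group: 433250724. Anyone interested in algorithms and technology is welcome to join.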

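The earlier post, Machine Learning Methods (4): Decision Trees, covered the classic decision tree algorithms. We noted there that a decision tree overfits easily, because it chooses every attribute split by the best-split criterion on the training set; the result tends to perform well on the train data but poorly on the test data. The random forest algorithm is, in essence, an ensemble method that can effectively reduce overfitting, and this post explains it in detail.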

Background

Decision trees are a popular method for various machine learning tasks. Tree learning “comes closest to meeting the requirements for serving as an off-the-shelf procedure for data mining”, say Hastie et al.[1], because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate.

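First, the advantages of a decision tree [2]: (1) it is invariant to scaling of the feature values; (2) it is robust to irrelevant features; (3) it produces an inspectable model.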

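However, a decision tree is often not accurate enough, because it overfits easily: a very deep tree tends to have low bias and high variance. Random Forest averages many decision trees, which markedly lowers the variance and therefore reduces overfitting. The price of RF is a slight increase in bias and some loss of model interpretability, but the payoff is a significant gain in accuracy.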
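The variance claim can be made concrete with a standard identity. This is an added illustration rather than part of the original post, and it assumes that the B tree predictions T_1, …, T_B are identically distributed with variance σ² and pairwise correlation ρ (the symbols σ² and ρ are introduced here only for this sketch):

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
= \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right)
= \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2
$$

Under these assumptions the second term shrinks as B grows, while the first term stays. Averaging therefore helps most when the individual trees are only weakly correlated.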

bagging

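bagging [4], also known as bootstrap aggregating, is a very simple and general ensemble learning algorithm. RF relies on bagging, but any other classification or regression algorithm can use bagging as well, to reduce over-fitting (that is, to lower the variance of the model).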

Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

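Put simply: draw several smaller sets from the original training set by sampling with replacement, train a model on each of them, and then average the outputs of all models (regression) or take a vote (classification).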
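The procedure can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the author's implementation: the function names are made up for this example, and `fit_1nn` is a hypothetical stand-in for whatever base learner is actually used.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, n_prime, rng):
    """Draw n_prime items from data uniformly, with replacement."""
    return [rng.choice(data) for _ in range(n_prime)]


def bagging_fit(data, fit, m, n_prime=None, seed=0):
    """Fit m models, each on its own bootstrap sample of size n_prime."""
    rng = random.Random(seed)
    n_prime = len(data) if n_prime is None else n_prime
    return [fit(bootstrap_sample(data, n_prime, rng)) for _ in range(m)]


def predict_average(models, x):
    """Regression: average the outputs of all models."""
    return mean(model(x) for model in models)


def predict_vote(models, x):
    """Classification: return the label predicted by the most models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (x, y) pairs."""
    def model(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return model


# Toy 1-D data set (an assumption for this sketch): label is 1 when x >= 0.5.
data = [(i / 10, int(i >= 5)) for i in range(10)]
models = bagging_fit(data, fit_1nn, m=25, seed=0)
low_label = predict_vote(models, 0.05)
high_label = predict_vote(models, 0.95)
```

Each model sees a different resampled view of the data, so their individual errors differ; the vote or the average is what smooths them out.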

The expected number of distinct samples in each bootstrap set satisfies the following property [3]:

when drawing with replacement n′ values out of a set of n (different and equally likely), the expected number of unique draws is

$$n\left(1 - e^{-n'/n}\right).$$
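A quick numerical check of this expression, as a sketch with made-up function names. The exact expectation is n(1 − (1 − 1/n)^{n′}), because each item is missed by all n′ draws with probability (1 − 1/n)^{n′}; the exponential form quoted above is its large-n approximation. With n′ = n, roughly 63.2% of the original samples are expected to appear in a bootstrap sample.

```python
import math
import random


def expected_unique_exact(n, n_prime):
    """Exact expectation: an item is never drawn with probability (1 - 1/n)**n_prime."""
    return n * (1 - (1 - 1 / n) ** n_prime)


def expected_unique_approx(n, n_prime):
    """Large-n approximation: n * (1 - exp(-n_prime / n))."""
    return n * (1 - math.exp(-n_prime / n))


def simulate_unique(n, n_prime, trials=2000, seed=0):
    """Monte Carlo estimate of the mean number of distinct items drawn."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(n) for _ in range(n_prime)})
    return total / trials


exact = expected_unique_exact(1000, 1000)
approx = expected_unique_approx(1000, 1000)
simulated = simulate_unique(100, 100)
```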

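Back to the random forest algorithm: given a training set {X, Y} with n samples,
for b = 1, …, B:
1. Sample n examples from X with replacement (keeping their labels) to form the set {X_b, Y_b};
2. On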
