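A long, long time ago I wrote about decision trees for you, a very simple and clear algorithm. Today I'm writing about random (survival) forests. A random forest is an ensemble model that combines many decision trees. The idea of pooling many base learners into one stronger learner, as random forests do, is a really good one, so today let's write about this class of algorithms.
Ensemble learning methods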
Ensemble learning methods are made up of a set of classifiers—e.g. decision trees—and their predictions are aggregated to identify the most popular result.
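An ensemble learning method takes many relatively simple learning algorithms and uses them together. For example, a single decision tree on its own looks rather limited and overfits easily, so I train many trees and combine their results, which should usually come out considerably better. Algorithms built on this idea are ensemble methods: many base learners are combined into one aggregate learner.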
Basically, a forest is an example of an ensemble, which is a special type of machine learning method that averages simple functions called base learners. The resulting averaged learner is called the ensemble.
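To get a feel for why combining base learners can help, here is a minimal sketch that computes how often a majority vote is correct. It rests on a strong assumption that is not in the text above: every base learner is right with the same probability and the learners make their mistakes independently. Real trees trained on the same data are correlated, so treat the numbers as an idealized upper bound. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_learners, p_correct):
    """Probability that a strict majority of the learners is correct.

    Assumption (for illustration only): each learner is correct with
    probability p_correct, independently of all the others.
    """
    needed = n_learners // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(needed, n_learners + 1)
    )


# Compare a single learner with ensembles of growing (odd) size.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under this assumption, a base learner that is only slightly better than a coin flip becomes more reliable as more of them vote. The gain shrinks when the learners are correlated, which is why methods such as bagging try to make the individual models differ from each other.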
The best-known ensemble learning methods are bagging and boosting:
The most well-known ensemble methods are bagging, also known as bootstrap aggregation, and boosting.
BAGGing
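BAGGing, or Bootstrap AGGregating, combines bootstrap sampling with aggregation of results. It has two steps. The first is bootstrap sampling: draw many datasets and train one model on each, which gives you many models. The second is to combine the results of all these models into a final result, which is more robust than the result of any single model.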
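The two steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not how any particular library implements bagging: the base learner is a one-feature decision stump I chose for brevity, the dataset is made up, and all function names (`bootstrap_sample`, `fit_stump`, `bagging_fit`, `bagging_predict`) are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Step 1a: draw len(rows) rows with replacement from the original data."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a decision stump (single threshold on x) to (x, label) rows.

    Tries every observed x as a threshold, in both orientations, and keeps
    the rule with the fewest training errors.
    """
    best = None
    for threshold, _ in rows:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagging_fit(rows, n_models=25, seed=0):
    """Step 1b: train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Step 2: aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Made-up toy data: label is 1 when x >= 5, otherwise 0.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging_fit(data)
print(bagging_predict(models, 1), bagging_predict(models, 8))
```

Majority voting is used here because the toy problem is classification; for regression the usual choice would be to average the predictions instead. An odd `n_models` avoids tied votes with two classes.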
In the bagging algorithm, the first step involves creating multiple models. These models are generated using the same algorithm with random sub-samples of the dataset which are drawn from the original dataset randomly with bootstrap sampling method.