Random Forest Algorithm

How to grow a Decision Tree

source : [3](3.html)

LearnUnprunedTree(X,Y)

Input: X, a matrix of R rows and M columns, where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y, a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: an unpruned decision tree

If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly. Likewise, if all values in Y are the same, return a Leaf Node predicting this value as the output.
Else
    select m variables at random out of the M variables
    For j = 1 .. m
        If the j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (the j'th attribute is real-valued)
            IG_j = IG(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If j* is categorical then
        For each value v of the j*'th attribute
            Let X^v = subset of rows of X in which X_ij* = v. Let Y^v = corresponding subset of Y
            Let Child^v = LearnUnprunedTree(X^v, Y^v)
        Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
    Else (j* is real-valued); let t be the best split threshold
        Let X^LO = subset of rows of X in which X_ij* <= t. Let Y^LO = corresponding subset of Y
        Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
        Let X^HI = subset of rows of X in which X_ij* > t. Let Y^HI = corresponding subset of Y
        Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
        Return a decision tree node, splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is above or below the given threshold.
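As a concrete illustration of this recursion, here is a minimal Python sketch. It is not the Mahout implementation: the dictionary-based tree representation, the helper names (entropy, ig_categorical, ig_real), and the early leaf return when none of the m sampled attributes yields any gain are assumptions made for this sketch.

```python
import math
import random
from collections import Counter

def entropy(y):
    """H(Y) = -sum_j p_j log2(p_j), computed from the class counts in y."""
    n = len(y)
    return -sum(c / n * math.log2(c / n) for c in Counter(y).values())

def ig_categorical(col, y):
    """IG(Y|X) for a categorical column: H(Y) - sum_v P(X=v) H(Y|X=v)."""
    n = len(y)
    cond = sum(len(sub) / n * entropy(sub)
               for v in set(col)
               for sub in [[yi for xi, yi in zip(col, y) if xi == v]])
    return entropy(y) - cond

def ig_real(col, y):
    """IG*(Y|X) = max_t IG(Y|X:t); returns (best gain, best threshold)."""
    best_gain, best_t = 0.0, None
    for t in sorted(set(col))[:-1]:                     # candidate thresholds
        lo = [yi for xi, yi in zip(col, y) if xi <= t]
        hi = [yi for xi, yi in zip(col, y) if xi > t]
        gain = entropy(y) - (len(lo) / len(y) * entropy(lo)
                             + len(hi) / len(y) * entropy(hi))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def learn_unpruned_tree(X, Y, m):
    """X is a list of rows; Y the class labels; m the number of attributes
    sampled at each node. Returns a nested dict representing the tree."""
    if len(set(Y)) == 1 or len({tuple(r) for r in X}) < 2:
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    candidates = random.sample(range(len(X[0])), m)     # m attributes at random
    scored = []
    for j in candidates:
        col = [row[j] for row in X]
        if isinstance(col[0], (int, float)):            # real-valued attribute
            gain, t = ig_real(col, Y)
        else:                                           # categorical attribute
            gain, t = ig_categorical(col, Y), None
        scored.append((gain, j, t))
    gain, j, t = max(scored)                            # the splitting attribute j*
    if gain <= 0:          # safeguard not in the pseudocode: no informative split
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    if t is None:                                       # categorical split
        children = {}
        for v in {row[j] for row in X}:
            idx = [i for i, row in enumerate(X) if row[j] == v]
            children[v] = learn_unpruned_tree([X[i] for i in idx],
                                              [Y[i] for i in idx], m)
        return {"attr": j, "children": children}
    lo = [i for i, row in enumerate(X) if row[j] <= t]  # real-valued split
    hi = [i for i, row in enumerate(X) if row[j] > t]
    return {"attr": j, "threshold": t,
            "lo": learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], m),
            "hi": learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], m)}
```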

Note: there are alternatives to Information Gain for splitting nodes, such as the Gini index.

Information gain

source : [3](3.html)

Nominal attributes

Suppose X can have one of m values V_1, V_2, ..., V_m, with P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_m)=p_m.

H(X) = -sum_{j=1..m} p_j log2 p_j   (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=v_j)
IG(Y|X) = H(Y) - H(Y|X)
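To make the definitions concrete, here is a small worked example in Python; the weather/label data below are invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: X = weather of each record, Y = class label.
X = ["sunny", "sunny", "rain", "rain", "rain", "overcast", "overcast", "sunny"]
Y = ["no",    "no",    "yes",  "yes",  "no",   "yes",      "yes",      "yes"]

def H(labels):
    """Entropy: -sum_j p_j * log2(p_j) over the label frequencies."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# H(Y|X) = sum_v P(X=v) * H(Y|X=v)
h_y_given_x = sum(X.count(v) / len(X) * H([y for x, y in zip(X, Y) if x == v])
                  for v in set(X))

print("H(Y)    =", H(Y))                # entropy of the labels
print("IG(Y|X) =", H(Y) - h_y_given_x)  # information gain of the attribute
```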

Real-valued attributes

Suppose X is real-valued.

Define IG(Y|X:t) = H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
Define IG*(Y|X) = max_t IG(Y|X:t)
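A small Python sketch of this maximisation over thresholds; the attribute values, labels, and the helper name ig_at_threshold are invented for illustration. Note that this definition splits on X<t versus X>=t, whereas the tree pseudocode above splits on <=t versus >t.

```python
import math
from collections import Counter

def H(labels):
    """Entropy of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def ig_at_threshold(x, y, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    lo = [yi for xi, yi in zip(x, y) if xi < t]
    hi = [yi for xi, yi in zip(x, y) if xi >= t]
    n = len(y)
    return H(y) - (len(lo) / n * H(lo) + len(hi) / n * H(hi))

# IG*(Y|X): take the maximum over candidate thresholds drawn from the data.
x = [2.1, 3.5, 4.0, 5.2, 6.8, 7.1]   # hypothetical real-valued attribute
y = ["a", "a", "a", "b", "b", "b"]   # hypothetical class labels
best_t = max(sorted(set(x))[1:], key=lambda t: ig_at_threshold(x, y, t))
print("best threshold:", best_t, "IG*:", ig_at_threshold(x, y, best_t))
```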

How to grow a Random Forest

source : [1](1.html)

Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible. There is no pruning. (A sketch of this procedure is given after this list.)
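A minimal sketch of the forest-growing loop, assuming a tree learner with the signature used in the LearnUnprunedTree sketch above; the function and variable names here are illustrative, not Mahout's.

```python
import random
from collections import Counter

def bootstrap_sample(X, Y, rng):
    """Step 1: sample N cases with replacement; also return the out-of-bag indices."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(idx)
    return [X[i] for i in idx], [Y[i] for i in idx], oob

def grow_forest(X, Y, n_trees, m, learn_tree, seed=0):
    """Grow n_trees unpruned trees, each on its own bootstrap sample (steps 1-3).
    learn_tree(X, Y, m) is assumed to pick m attributes at random at every node,
    as in LearnUnprunedTree; m is held constant for the whole forest."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        Xb, Yb, oob = bootstrap_sample(X, Y, rng)
        forest.append((learn_tree(Xb, Yb, m), oob))   # keep oob indices for later
    return forest

# Tiny demo with a stand-in learner that just memorises the majority class.
dummy_learn = lambda X, Y, m: Counter(Y).most_common(1)[0][0]
forest = grow_forest([[0], [1], [2], [3]], ["a", "a", "b", "b"],
                     n_trees=5, m=1, learn_tree=dummy_learn)
print(len(forest), forest[0][1])   # 5 (tree, oob-indices) pairs
```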

Random Forest parameters

source : [2](2.html)

Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables. Breiman's recommendation is to pick a large number of trees, and to use the square root of the number of variables for m.
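For illustration only, these two choices map directly onto the parameters of scikit-learn's RandomForestClassifier (a separate implementation, shown here as an example of applying the recommendation, not as the Mahout API):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,      # "pick a large number of trees"
    max_features="sqrt",   # m = square root of the number of variables
    oob_score=True,        # also compute the out-of-bag estimate (see below)
)
```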

How to predict the label of a case

Classify(node, V)
    Input: node, a node from the decision tree; if node.attribute = j then the split is done on the j'th attribute
    Input: V, a vector of M columns where V_j = the value of the j'th attribute
    Output: label of V

    If node is a Leaf then
        Return the value predicted by node
    Else
        Let j = node.attribute
        If the j'th attribute is categorical then
            Let v = V_j
            Let child^v = child node corresponding to the attribute's value v
            Return Classify(child^v, V)
        Else (the j'th attribute is real-valued)
            Let t = node.threshold (split threshold)
            If V_j <= t then
                Let child^LO = child node corresponding to (<= t)
                Return Classify(child^LO, V)
            Else
                Let child^HI = child node corresponding to (> t)
                Return Classify(child^HI, V)
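The same descent in Python, assuming the dictionary tree representation and the (tree, oob) pairs from the sketches above (again an illustration, not Mahout code); the forest's prediction is the majority vote of its trees:

```python
from collections import Counter

def classify(node, V):
    """Walk the nested-dict tree produced by learn_unpruned_tree above."""
    if "leaf" in node:
        return node["leaf"]
    j = node["attr"]
    if "threshold" in node:                       # real-valued split
        branch = "lo" if V[j] <= node["threshold"] else "hi"
        return classify(node[branch], V)
    # Categorical split; a value never seen during training raises KeyError here.
    return classify(node["children"][V[j]], V)

def classify_forest(forest, V):
    """Predict the label of case V by majority vote over the trees."""
    votes = [classify(tree, V) for tree, _oob in forest]
    return Counter(votes).most_common(1)[0][0]
```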

The out of bag (oob) error estimation

source : [1](1.html)

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:

  • each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
  • put each case left out of the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the oob error estimate. This has proven to be unbiased in many tests. (A sketch of this computation follows below.)
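A minimal Python sketch of the oob computation, assuming the (tree, oob-index-set) pairs returned by the grow_forest sketch and the classify function from the previous section; the names are illustrative.

```python
from collections import Counter

def oob_error(forest, X, Y, classify):
    """For each case n, collect votes only from trees whose bootstrap sample
    left n out, take the majority class j, and count how often j differs
    from the true class; the disagreement rate is the oob error estimate."""
    wrong = counted = 0
    for n, (row, true_label) in enumerate(zip(X, Y)):
        votes = [classify(tree, row) for tree, oob in forest if n in oob]
        if not votes:              # case n appeared in every bootstrap sample
            continue
        counted += 1
        if Counter(votes).most_common(1)[0][0] != true_label:
            wrong += 1
    return wrong / counted if counted else float("nan")
```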

Other RF uses

source : [1](1.html)

  • variable importance
  • gini importance
  • proximities
  • scaling
  • prototypes
  • missing values replacement for the training set
  • missing values replacement for the test set
  • detecting mislabeled cases
  • detecting outliers
  • detecting novelties
  • unsupervised learning
  • balancing prediction error

Please refer to [1](1.html) for a detailed description of each of these uses.

References

[1](1.html)   Random Forests - Classification Description. Available online: Random forests - classification description
[2](2.html)   B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration 04/282, Ghent University, Belgium. Available online: Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques
[3](3.html)   Decision Trees - Andrew W. Moore. http://www.cs.cmu.edu/~awm/tutorials
[4](4.html)   Information Gain - Andrew W. Moore. http://www.cs.cmu.edu/~awm/tutorials

Copyright © 2014-2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
