Mahout's BreimanExample runs the tests described in the paper:
Leo Breiman: Random Forests. Machine Learning 45(1): 5-32 (2001).
My analysis of it is divided into three parts:
- the Iteration part, where the forest is grown
- the test-execution part of BreimanExample
- the command-line invocation part
The Iteration part
The iteration function is shown below. Given the training data set data, it randomly splits data into a training set and a test set using the random generator rng, then grows random forests and measures their accuracy.
/**
 * runs one iteration of the procedure.
 *
 * @param rng
 *          random numbers generator
 * @param data
 *          training data
 * @param m
 *          number of random variables to select at each tree-node
 *          take m to be the first integer less than log2(M) + 1, where M is the number of attributes
 * @param nbtrees
 *          number of trees to grow
 */
private void runIteration(Random rng, Data data, int m, int nbtrees)
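The javadoc says m is taken as the first integer less than log2(M) + 1, where M is the number of attributes. A minimal sketch of that computation (the class name ComputeM and the value M = 60 are my own illustration, not Mahout code; for non-integer log2(M) this matches the floor interpretation):

```java
public class ComputeM {
    // m = floor(log2(M) + 1), i.e. the first integer below log2(M) + 1
    // whenever log2(M) is not itself an integer
    static int selectionSize(int nbAttributes) {
        return (int) Math.floor(Math.log(nbAttributes) / Math.log(2) + 1);
    }

    public static void main(String[] args) {
        System.out.println(selectionSize(60)); // log2(60) ≈ 5.91, so m = 6
    }
}
```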
1. Constructing the data sets
data is the input data set, but not all of it is used for training; it is divided into two parts:
The first part, train, is a clone of data. At first glance the clone looks pointless (isn't it just the same as data?), but it matters: the split on the next line removes instances from train, and cloning keeps the caller's data untouched.
The second part, test, is built by randomly taking about 10% of the instances out of train, removing them from train at the same time; this is done by the Data class's rsplit function.
Data train = data.clone();
Data test = train.rsplit(rng, (int) (data.size() * 0.1));
1.1 The rsplit function of the Data class
The Data class is located at org.apache.mahout.classifier.df.data.Data.
It has two member variables, documented as:
Holds a list of vectors and their corresponding Dataset
private final List<Instance> instances;
private final Dataset dataset;
The rsplit function is shown below. It randomly removes subsize instances from the instances list of the Data object and collects them in a newly built subset list, so those subsize instances disappear from instances.
Because both parts come from the same source, the returned Data object shares the same dataset as the Data object on which rsplit was called.
/**
 * Splits the data in two, returns one part, and this gets the rest of the data. <b>VERY SLOW!</b>
 */
public Data rsplit(Random rng, int subsize) {
  List<Instance> subset = Lists.newArrayListWithCapacity(subsize);
  for (int i = 0; i < subsize; i++) {
    subset.add(instances.remove(rng.nextInt(instances.size())));
  }
  return new Data(dataset, subset);
}
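To see the splitting logic in isolation, here is a standalone sketch of the same idea (RsplitSketch and the use of plain Integer values as instances are my own illustration, not Mahout classes): each draw removes a random element from the source list, so after the loop the caller keeps the remaining ~90%.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RsplitSketch {
    // Move `subsize` randomly chosen elements out of `instances` into a new
    // list; the source list shrinks accordingly, just like Data.rsplit.
    public static List<Integer> rsplit(List<Integer> instances, Random rng, int subsize) {
        List<Integer> subset = new ArrayList<>(subsize);
        for (int i = 0; i < subsize; i++) {
            // remove() both returns the picked element and deletes it from the
            // source; on an ArrayList this shifts elements, hence "VERY SLOW!"
            subset.add(instances.remove(rng.nextInt(instances.size())));
        }
        return subset;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        List<Integer> test = rsplit(data, new Random(42), 10); // hold out ~10%
        System.out.println(test.size() + " " + data.size());   // 10 90
    }
}
```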
1.2 TreeBuilder
Next, the builder for an individual decision tree and the builder for the forest are defined. The two classes are located at:
org.apache.mahout.classifier.df.builder.DefaultTreeBuilder
org.apache.mahout.classifier.df.ref.SequentialBuilder
/**
 * Builds a Decision Tree <br>
 * Based on the algorithm described in the "Decision Trees" tutorials by Andrew W. Moore, available at:<br>
 * <br>
 * http://www.cs.cmu.edu/~awm/tutorials
 * <br><br>
 * This class can be used when the criterion variable is the categorical attribute.
 */
DefaultTreeBuilder treeBuilder = new DefaultTreeBuilder();

/**
 * Builds a Random Decision Forest using a given TreeBuilder to grow the trees
 */
SequentialBuilder forestBuilder = new SequentialBuilder(rng, treeBuilder, train);
Then forestBuilder is used to grow a random forest:
/* grow a forest with m = log2(M)+1*/
treeBuilder.setM(m);
DecisionForest forestM = forestBuilder.build(nbtrees);
The build function of SequentialBuilder is shown below: it loops nbTrees times, growing one tree per pass via bagging, and records each tree's root node in trees.
public class SequentialBuilder {
  private final Bagging bagging;

  public DecisionForest build(int nbTrees) {
    List<Node> trees = Lists.newArrayList();
    for (int treeId = 0; treeId < nbTrees; treeId++) {
      trees.add(bagging.build(rng));
      logProgress(((float) treeId + 1) / nbTrees);
    }
    return new DecisionForest(trees);
  }
}
1.3 Bagging
So how does bagging build a tree?
As shown below, it first uses Data's bagging method to sample a training bag from the data, and then grows an ordinary decision tree on that bag.
/**
 * Builds one tree
 */
public Node build(Random rng) {
  log.debug("Bagging...");
  Arrays.fill(sampled, false);
  Data bag = data.bagging(rng, sampled);
  log.debug("Building...");
  return treeBuilder.build(rng, bag);
}
So how does the bagging sampling itself work?
As shown below: from a data set of N instances, draw N instances at random with replacement, so the same instance can be drawn several times. The sampled array may look odd at first, but it records which instances were drawn into the bag at least once; the instances never drawn (the out-of-bag instances) are the ones a random forest can use to estimate a tree's error without a separate test set.
/**
 * if data has N cases, sample N cases at random -but with replacement.
 *
 * @param sampled
 *          indicating which instance has been sampled
 *
 * @return sampled data
 */
public Data bagging(Random rng, boolean[] sampled) {
  int datasize = size();
  List<Instance> bag = Lists.newArrayListWithCapacity(datasize);
  for (int i = 0; i < datasize; i++) {
    int index = rng.nextInt(datasize);
    bag.add(instances.get(index));
    sampled[index] = true;
  }
  return new Data(dataset, bag);
}
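The sampling itself is easy to reproduce. Below is a standalone sketch (BaggingSketch and the Integer instances are my own illustration, not Mahout's classes) that draws N instances with replacement and then counts how many indices were never touched. With replacement, each instance is missed with probability (1 - 1/N)^N ≈ e^-1 ≈ 0.368, so roughly 37% of the data ends up out-of-bag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BaggingSketch {
    // Bootstrap sampling: draw N instances with replacement and mark
    // in `sampled` which indices were picked at least once.
    public static List<Integer> bagging(List<Integer> instances, Random rng, boolean[] sampled) {
        int datasize = instances.size();
        List<Integer> bag = new ArrayList<>(datasize);
        for (int i = 0; i < datasize; i++) {
            int index = rng.nextInt(datasize);
            bag.add(instances.get(index));
            sampled[index] = true; // this instance made it into the bag
        }
        return bag;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) data.add(i);
        boolean[] sampled = new boolean[data.size()];
        List<Integer> bag = bagging(data, new Random(1), sampled);

        int oob = 0;
        for (boolean s : sampled) if (!s) oob++;
        System.out.println(bag.size());               // 10000: the bag is as big as the data
        System.out.println(oob > 3000 && oob < 4400); // ~3679 out-of-bag expected
    }
}
```

The entries of sampled left false mark exactly the out-of-bag instances.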
Next, a second random forest is grown in the same way but with m = 1; recall that m is the number of attributes to select randomly at each node.
// grow a forest with m=1
treeBuilder.setM(1);
time = System.currentTimeMillis();
log.info("Growing a forest with m=1");
DecisionForest forestOne = forestBuilder.build(nbtrees);
sumTimeOne += System.currentTimeMillis() - time;
numNodesOne += forestOne.nbNodes();
1.4 Testing the forests' accuracy
For the two forests (m