Weka Algorithms: Classifier-tree-RandomForest Source Code Analysis (2): Implementation

Latest recommended article published on 2024-06-20 17:37:06

smilehehe110


The implementation of RandomForest is remarkably simple, simpler than I expected: Weka builds it by combining Bagging with RandomTree.

1. Training RandomForest

The code that builds a RandomForest is as follows:

 public void buildClassifier(Instances data) throws Exception {  
   
   // can classifier handle the data?  
   getCapabilities().testWithFail(data);  
   
   // remove instances with missing class  
   data = new Instances(data);  
   data.deleteWithMissingClass();  
   
   m_bagger = new Bagging();  
   RandomTree rTree = new RandomTree();  
   
   // set up the random tree options  
   m_KValue = m_numFeatures;  
   if (m_KValue < 1)  
     m_KValue = (int) Utils.log2(data.numAttributes()) + 1;  
   rTree.setKValue(m_KValue);  
   rTree.setMaxDepth(getMaxDepth());  
   
   // set up the bagger and build the forest  
   m_bagger.setClassifier(rTree);  
   m_bagger.setSeed(m_randomSeed);  
   m_bagger.setNumIterations(m_numTrees);  
   m_bagger.setCalcOutOfBag(true);  
   m_bagger.buildClassifier(data);  
 }  

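The code is easy to follow. It first checks that the classifier can handle the data and removes instances whose class value is missing. It then creates a Bagging instance and a RandomTree, and sets the tree's K value, that is, the number of randomly chosen attributes the tree considers at each split. If m_numFeatures is less than 1, K falls back to (int) log2(numAttributes) + 1. After setting the maximum depth, it passes the RandomTree to Bagging as the base classifier, sets the random seed and the number of iterations (one per tree), turns on the out-of-bag calculation, and finally calls Bagging's buildClassifier to do the training.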
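To make the default K concrete, here is a minimal stand-alone sketch of that fallback. The class and method names (KValueSketch, effectiveK) are hypothetical and not part of Weka; the sketch assumes that Utils.log2 is the natural logarithm divided by ln 2, and it only reproduces the arithmetic, not the training.

```java
// Hypothetical helper for illustration only; it is not part of Weka.
class KValueSketch {

  // Mirrors the fallback in buildClassifier: when fewer than 1 feature is
  // requested, use (int) log2(numAttributes) + 1; otherwise keep the request.
  static int effectiveK(int requestedFeatures, int numAttributes) {
    if (requestedFeatures >= 1) {
      return requestedFeatures;
    }
    // The cast binds tighter than '+': log2 is truncated first, then 1 is added.
    return (int) (Math.log(numAttributes) / Math.log(2)) + 1;
  }

  public static void main(String[] args) {
    System.out.println(effectiveK(0, 10)); // log2(10) is about 3.32, so this prints 4
    System.out.println(effectiveK(3, 10)); // an explicit request is kept, prints 3
  }
}
```

Note that in the original code the count comes from data.numAttributes(), which includes the class attribute, so a dataset with 9 predictors and one class attribute is the 10-attribute case above.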

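2. Classification with RandomForest

With training covered, let's look at classification, that is, the classifyInstance function. Note that RandomForest extends Classifier but does not override classifyInstance; it uses the version in the base class Classifier. What it does override is distributionForInstance, which the base class's classifyInstance calls and which returns the probability of an instance for every class. The code is as follows: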

 public double[] distributionForInstance(Instance instance) throws Exception {  
   
   return m_bagger.distributionForInstance(instance);  
 }  

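As you can see, computing the distribution of a given instance over the classes is delegated to the bagger (how lazy), so I won't go into detail here; the detailed analysis is left for the post on Bagging.

Next, let's see how the base class Classifier uses the distribution to produce the classification result.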
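As a rough picture of what such a delegation could amount to for a nominal class, the sketch below sums the per-tree class distributions and normalizes the result. This is an assumption made for illustration, not something shown in the code above; the class name AverageDistributionSketch and the method combine are hypothetical, and Bagging's real code should be checked when it is analyzed.

```java
// Hypothetical illustration; it does not call Weka and is not Bagging's code.
class AverageDistributionSketch {

  // perTree[t][c] is the probability that tree t assigns to class c.
  static double[] combine(double[][] perTree) {
    int numClasses = perTree[0].length;
    double[] sums = new double[numClasses];

    // Add up the votes of all trees, class by class.
    for (double[] dist : perTree) {
      for (int c = 0; c < numClasses; c++) {
        sums[c] += dist[c];
      }
    }

    // Normalize so the result sums to 1, unless every entry is zero.
    double total = 0;
    for (double s : sums) {
      total += s;
    }
    if (total == 0) {
      return sums;
    }
    for (int c = 0; c < numClasses; c++) {
      sums[c] /= total;
    }
    return sums;
  }

  public static void main(String[] args) {
    // Three trees, two classes: two trees vote for class 0, one for class 1.
    double[] d = combine(new double[][] {{1, 0}, {0, 1}, {1, 0}});
    System.out.println(d[0] + " " + d[1]);
  }
}
```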

 public double classifyInstance(Instance instance) throws Exception {  
   
   double[] dist = distributionForInstance(instance);  
   if (dist == null) {  
     throw new Exception("Null distribution predicted");  
   }  
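   switch (instance.classAttribute().type()) {
   case Attribute.NOMINAL:
     // nominal class: return the index of the most probable class
     double max = 0;
     int maxIndex = 0;

     for (int i = 0; i < dist.length; i++) {
       // strict '>' means the first index wins when probabilities tie
       if (dist[i] > max) {
         maxIndex = i;
         max = dist[i];
       }
     }
     if (max > 0) {
       return maxIndex;
     } else {
       // every probability is zero: no prediction can be made
       return Instance.missingValue();
     }
   case Attribute.NUMERIC:
   case Attribute.DATE:
     // numeric or date class: the predicted value is stored in dist[0]
     return dist[0];
   default:
     return Instance.missingValue();
   }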
 }
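
The nominal branch is a plain arg-max over the distribution, with two details worth noting: ties go to the lowest class index, and an all-zero distribution yields a missing value instead of class 0. The stand-alone sketch below reproduces only that branch so the behaviour can be tried without Weka. The names ArgMaxSketch and predictNominal are hypothetical, and Double.NaN stands in for Instance.missingValue().

```java
// Hypothetical illustration of the NOMINAL branch; it does not use Weka.
class ArgMaxSketch {

  // Returns the index of the largest probability, or NaN (used here as the
  // "missing" marker) when no entry is greater than zero.
  static double predictNominal(double[] dist) {
    double max = 0;
    int maxIndex = 0;
    for (int i = 0; i < dist.length; i++) {
      if (dist[i] > max) { // strict comparison: the first maximum is kept
        maxIndex = i;
        max = dist[i];
      }
    }
    return max > 0 ? maxIndex : Double.NaN;
  }

  public static void main(String[] args) {
    System.out.println(predictNominal(new double[] {0.2, 0.5, 0.3})); // prints 1.0
    System.out.println(predictNominal(new double[] {0.5, 0.5}));      // prints 0.0
    System.out.println(predictNominal(new double[] {0.0, 0.0}));      // prints NaN
  }
}
```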