Weka J48决策树算法(C4.5)源码学习

最新推荐文章于 2021-02-25 20:23:34 发布

三少GG

最新推荐文章于 2021-02-25 20:23:34 发布

阅读量1.1w

点赞数 1

分类专栏：云计算/大数据

本文链接：https://blog.csdn.net/pkueecser/article/details/17419995

版权

本文详细介绍了Weka中J48决策树算法（C4.5）的源码学习，涵盖了从J48.java开始的代码流程，包括模型学习、数据分割、C45Split的选择与评估，以及分类实例的处理。文章通过代码片段展示了决策树构建的过程，如C45ModelSelection、ClassifierSplitModel和C45Split等关键类的功能。同时，文章还提及了与之相关的熵基分裂准则和信息增益计算。

摘要由CSDN通过智能技术生成

代码下载：

http://weka.wikispaces.com/Subversion

Use WEKA in your Java Code:

http://www.cs.umb.edu/~ding/history/310_fall_2010/homework/UseWEKA_In_Java_Code.pdf

J48 C4.5决策树算法源码学习

TODO: J48 的分类效率分析。

题记：之前虽然对 J48 用得比较多，是由于它能方便的区别特征的好坏。工作了，希望自己能更深入，如是开始了这个算法学习系列。希望和大家共同进步。

个人对看算法源代码也没有很好的流程，计划先采用按类Class 做架构介绍；再深入代码具体逻辑的方式展开。欢迎大家提出好的算法源码阅读流程。

另外，求推荐LR 的比较好的实现代码 ~(^o^)~

一、准备工作。

下载 weka 的工具包，将 weka.jar 和 weka-src.jar 导入eclipse 项目的依赖包，即可查看到源码。

也可以将weka-src.jar 解压，在对应的文件夹下建立一个单独的eclipse 项目，这样可以自己修改代码。

相关论文参考： Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

因为还是对 E文不友好，个人看了统计学习方法的对应章节就算是把理论先过完啦。

熵用来计算随机量的分布不确定性，或者其表达能力。计算公式：

信息增益 Information Gain 又称为互信息 Multi Information, 体现了新的属性的出现对类别不确定的弱化的能力。计算公式：

信息增益率在分类问题中体现为：特征A对类别C的信息增益与特征A的熵之比。能在一定程度上修正某个属性ai由于分布很广而Gain更大的影响；但是也可能导致过分修正而选择那些属性分布很窄的项。可以采用两步判别法：信息增益超过平均值的属性，其次再比较信息增益率。

C4.5 决策树的学习算法：

二、代码流解析：

模型的学习程序从 J48.java 开始。

J48.buildClassifier(ins): 选取 C45 决策树模型为例：

[java]view plaincopy 
   
 modSelection = new C45ModelSelection(m_minNumObj, instances);  
 m_root = new C45PruneableClassifierTree(modSelection, !m_unpruned, m_CF,  
                         m_subtreeRaising, !m_noCleanup);  
 m_root.buildClassifier(instances);  

将C45Pruneable*.buildClassifier(ins) 继续展开：

[java] view plain copy

data.deleteWithMissingClass();



buildTree(data, m_subtreeRaising || !m_cleanup);

collapse();

if (m_pruneTheTree) {

    prune();

}

if (m_cleanup) {

    cleanup(new Instances(data, 0));

}

对基类ClassifierTree. buildTree()继续展开：

调用 modSelection.selectModel(ins);

modSelection.split(ins). // 分割数据

m_sons[i] = getNewTree(localInstances[i]); // 构建子树

将 C45ModelSelection.selectModel(ins) 继续展开：

[java] view plain copy

if (Utils.sm(checkDistribution.total(), 2 * m_minNoObj)

        || Utils.eq(checkDistribution.total(), checkDistribution.perClass(checkDistribution.maxClass())))

    return noSplitModel;

multiValue = !(attribute.isNumeric() || attribute.numValues() < (0.3 * m_allData.numInstances()));

currentModel = new C45Split[data.numAttributes()];

sumOfWeights = data.sumOfWeights();



// For each attribute.

for (i = 0; i < data.numAttributes(); i++) {



    // Apart from class attribute.

    if (i != (data).classIndex()) {



        // Get models for current attribute.

        currentModel[i] = new C45Split(i, m_minNoObj, sumOfWeights);

        currentModel[i].buildClassifier(data);



        // ... 省略代码部分：更新 averageInfoGain的总和

    } else

        currentModel[i] = null;

}



averageInfoGain = averageInfoGain / (double) validModels;



// Find "best" attribute to split on.

minResult = 0;

for (i = 0; i < data.numAttributes(); i++) {



        // Use 1E-3 here to get a closer approximation to the original implementation.

        if ((currentModel[i].infoGain() >= (averageInfoGain - 1E-3))

                && Utils.gr(currentModel[i].gainRatio(), minResult)) {

            bestModel = currentModel[i];

            minResult = currentModel[i].gainRatio();

        }

}

  </