C4.5 Code Analysis


hacwalker


Category: Data Mining. Tags: C4.5 algorithm

Article link: https://blog.csdn.net/hack_net/article/details/8697795


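This article analyses the C4.5 source code in detail, focusing on how the data is read in and on the decision tree that is produced. The .names file uses '|' and ',' as separators and stores the class names, attribute names and attribute values. In the code, class names keep being read as long as the delimiter that follows a name is ','. An example shows the structure of the decision tree, including the subtrees under outlook such as humidity. Finally, it covers the unpruned and pruned decision tree files that C4.5 writes when it runs, and how their size and error rate are compared during evaluation.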

Focus of the analysis: we mainly look at how this code reads in the data.

GetNames():

strcpy(Fn, FileName);
strcat(Fn, ".names");

First the file name is built by appending ".names" to the file stem; the resulting .names file is the one that gets read.
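The strcpy/strcat pair above trusts that Fn is large enough for the stem plus the suffix. As a minimal sketch of the same step with a length check, assuming a hypothetical helper build_names_path (not part of C4.5) and a caller-supplied buffer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<stem>.names" into out, which holds n bytes.
   Returns 0 on success, -1 if the result would not fit. */
int build_names_path(const char *stem, char *out, size_t n)
{
    /* snprintf returns the length it wanted to write, excluding the '\0' */
    int written = snprintf(out, n, "%s.names", stem);

    return (written < 0 || (size_t) written >= n) ? -1 : 0;
}
```

Called with the stem "golf" and a 32-byte buffer, this leaves "golf.names" in the buffer, which can then be passed to fopen.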

Pay particular attention to the format it reads:

while ( ( c = getc(f) ) == '|' || Space(c) )
{
    if ( c == '|' ) SkipComment;
}

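This shows that '|' marks a comment in the .names file: before a name is read, whitespace is skipped, and whenever a '|' is met the comment that follows it is skipped as well. Comments therefore usually sit at the top of the file, ahead of the names.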
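To experiment with this skipping rule without the C4.5 sources, here is a small sketch that works on an in-memory string instead of a FILE. The function name skip_blank_and_comments is hypothetical, and it assumes that SkipComment discards everything up to the end of the current line.

```c
#include <ctype.h>

/* Hypothetical helper: skip whitespace and '|' comments and return a
   pointer to the first significant character (or to the final '\0'). */
const char *skip_blank_and_comments(const char *p)
{
    while (*p == '|' || isspace((unsigned char) *p)) {
        if (*p == '|') {
            /* assumed comment rule: ignore the rest of this line */
            while (*p != '\0' && *p != '\n') p++;
        } else {
            p++;
        }
    }
    return p;
}
```

Given a text that starts with a few comment lines and blank lines followed by "Play, Don't Play.", the returned pointer lands on the 'P' of the first class name.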

/* Read the names of classes, attributes and legal attribute values.     */
/* On completion, these names are stored in:                    */
/*     ClassName     - class names                         */
/*     AttName          - attribute names                    */
/*     AttValName     - attribute value names               */
/* with:                                        */
/*     MaxAttVal     - number of values for each attribute          */
/*                                             */
/* Other global variables set are:                         */
/*     MaxAtt          - maximum attribute number               */
/*     MaxClass     - maximum class number                    */
/*     MaxDiscrVal     - maximum discrete values for any attribute     */
/*                                             */
/* Note: until the number of attributes is known, the name          */
/*        information is assembled in local arrays               */
/*                                             */

do
{
     ReadName(Nf, Buffer);

     if ( ++MaxClass >= ClassCeiling)
     {
         ClassCeiling += 100;
         ClassName = (String *) realloc(ClassName, ClassCeiling*sizeof(String));
     }
     ClassName[MaxClass] = CopyString(Buffer);
}
while ( Delimiter == ',' );

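How is the loop condition organised? It uses Delimiter == ',' as the test: after each class name is stored, the loop repeats only if that name was followed by a comma, so the list of class names ends at the first name that is terminated by something else.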
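The same do/while shape can be tried out on a plain string. The sketch below is not the C4.5 code: read_one_name, read_class_names and free_names are hypothetical names, the input is a string rather than a FILE, and the rule that a name ends at ',', ':', a newline, or a '.' followed by whitespace or end of input is an assumption about how ReadName behaves. The array grows in steps of 100, mirroring ClassCeiling above, but checks the result of realloc.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Copy one name from *pp into buf (capacity n >= 1) and return the
   character that ended it, or '\0' when the input runs out. */
static char read_one_name(const char **pp, char *buf, size_t n)
{
    const char *p = *pp;
    size_t len = 0;
    char delim;

    while (isspace((unsigned char) *p)) p++;          /* leading blanks */
    while (*p != '\0' && *p != ',' && *p != ':' && *p != '\n' &&
           !(*p == '.' && (p[1] == '\0' || isspace((unsigned char) p[1])))) {
        if (len + 1 < n) buf[len++] = *p;             /* truncate if too long */
        p++;
    }
    while (len > 0 && buf[len - 1] == ' ') len--;     /* trailing blanks */
    buf[len] = '\0';

    delim = *p;
    if (delim != '\0') p++;                           /* consume the delimiter */
    *pp = p;
    return delim;
}

/* Release an array produced by read_class_names. */
void free_names(char **names, int count)
{
    for (int i = 0; i < count; i++) free(names[i]);
    free(names);
}

/* Read "A, B, C." into a growing array stored in *out.
   Returns the number of names, or -1 if an allocation fails. */
int read_class_names(const char *text, char ***out)
{
    char **names = NULL, buf[256], delim;
    int count = 0, ceiling = 0;

    do {
        delim = read_one_name(&text, buf, sizeof buf);
        if (count >= ceiling) {                       /* grow in steps of 100 */
            char **grown = realloc(names, (size_t) (ceiling + 100) * sizeof *names);
            if (grown == NULL) { free_names(names, count); return -1; }
            names = grown;
            ceiling += 100;
        }
        names[count] = malloc(strlen(buf) + 1);
        if (names[count] == NULL) { free_names(names, count); return -1; }
        strcpy(names[count], buf);
        count++;
    } while (delim == ',');                           /* same test as above */

    *out = names;
    return count;
}
```

For the class line "Play, Don't Play." the loop runs twice: the first name ends at ',', so the loop continues, and the second ends at '.', so it stops.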

From this we can roughly work out what the file contents look like. The attribute definitions, for example, are:

outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
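Each of these lines is an attribute name, a ':', and then either the keyword continuous or a comma-separated list of legal values. As a rough sketch of how such a line could be classified, assuming a hypothetical function count_attribute_values and ignoring any other keywords the format may allow:

```c
#include <string.h>

/* Hypothetical helper: returns 0 for a "continuous" attribute, the number
   of comma-separated values for a discrete one, or -1 if there is no ':'. */
int count_attribute_values(const char *line)
{
    const char *p = strchr(line, ':');
    int values = 1;

    if (p == NULL) return -1;
    p++;                                   /* step past the ':' */
    while (*p == ' ') p++;                 /* blanks before the first value */
    if (strncmp(p, "continuous", 10) == 0) return 0;

    for (; *p != '\0'; p++) {
        if (*p == ',') values++;           /* each comma adds one more value */
    }
    return values;
}
```

For the four lines above this gives 3 for outlook, 0 for temperature and humidity, and 2 for windy, which corresponds to the per-attribute value counts that the header comment says are kept in MaxAttVal.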

Next, let us look at what the output means.

I found an English web page that explains the results.

Starting with the "Decision" part of the output:

Secondly, one or more ASCII renditions of a generated decision tree.

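Below is the ASCII representation of the decision tree:

The root attribute of the decision tree is outlook, which has three values: "sunny", "overcast", and "rain".

Below sunny there is a humidity subtree, and rain has a subtree as well (on windy, in the golf example).
The number in parentheses is the number of training instances that fall into that leaf.
There may be two numbers in the parentheses (e.g., 4.0/2.0); the second (2.0) is the number of training instances on that path that are classified incorrectly.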
The sum of the first series of numbers equals the total number of cases read by C4.5 from the golf.data file.
(e.g., 4.0 + 2.0 + 3.0 + 2.0 + 3.0 = 14.0)
The sum of the second series of numbers equals the total number of errors.
(e.g., 0 for this example).
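The two sums described here are easy to check mechanically. The following sketch parses a leaf annotation of the form "(N)" or "(N/E)" and adds up both series; parse_leaf_counts and sum_leaves are hypothetical names, and the leaf strings are assumed to have been cut out of the tree listing beforehand.

```c
#include <stdio.h>

/* Hypothetical helper: parse "(N)" or "(N/E)".
   Returns 1 on success and 0 if the text is not a leaf annotation. */
int parse_leaf_counts(const char *s, double *cases, double *errors)
{
    int fields = sscanf(s, " (%lf/%lf", cases, errors);

    if (fields == 1) *errors = 0.0;        /* "(N)" means no errors listed */
    return fields >= 1;
}

/* Add up the first and second numbers over n leaf annotations.
   Returns the total number of cases; total errors go to *total_errors. */
double sum_leaves(const char *const leaves[], int n, double *total_errors)
{
    double total = 0.0, cases, errors;

    *total_errors = 0.0;
    for (int i = 0; i < n; i++) {
        if (parse_leaf_counts(leaves[i], &cases, &errors)) {
            total += cases;
            *total_errors += errors;
        }
    }
    return total;
}
```

Feeding it the five leaves from the example, "(4.0)", "(2.0)", "(3.0)", "(2.0)" and "(3.0)", gives 14.0 cases and 0 errors, matching the figures quoted above.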
Two binary files are created during execution:
filestem.unpruned: the unpruned decision tree generated and used by C4.5
filestem.tree: the pruned decision tree generated and used by C4.5 which is subsequently required by C4.5rules to generate rules.

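Part three:
Finally, the unpruned and pruned decision trees are compared using their results on the training data.

The first table shows the results for the unpruned tree. It has two columns:

Size: the number of nodes in the decision tree.

Errors: the classification errors, given as a percentage of the total number of cases.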

The second table illustrates the fitness of the pruned tree. It has three columns:
Size: the size of the pruned tree. It is either less than or equal to that of the unpruned tree depending upon the extent of the pruning performed by C4.5.
Errors: the number of classification errors and their corresponding actual error percentage after pruning.
Estimate: the estimated error percentage of the tree after pruning, useful when comparing with the actual percentage.
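The actual error percentage in the Errors columns is simply the number of misclassified cases divided by the number of cases read. A tiny sketch of that arithmetic, with error_percent as a hypothetical name (the Estimate column comes from C4.5's pruning calculation and is not reproduced here):

```c
/* Hypothetical helper: misclassified cases as a percentage of all cases.
   Returns 0 when there are no cases, to avoid dividing by zero. */
double error_percent(double errors, double cases)
{
    if (cases <= 0.0) return 0.0;
    return 100.0 * errors / cases;
}
```

With the golf example, 0 errors out of 14 cases gives 0%; a tree that misclassified 2 of 8 cases would show 25%.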
