Focus of the analysis: how this code reads in its input data.
strcpy(Fn, FileName);
strcat(Fn, ".names");
First the file name is extended with the .names suffix, so the file that gets opened and read is the .names file.
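The strcpy/strcat pair above assumes that Fn is large enough for the stem plus the suffix. As a minimal sketch of the same step with a bounds check, the helper below builds the name with snprintf. build_path is a hypothetical name used only for illustration and is not part of the C4.5 sources.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from C4.5): write "<stem><ext>" into out.
   Returns 0 on success, -1 if the result would not fit in out_size bytes. */
int build_path(char *out, size_t out_size, const char *stem, const char *ext)
{
    int n = snprintf(out, out_size, "%s%s", stem, ext);

    /* snprintf returns the length it wanted to write; a value that is
       negative or not smaller than out_size means failure or truncation. */
    return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

Called as build_path(Fn, sizeof Fn, FileName, ".names"), this would report an over-long file stem to the caller rather than overrun the buffer.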
The input format deserves a closer look:
while ( ( c = getc(f) ) == '|' || Space(c) )
{
if ( c == '|' ) SkipComment;
}
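This loop skips whitespace and comments before a name is read. A comment is introduced by '|' and SkipComment discards the rest of that line, so comments are not limited to the top of the file; they can precede any name.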
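The skipping logic can be sketched as a small standalone function. This is an illustration under stated assumptions, not the C4.5 code itself: next_significant_char is a hypothetical name, and the sketch assumes that a '|' comment runs to the end of its line.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: skip whitespace and '|' comments in f.
   Returns the first significant character, or EOF if none is left. */
int next_significant_char(FILE *f)
{
    int c;

    while ((c = getc(f)) != EOF) {
        if (c == '|') {
            /* Assumed comment rule: discard everything up to the newline. */
            while ((c = getc(f)) != EOF && c != '\n')
                ;
            if (c == EOF)
                break;
        } else if (!isspace(c)) {
            break;              /* first character of a name */
        }
    }
    return c;
}
```

Unlike the original loop, the sketch also stops at end of file, so an input that ends inside a comment does not loop forever.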
/* Read the names of classes, attributes and legal attribute values. */
/* On completion, these names are stored in: */
/* ClassName - class names */
/* AttName - attribute names */
/* AttValName - attribute value names */
/* with: */
/* MaxAttVal - number of values for each attribute */
/* */
/* Other global variables set are: */
/* MaxAtt - maximum attribute number */
/* MaxClass - maximum class number */
/* MaxDiscrVal - maximum discrete values for any attribute */
/* */
/* Note: until the number of attributes is known, the name */
/* information is assembled in local arrays */
/* */
do
{
ReadName(Nf, Buffer);
if ( ++MaxClass >= ClassCeiling)
{
ClassCeiling += 100;
ClassName = (String *) realloc(ClassName, ClassCeiling*sizeof(String));
}
ClassName[MaxClass] = CopyString(Buffer);
}
while ( Delimiter == ',' );
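How is the loop condition organised? It uses Delimiter == ',' as the test: names keep being read for as long as the name just read was followed by a comma.
From this we can roughly work out what the entries in the file look like: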
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
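To make the delimiter-driven loop concrete, here is a minimal sketch that parses one attribute line of the form shown above. It works on a string rather than on a file. NameList, read_name, append_name and parse_attribute_line are hypothetical names for this illustration, and the sketch simplifies the real format by treating every ':', ',' and '.' as a delimiter, so names containing a period are not handled.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Growable list of heap-allocated strings (hypothetical type). */
typedef struct {
    char **items;
    int count;
    int capacity;
} NameList;

/* Copy the next name from *cursor into buf, trimming surrounding blanks.
   Returns the delimiter that ended it (':', ',' or '.'), or '\0' at end of input. */
static char read_name(const char **cursor, char *buf, size_t buf_size)
{
    const char *p = *cursor;
    size_t n = 0;
    char delim;

    while (*p && isspace((unsigned char) *p)) p++;
    while (*p && *p != ':' && *p != ',' && *p != '.') {
        if (n + 1 < buf_size) buf[n++] = *p;
        p++;
    }
    while (n > 0 && isspace((unsigned char) buf[n - 1])) n--;
    buf[n] = '\0';

    delim = *p;
    *cursor = *p ? p + 1 : p;
    return delim;
}

/* Append a copy of name, growing the array in steps of 100 as the source does.
   Returns 0 on success, -1 if an allocation fails. */
static int append_name(NameList *list, const char *name)
{
    char *copy;

    if (list->count >= list->capacity) {
        int new_cap = list->capacity + 100;
        char **grown = realloc(list->items, new_cap * sizeof *grown);
        if (!grown) return -1;
        list->items = grown;
        list->capacity = new_cap;
    }
    copy = malloc(strlen(name) + 1);
    if (!copy) return -1;
    strcpy(copy, name);
    list->items[list->count++] = copy;
    return 0;
}

/* Parse "name: v1, v2, v3." into att and values.
   Returns the number of values, or -1 on a malformed line. */
int parse_attribute_line(const char *line, char *att, size_t att_size,
                         NameList *values)
{
    char buf[256];
    char delim;

    if (read_name(&line, att, att_size) != ':') return -1;
    do {
        delim = read_name(&line, buf, sizeof buf);
        if (buf[0] == '\0' || append_name(values, buf) != 0) return -1;
    } while (delim == ',');          /* same stopping rule as the source loop */

    return delim == '.' ? values->count : -1;
}
```

For the line "outlook: sunny, overcast, rain." the sketch stores outlook as the attribute name and three values; for "temperature: continuous." it stores the single value continuous, which a real reader would then interpret as a marker for a numeric attribute.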
Next, let us look at what the output means.
I found an English web page that explains the results.
Starting with the decision tree part of the output:
Secondly, one or more ASCII renditions of a generated decision tree.
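That is, the output contains the decision tree in ASCII form.
The root of the tree tests the attribute outlook, which has three values: "sunny", "overcast", and "rain".
The sunny branch leads to a subtree that tests humidity, and the rain branch leads to a subtree that tests windy.
The number in parentheses after a leaf is the number of training cases that reach that leaf.
The parentheses may hold two numbers (e.g., 4.0/2.0); the second one (2.0) is the number of training cases on that path that are classified incorrectly.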
The sum of the first series of numbers equals the total number of cases read by C4.5 from the golf.data file.
(e.g., 4.0 + 2.0 + 3.0 + 2.0 + 3.0 = 14.0)
The sum of the second series of numbers equals the total number of errors.
(e.g., 0 for this example).
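The bookkeeping described above can be checked with a few lines of C. The sketch below reads a leaf annotation written as "(N)" or "(N/E)" and sums both series. parse_leaf_counts and sum_leaf_counts are hypothetical helpers for illustration; they are not part of C4.5, and they do not verify the closing parenthesis.

```c
#include <stdio.h>

/* Hypothetical helper: parse "(N)" or "(N/E)".
   A missing error count is taken as 0. Returns 0 on success, -1 otherwise. */
int parse_leaf_counts(const char *text, double *cases, double *errors)
{
    int fields = sscanf(text, " (%lf/%lf", cases, errors);

    if (fields == 2) return 0;
    if (fields == 1) {
        *errors = 0.0;          /* "(N)" form: no misclassified cases */
        return 0;
    }
    return -1;
}

/* Sum the case and error counts over n leaf annotations.
   Returns 0 on success, -1 if any annotation fails to parse. */
int sum_leaf_counts(const char *leaves[], int n,
                    double *total_cases, double *total_errors)
{
    int i;

    *total_cases = 0.0;
    *total_errors = 0.0;
    for (i = 0; i < n; i++) {
        double c, e;
        if (parse_leaf_counts(leaves[i], &c, &e) != 0) return -1;
        *total_cases += c;
        *total_errors += e;
    }
    return 0;
}
```

Fed the five leaf annotations of the golf example, (4.0), (2.0), (3.0), (2.0) and (3.0), the totals come out as 14 cases and 0 errors, matching the figures quoted above.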
Two binary files are created during execution:
filestem.unpruned: the unpruned decision tree generated and used by C4.5
filestem.tree: the pruned decision tree generated and used by C4.5 which is subsequently required by C4.5rules to generate rules.
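Part three:
Finally, the unpruned and the pruned decision trees are each evaluated against the training data, and the two results are reported side by side.
The first table shows how well the unpruned tree fits the training data. It has two columns:
Size: the number of nodes in the decision tree.
Errors: the misclassified cases, given as a percentage of all the cases.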
The second table illustrates the fitness of the pruned tree. It has three columns:
Size: the size of the pruned tree. It is either less than or equal to that of the unpruned tree depending upon the extent of the pruning performed by C4.5.
Errors: the number of classification errors and their corresponding actual error percentage after pruning.
Estimate: the estimated error percentage of the tree after pruning, useful when comparing with the actual percentage.