Understanding the C4.5 Parameters

ActionLi

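3.2.1    -f project name
Followed by the project name (the file stem). Do not include the .nam extension.
3.2.2    -u
Evaluates the decision tree produced by training against the corresponding .tes file.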
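3.2.3    -s
Tries to group the values of enumerated (discrete) attributes. Suppose an attribute (call it CiLei) holds a part-of-speech tag with 26 possible values: a, b, c, d, e, f, …, z. Without -s, any test on the CurrentCiLei attribute splits the tree into 26 branches at once. With -s, C45_VC automatically tries combinations of the values, and the resulting branches may look like this: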
|   |   CurrentCiLei = p: 1 (2.0/1.0)
|   |   CurrentCiLei in {b,e,g,h,i,j,k,l,r,u,w,x,y,z,new,old,{,0} 1 (0.0)
|   |   CurrentCiLei in {a,c,d,f,m,n,o,q,s,t,v,ngp}
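Grouping values this way leaves more training cases in each branch, which generally helps the quality of the trained tree.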
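One way to picture the value grouping done by -s is a greedy search that repeatedly merges the pair of value groups giving the best gain ratio. The sketch below is only an illustration under that assumption; the function names and the input layout are hypothetical, and the search C45_VC actually performs may differ.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(counts):
    """Entropy in bits of a collection of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(groups):
    """groups: list of Counters of class labels, one Counter per branch."""
    sizes = [sum(g.values()) for g in groups]
    total = sum(sizes)
    merged = sum(groups, Counter())
    remainder = sum(s / total * entropy(g.values()) for s, g in zip(sizes, groups))
    split_info = entropy(sizes)
    return (entropy(merged.values()) - remainder) / split_info if split_info else 0.0

def group_values(by_value):
    """by_value: {attribute value: Counter of class labels} (hypothetical layout).

    Greedily merges pairs of value groups while the gain ratio does not drop.
    """
    blocks = [({v}, c) for v, c in by_value.items()]
    best = gain_ratio([c for _, c in blocks])
    while len(blocks) > 2:
        candidates = []
        for i, j in combinations(range(len(blocks)), 2):
            merged = (blocks[i][0] | blocks[j][0], blocks[i][1] + blocks[j][1])
            rest = [b for k, b in enumerate(blocks) if k not in (i, j)]
            trial = rest + [merged]
            candidates.append((gain_ratio([c for _, c in trial]), trial))
        score, trial = max(candidates, key=lambda t: t[0])
        if score < best:
            break  # no merge keeps the gain ratio at least as high
        best, blocks = score, trial
    return [values for values, _ in blocks]
```

With four values where a and b are always one class and c and d always the other, the search ends with the two groups {a, b} and {c, d} instead of four separate branches.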
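3.2.4    -m number
The number after -m is the minimum number of supporting cases a branch must hold before it can be split further. For example, -m 5 means a branch is considered for further splitting only if at least 5 cases reach it.
This parameter helps improve accuracy on unseen (out-of-sample) test data.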
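In Quinlan's C4.5 the minimum-case check is applied to candidate tests: a test is considered only if at least two of its branches each receive at least the given number of cases. A minimal sketch of that rule follows; the function name is hypothetical, and C45_VC may apply the limit somewhat differently.

```python
def split_is_admissible(branch_sizes, min_cases=2):
    """Return True if at least two branches hold at least min_cases cases each."""
    return sum(1 for n in branch_sizes if n >= min_cases) >= 2
```

For example, with -m 5 a candidate split sending 6, 1 and 5 cases down its branches passes the check, while one sending 6 and 4 cases does not.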
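3.2.5    -c number
The number after -c is the pruning confidence level, given as a percentage. C4.5 first grows a complete decision tree and then prunes it; pruning generally improves accuracy on unseen test data. The default is 25%. Note that this value is the confidence level used for the pessimistic error estimate during pruning, not the fraction of the original tree that is kept: smaller values prune more heavily, so -c 5 prunes harder than the default. In C45_VC the number must be an integer. As a rule, heavier pruning lowers accuracy on the training set while improving results on unseen data.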
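In Quinlan's C4.5 the -c value enters pruning as the confidence level CF of a pessimistic error estimate: for a leaf with N cases and E errors, the estimated error rate is the upper confidence limit of a binomial distribution. The sketch below computes that limit exactly by bisection. It is an illustration only: the function name is hypothetical, CF is passed as a fraction (0.25 for -c 25), and the C4.5 source uses its own approximation rather than this exact computation.

```python
from math import comb

def upper_error_rate(errors, cases, cf=0.25):
    """Upper confidence limit on the true error rate of a leaf (exact binomial)."""
    def tail(p):
        # Probability of seeing at most `errors` mistakes if the true rate is p.
        return sum(comb(cases, k) * p**k * (1 - p)**(cases - k)
                   for k in range(errors + 1))

    lo, hi = 0.0, 1.0
    for _ in range(60):  # tail(p) decreases as p grows, so bisection applies
        mid = (lo + hi) / 2
        if tail(mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

A leaf with 6 cases and no errors gets an estimated error rate of about 0.206 at CF = 25%. Lowering CF raises the estimate, which makes subtrees look worse relative to a single leaf and therefore leads to more pruning.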
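3.2.6    -v number
Sets the level of debugging output. The number can be 0, 1, 2, 3, 4 or 5: 0 prints the least (and the most essential) debugging information, 5 prints the most. The default is 1.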
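3.2.7    -b
Controls whether windowing is used while the decision tree is built. Windowing is intended to speed up training. Two settings are associated with the window: WINDOW and INCREMENT. The following source comment explains what they mean: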
/*  Construct a classifier tree using the data items in the                 */
/*  window, then test for the successful classification of other            */
/*  data items by this tree.  If there are misclassified items,             */
/*  put them immediately after the items in the window, increase         */
/*  the size of the window and build another classifier tree, and        */
/*  so on until we have a tree which successfully classifies all             */
/*  of the test items or no improvement is apparent.                     */
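-b disables the windowing mechanism, which may reduce training efficiency. According to this description, C45_VC uses windowing by default (in Quinlan's C4.5 Release 8, windowing is off unless -t, -w or -i is given). The default values are defined as follows: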
/*  If necessary, set initial size of window to 20% (or twice
    the sqrt, if this is larger) of the number of data items,
    and the maximum number of items that can be added to the
    window at each iteration to 20% of the initial window size  */
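The defaults in the comment above can be written out as a small calculation. This is a sketch that follows the comment literally; the function name is hypothetical, the results are rounded down to whole items, and the guard that keeps the increment at 1 or more is an assumption.

```python
import math

def default_window(n_items):
    """Return (window, increment) as described in the quoted comment."""
    # 20% of the items, or twice the square root if that is larger.
    window = max(n_items // 5, math.isqrt(4 * n_items))
    # At most 20% of the initial window may be added per iteration.
    increment = max(window // 5, 1)
    return window, increment
```

For 1000 data items this gives a window of 200 and an increment of 40; for 25 items the square-root term dominates, giving a window of 10 and an increment of 2.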
3.2.8    -w
Sets the initial size of the window (WINDOW).
3.2.9    -i
Sets the initial maximum value of INCREMENT.
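WINDOW and INCREMENT come together in the windowing loop described by the source comment under -b. The sketch below follows that description with a toy learner standing in for the tree builder. All names are hypothetical, and the `max_rounds` cap is an assumption that replaces the "no improvement is apparent" stopping test.

```python
from collections import Counter

def memorizer(sample):
    """Toy learner: remembers seen items, falls back to the majority class."""
    table = {x: y for x, y in sample}
    default = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def window_train(items, train, window, increment, max_rounds=10):
    """items: list of (features, label); train: callable returning a predictor."""
    items = list(items)
    window = min(window, len(items))
    model = train(items[:window])
    for _ in range(max_rounds):
        outside = items[window:]
        wrong = [it for it in outside if model(it[0]) != it[1]]
        if not wrong:
            break  # every item outside the window is classified correctly
        right = [it for it in outside if model(it[0]) == it[1]]
        # Put misclassified items directly after the window, then grow it.
        items = items[:window] + wrong + right
        window += min(increment, len(wrong))
        model = train(items[:window])
    return model, window
```

Starting from a window of 2 and an increment of 1 on five items, the loop keeps pulling misclassified items into the window until the remaining items are all classified correctly.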
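3.2.10    -p
Uses probabilistic (soft) thresholds: instead of committing a case to exactly one branch, a case that falls close to a threshold is shared between the branches with some probability. In Quinlan's C4.5 this applies to tests on continuous attributes rather than to pruning.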
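A soft decision at a numeric threshold can be sketched as a weight that falls linearly from 1 to 0 across a band around the cut point, so that a case near the threshold is partly sent down both branches. The function and parameter names below are hypothetical, and the linear interpolation is an assumption for illustration rather than a copy of the C45_VC code.

```python
def left_branch_weight(value, lower, cut, upper):
    """Weight of the 'value <= cut' branch; requires lower <= cut <= upper."""
    if value <= lower:
        return 1.0  # clearly below the threshold band
    if value <= cut:
        return 1.0 - 0.5 * (value - lower) / (cut - lower)
    if value < upper:
        return 0.5 - 0.5 * (value - cut) / (upper - cut)
    return 0.0  # clearly above the threshold band
```

A value exactly at the cut point gets weight 0.5 for each branch, while values outside the band between `lower` and `upper` are assigned to a single branch as with a hard threshold.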
3.2.11    -g
While training, the decision tree chooses its splits according to a gain value, which can be defined in two ways:
/************************************************************************    */
/*                                                                           */
/*  Determine the worth of a particular split according to the                           */
/*  operative criterion                                                          */
/*                                                                           */
/*          Parameters:                                                      */
/*              SplitInfo:      potential info of the split                              */
/*              SplitGain:      gain in info of the split                            */
/*              MinGain:        gain above which the Gain Ratio                  */
/*                              may be used                                  */
/*                                                                           */
/*  If the Gain criterion is being used, the information gain of                             */
/*  the split is returned, but if the Gain Ratio criterion is                                */
/*  being used, the ratio of the information gain of the split to                            */
/*  its potential information is returned.                                           */
/*                                                                           */
/************************************************************************    */
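The -g option selects between the two criteria: the ratio of the information gain to the potential (split) information, or the plain information gain. The choice can change the decision on whether, and on which attribute, to split further.
In Quinlan's C4.5 Release 8 the default is the gain ratio criterion, and -g switches to the plain information gain.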
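The two criteria named in the comment can be computed side by side. The following is a minimal sketch with hypothetical function names; it takes the class counts of each branch of a candidate split and returns both the information gain and the gain ratio.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_and_ratio(branches):
    """branches: per-branch class counts, e.g. [[2, 3], [4, 0], [3, 2]]."""
    sizes = [sum(b) for b in branches]
    total = sum(sizes)
    class_totals = [sum(col) for col in zip(*branches)]
    # Information gain: entropy before the split minus weighted entropy after.
    gain = entropy(class_totals) - sum(
        s / total * entropy(b) for s, b in zip(sizes, branches))
    # Split information: entropy of the branch sizes themselves.
    split_info = entropy(sizes)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio
```

For a three-way split of 14 cases with class counts [2, 3], [4, 0] and [3, 2], the gain is about 0.247 bits and the gain ratio about 0.156. The ratio penalizes splits with many branches, since their split information is larger.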
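3.2.12    -t number
Sets TRIALS, the number of trials used with the windowing mechanism; in Quinlan's C4.5, giving this option turns windowing on.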

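Reposted from the Shuimu forum (水木论坛): http://bbs.tsinghua.edu.cn/frames.php
faint: I spent a whole day writing up my own understanding, only to lose it when saving. I could not be bothered to write it again, so I am reposting someone else's notes, which explain it better than I did.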
