History of Pruning Algorithm Development and Python Implementation (finished)

Python implementations for all 7 post-pruning algorithms are collected here.

Table of Decision Trees:

| name of tree | inventor | name of article | year |
|---|---|---|---|
| ID3 | Ross Quinlan | 《Discovering rules by induction from large collections of examples》 | 1979 |
| ID3 | Ross Quinlan | Another origin: 《Learning efficient classification procedures and their application to chess end games》 | 1983a |
| CART | Leo Breiman | 《Classification and Regression Trees》 | 1984 |
| C4.5 | Ross Quinlan | 《C4.5: Programs for Machine Learning》 | 1993 |
| C5.0 | Ross Quinlan | Commercial edition of C4.5, no relevant papers | - |

Table of Pruning Algorithms:

| name of post-pruning algorithm | name of article or book | year | inventor | the tree pruned | Remark |
|---|---|---|---|---|---|
| Pessimistic Pruning | 《Simplifying Decision Trees》 Section 2.3 | 1986b (some sources say 1987b; the year printed on the paper is used here) | Quinlan | C4.5 | Ross Quinlan invented "Pessimistic Pruning"; John Mingers renamed it "Pessimistic Error Pruning" in his article 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》 |
| Reduced Error Pruning | 《Simplifying Decision Trees》 Section 2.2 | 1986b | Quinlan | C4.5 | Requires an extra validation set to prune |
| Cost-Complexity Pruning | 《Classification and Regression Trees》 Section 3.3 | 1984 | L. Breiman | CART | For classification trees only |
| Error-Complexity Pruning | 《Classification and Regression Trees》 Section 8.5.1 | 1984 | L. Breiman | CART | For regression trees only |
| Critical Value Pruning | 《Expert System-Rule Induction with Statistical Data》; another view says 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》, but that paper's author says it cites the 1987 paper | 1987a | John Mingers | not explicitly stated in the paper | The origin of the CVP algorithm is disputed; the source given here follows what is mentioned on p. 212 of 《An Empirical Comparison of Pruning Methods for Ensemble Classifiers》 |
| Minimum-Error Pruning | 《Learning decision rules in noisy domains》 | 1986 | Niblett and Bratko | - | Cannot be downloaded from the Internet |
| Minimum-Error Pruning | 《On estimating probabilities in tree pruning》 | 1991 | Bojan Cestnik, Ivan Bratko | - | Modified MEP algorithm |
| Error-Based Pruning | 《C4.5: Programs for Machine Learning》 Section 4.2 | 1993 | Quinlan | C4.5 | EBP is an evolution of PEP |

Pruning targets for classification trees (ID3, C4.5, CART-classification):
1. simplify your classification tree without losing too much precision
2. improve generalization ability on validation sets
3. alleviate overfitting

Pruning targets for regression trees (CART-regression):
1. simplify your regression tree without increasing MSE too much
2. improve generalization ability on validation sets
3. alleviate overfitting

----------------For C4.5-----------------
U25%(1,16) and U25%(1,168) in 《C4.5: Programs for Machine Learning》 (see the sketch at the end of this list)
Differences between Weka 3.6.10's C4.5 and Professor Quinlan's C4.5 algorithm
Reproducing the experimental results of Chapter 4 of 《C4.5: Programs for Machine Learning》
How Ross Quinlan handles missing values in C4.5 Release 8
A detailed reading of the relationship between the latest C4.5 Release 8 and MDL
Some understanding of 《Inferring Decision Trees Using the Minimum Description Length Principle》
Some understanding of 《Improved Use of Continuous Attributes in C4.5》
Code architecture diagram of C4.5 Release 8
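
As a hedged illustration of the U_CF(E, N) values mentioned in the first item above, the following sketch (not taken from the linked posts) computes the exact binomial upper confidence limit that error-based pruning relies on; Quinlan's own C code uses its own approximation, so the numbers may differ slightly from the program's output.

```python
# A minimal sketch, assuming the exact binomial (Clopper-Pearson style) inversion.
from scipy.stats import beta

def upper_error_limit(errors, n, cf=0.25):
    """Upper confidence limit U_CF(errors, n) on the true error rate of a leaf
    that misclassifies `errors` out of `n` training cases."""
    if errors >= n:
        return 1.0
    # solve P(Binomial(n, p) <= errors) = cf for p via the beta quantile
    return beta.ppf(1.0 - cf, errors + 1, n - errors)

print(upper_error_limit(0, 6))    # roughly 0.206, as tabulated in the book
print(upper_error_limit(1, 16))   # the U25%(1,16) case discussed above
print(upper_error_limit(1, 168))  # the U25%(1,168) case discussed above
```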

------------REP-start--------
Detailed explanation of the REP (Reduced Error Pruning) code for ID3 + drawing Figures 4.5, 4.6 and 4.7 of Zhou Zhihua's 《机器学习》(Machine Learning)
Drawing Figures 4.4 and 4.9 of Zhou Zhihua's 《机器学习》 (reprint, with entropy display added)
Handling continuous values in an ID3 decision tree + drawing Figures 4.8 and 4.10 of Zhou Zhihua's 《机器学习》
sklearn does not implement the ID3 algorithm

------------REP-end--------

----------------PEP-start-----------------
Earliest PEP algorithm principles
Pessimistic Error Pruning example for C4.5
Pessimistic Error Pruning illustration for C4.5, with a Python implementation (a rule sketch follows below)
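
For orientation, here is a hedged sketch of the PEP decision rule from 《Simplifying Decision Trees》 Section 2.3, assuming the error counts at the node and at the subtree's leaves are already known; the numbers in the example are hypothetical, not from the linked posts.

```python
# A minimal sketch of the pessimistic-error rule (continuity correction of 1/2 per leaf).
import math

def pep_should_prune(n, e_node, leaf_errors):
    """True if the subtree rooted at this node should be replaced by a leaf."""
    # corrected error of the node if it were a leaf
    node_err = e_node + 0.5
    # corrected error of the subtree: sum of leaf errors + 1/2 per leaf
    subtree_err = sum(leaf_errors) + 0.5 * len(leaf_errors)
    # standard error of the subtree's corrected error count
    se = math.sqrt(subtree_err * (n - subtree_err) / n)
    return node_err <= subtree_err + se

# hypothetical node: 40 cases, 5 errors as a leaf, subtree of 3 leaves with 1+1+0 errors
print(pep_should_prune(40, 5, [1, 1, 0]))
```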
----------------PEP-end----------------

-------------EBP-start--------------------
Error Based Pruning: algorithm, code implementation and examples
One might ask why we do not simply use the Python interface to weka's J48.
Note that weka follows Quinlan's C-language code (in other words, J48 is a port of C4.5 Release 8); on some datasets, for example the hypo dataset, weka generates a very large decision tree, which is very bad.
Since the original purpose of a decision tree is to aid classification and to produce knowledge,
a huge decision tree is obviously hard to use.
J48: Java edition of C4.5 Release 8
-------------EBP-end--------------------

-------------MEP-start--------------------
Two examples of Minimum Error Pruning (reprint)
MEP (Minimum Error Pruning) principle with a Python implementation (an m-estimate sketch follows below)
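
As a rough sketch (not the repository code) of the Cestnik-Bratko m-estimate that the modified MEP relies on, assuming the class counts at a node and the prior class probabilities are given; all numbers below are made up for illustration.

```python
# A minimal sketch of the m-probability estimate used by (modified) MEP.
def m_estimate_error(class_counts, priors, m):
    """Expected error rate at a node if it were turned into a leaf."""
    n = sum(class_counts.values())
    # m-probability estimate for each class; the leaf predicts the most probable one
    probs = {c: (class_counts.get(c, 0) + priors[c] * m) / (n + m) for c in priors}
    return 1.0 - max(probs.values())

# hypothetical node: 12 examples of class 'a', 3 of class 'b';
# priors come from the whole training set; m controls the pruning strength
priors = {"a": 0.6, "b": 0.4}
static_error = m_estimate_error({"a": 12, "b": 3}, priors, m=2)

# MEP prunes when this static error is not larger than the weighted
# (by example counts) average of the children's expected errors.
print(static_error)
```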

-----------------MEP-end--------------------

-----------------CVP-start--------------------
Proof of the chi-square statistic used in CVP
CVP (Critical Value Pruning) illustration, with the principle explained in detail
CVP (Critical Value Pruning) examples with a Python implementation
Error in a paper about CVP
Computing contingency tables in Python and analysing the experimental results
Computing the chi-square distribution in Python
-----------------CVP-end--------------------

------------------------For CART-------------------------------------------------------
Notes from 《Classification and Regression Trees》
--------CCP-start---------------------
Drawing the decision tree on page 59 of 《统计学习方法》(Statistical Learning Methods) with sklearn
CCP (Cost Complexity Pruning) on sklearn with a Python implementation
A theoretical defect in selecting the best pruned tree from CCP with cross-validation
1-SE rule details in CCP pruning of CART
--------CCP-end---------------------

--------ECP-start---------------------
Explaining the difference between model trees and regression trees with examples
Error Complexity Pruning for sklearn's Regression Tree with a Python implementation
--------ECP-end---------------------

Note:
Critical Value Pruning can be used both for pre-pruning and for post-pruning.
In pre-pruning, the information measure (IM) used to choose attributes is replaced with the chi-square statistic.
When post-pruning with the $\chi^2$ test, you need an independent dataset.
Of course, if you grow a tree with CVP (pre-pruning), there is no need to post-prune it with CVP on the same data that was used to grow the tree.
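
As a rough sketch of the chi-square test mentioned above, the statistic for a candidate split can be computed from the value-by-class contingency table with scipy; the counts and the critical value below are hypothetical, chosen only for illustration.

```python
# A minimal sketch: chi-square statistic of a split, compared against a CVP critical value.
import numpy as np
from scipy.stats import chi2_contingency

# rows: attribute values at the node, columns: class labels (hypothetical counts)
observed = np.array([[30,  5],
                     [10, 25],
                     [ 8, 12]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
critical_value = 6.0   # example CVP threshold, set by the user

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
if chi2 < critical_value:
    print("split is weaker than the critical value -> prune / do not split")
else:
    print("split exceeds the critical value -> keep the node")
```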

Table about whether you need extra validation sets when pruning
In the "Reference and Quotation" column of the following table, the word "test" means a set with class labels, so it actually means a validation dataset.

| Pruning Algorithm | Need extra validation datasets? | Reference and Quotation |
|---|---|---|
| REP (Reduced Error Pruning) | yes | According to 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》 Section 2.2.4: "The pruned node will often make fewer errors on the test data than the sub-tree makes." |
| CCP (Cost Complexity Pruning) | pruning stage: no; selecting the best pruned tree: ① (small datasets) cross-validation: yes; ② (large datasets) 1-SE rule: no | According to 《Simplifying Decision Trees》 Section 2.1: "Finally, the procedure requires a test set distinct from the original training set" |
| ECP (Error Complexity Pruning) | pruning stage: no; selecting the best pruned tree: ① (small datasets) cross-validation: yes; ② (large datasets) 1-SE rule: no | According to 《An Empirical Comparison of Pruning Methods for Decision Tree Induction》 Section 2.2.1, page 230: "Instead, each of the pruned trees is used to classify an independent test data set" |
| CVP (Critical Value Pruning) | pre-pruning: no; post-pruning: yes | 《Expert System-Rule Induction with Statistical Data》 (pre-pruning): "Initial runs were performed using chi-square purely as a means of choosing attributes - not of judging their significance - on the two years separately and on the data as a whole." |
| MEP (Minimum Error Pruning) | it all depends on how you set m | According to 《On Estimating Probabilities in Tree Pruning》 Section 1: "m can be adjusted to some essential properties of the learning domain, such as the level of noise in the learning data." Section 2: "m can be set so as to maximise the classification accuracy on an independent data set" |
| PEP (Pessimistic Error Pruning) | no | According to 《Simplifying Decision Trees》 Section 2.3: "Unlike these methods, it does not require a test set separate from the cases in the training set from which the tree was constructed." |
| EBP (Error Based Pruning) | no | According to 《C4.5: Programs for Machine Learning》 page 40: "The approach taken in C4.5 belongs to the second family of techniques that use only the training set from which the tree was built." |

Attention:
When your dataset is small, you need cross-validation in CCP/ECP to produce K candidate models; each candidate model is then pruned repeatedly to obtain its pruned-tree sequence (K such sequences in total), and the best pruned tree is finally chosen from among the K sequences.

When your dataset is large, you need the 1-SE rule in CCP/ECP to select the best pruned tree.
In the GitHub link above, we use method ②.
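
As a rough illustration only (not the repository code), the following sketch combines sklearn's cost_complexity_pruning_path with cross-validation and the 1-SE rule; the dataset and parameters are arbitrary assumptions.

```python
# A minimal sketch of CCP: build the pruned-tree sequence, score it by cross-validation,
# then apply the 1-SE rule to pick the smallest tree within one standard error of the best.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Candidate alphas of the cost-complexity pruned-tree sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]          # drop the alpha that prunes down to the root

# 2. Cross-validate each candidate pruned tree.
means, stds = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                             X, y, cv=5)
    means.append(scores.mean())
    stds.append(scores.std(ddof=1) / np.sqrt(len(scores)))
means, stds = np.array(means), np.array(stds)

# 3. 1-SE rule: among trees within one standard error of the best score,
#    take the largest alpha, i.e. the smallest tree.
best = means.argmax()
threshold = means[best] - stds[best]
best_alpha = alphas[means >= threshold].max()
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```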

Pruning Style Table of Algorithms

| Pruning Algorithm | Pruning Style | Reference and Quotation |
|---|---|---|
| REP (Reduced Error Pruning) | bottom-up | Not mentioned in the earliest article about it, 《Simplifying Decision Trees》 |
| CCP (Cost Complexity Pruning) | bottom-up | 《Classification and Regression Trees》 (Leo Breiman), Chapter 3.3 "Minimal Cost-Complexity Pruning": "Thus, the pruning process produces a finite sequence of subtrees T1, T2, T3, ... with progressively fewer terminal nodes" |
| ECP (Error Complexity Pruning) | bottom-up | 《Classification and Regression Trees》 (Leo Breiman), Chapter 8.5.1: "Now minimal error-complexity pruning is done exactly as minimal cost-complexity pruning in classification." |
| CVP (Critical Value Pruning) | top-down in pre-pruning; bottom-up in post-pruning | Pre-pruning: 《Expert Systems-Rule Induction with Statistical Data》: "The ID3 algorithm, with the enhancements mentioned previously, was modified to calculate $\chi^2$ instead of IM". Post-pruning: https://www.cs.rit.edu/~rlaz/prec2010/slides/DecisionTrees.pdf |
| MEP (Minimum Error Pruning) | bottom-up | 《The effects of pruning methods on the predictive accuracy of induced decision rules》: "Niblett and Bratko [26] proposed a bottom-up approach for searching a single tree that minimizes the expected error rate." |
| PEP (Pessimistic Error Pruning) | top-down | 《Top-Down Induction of Decision Trees Classifiers - A Survey》 part E: "The pessimistic pruning procedure performs top-down traversing over the internal nodes." |
| EBP (Error Based Pruning) | bottom-up | 《C4.5: Programs for Machine Learning》 page 39: "Start from the bottom of the tree and examine each nonleaf sub-tree." |

Do all of the above pruning methods tend to over-prune or under-prune?
According to the following two articles:
《Simplifying Decision Trees by Pruning and Grafting: New Results》
《Top-Down Induction of Decision Trees Classifiers - A Survey》
the results are summarised in the following table:

| Pruning Algorithm | Tendency of pruning |
|---|---|
| REP | over-pruning, or not significant |
| PEP | under-pruning |
| EBP | under-pruning |
| MEP | under-pruning |
| CVP | under-pruning |
| CCP | under-pruning (from my own experiment) |
| ECP | under-pruning (from my own experiment) |

Markdown table generation tool: https://tool.lu/tables
