MEP (minimum error pruning) principle with Python implementation

According to *Estimating Probabilities: A Crucial Task in Machine Learning*:
$$\Theta=p(C)\cdot\frac{p(C|V_1)}{p(C)}\cdot\frac{p(C|V_1V_2)}{p(C|V_1)}\cdot\frac{p(C|V_1V_2V_3)}{p(C|V_1V_2)}\cdots\quad ①$$

where

$$h(i)=\frac{p(C|V_i)}{p(C)},\qquad p(C|V_i)=\frac{n(CV_i)}{n(V_i)}$$

$$\text{if } n(V_i)=0,\ \text{then } h(i)=1$$

$$\text{if } n(V_i)>0 \text{ and } n(CV_i)=0,\ \text{then } h(i)=0$$
The latter case makes the estimate unreliable: when h(i)=0, no matter how the other factors in ① vary, Θ=0.
To solve this problem, the β distribution is used.
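To see the failure concretely, here is a toy sketch (all counts are made up). For simplicity it multiplies the conditional estimates themselves rather than the exact ratio chain of ①, but the collapse mechanism is identical: one zero factor forces the whole product to 0, while Beta-prior (m-estimate) smoothing keeps every factor positive.

```python
# Toy illustration (made-up counts) of the zero-frequency problem.

def p_freq(n_cv, n_v):
    # relative frequency: p(C|V_i) = n(CV_i) / n(V_i)
    return n_cv / n_v

def p_m(n_cv, n_v, p_a, m=2.0):
    # m-estimate (Beta-prior smoothing): (n(CV_i) + p_a*m) / (n(V_i) + m)
    return (n_cv + p_a * m) / (n_v + m)

counts = [(8, 10), (0, 5), (6, 9)]   # (n(CV_i), n(V_i)); the middle value never co-occurs with C
p_a = 0.5                            # a priori probability of class C

theta_freq = theta_m = 1.0
for n_cv, n_v in counts:
    theta_freq *= p_freq(n_cv, n_v)
    theta_m *= p_m(n_cv, n_v, p_a)

print(theta_freq)   # 0.0 -- the single zero factor wipes out the whole product
print(theta_m)      # > 0 -- smoothing keeps the estimate informative
```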

According to *On Estimating Probabilities in Tree Pruning*:
There are 3 stages:
① the tree-construction stage: m = 0;
② the tree-pruning stage: m > 0;
③ the classification phase: a new, different m.
In summary, at each stage you need a different m.

$$E_s=1-\frac{n_e+p_{ae}\cdot m}{N+m}=\frac{N-n_e+(1-p_{ae})\cdot m}{N+m}$$
- $N$: total number of examples that reach the node.
- $n_e$: number of examples in the class $c$ that minimises $E_s$ for the given $m$.
- $p_{ae}$: a priori probability of class $c$.
- $m$: the parameter of the estimation method.
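This formula transcribes directly into Python; a minimal sketch (the names mirror the definitions above, with `p_a` standing for $p_{ae}$):

```python
def static_error(N, n_e, p_a, m=2.0):
    """Static (expected) error E_s at a node.

    N   -- total number of examples reaching the node
    n_e -- number of examples in the class c that minimises E_s
    p_a -- a priori probability of class c (p_ae above)
    m   -- parameter of the estimation method
    """
    return (N - n_e + (1.0 - p_a) * m) / (N + m)

# Example: 50 examples reach the node, 41 in the majority class with prior 0.4:
print(static_error(50, 41, 0.4))   # (50 - 41 + 0.6*2) / 52 ≈ 0.196
```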

The backed-up error is:

$$E_b=\sum_{i=1}^{k} p_i\cdot E_i$$

where $k$ is the number of sub-trees, $E_i$ is the $i$-th sub-tree's static error, and $p_i$ is the fraction of the node's examples that reach the $i$-th sub-tree.
The criterion to prune a node is:

$$E_b \ge E_s$$
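The pruning step is then a plain comparison of the two errors; a minimal sketch with hypothetical $(p_i, E_i)$ pairs:

```python
def backed_up_error(children):
    # E_b = sum_i p_i * E_i over all sub-trees,
    # where p_i is the fraction of the node's examples reaching sub-tree i.
    return sum(p_i * e_i for p_i, e_i in children)

def should_prune(e_static, children):
    # Prune (replace the sub-trees by a leaf) when E_b >= E_s.
    return backed_up_error(children) >= e_static

children = [(0.6, 0.25), (0.4, 0.15)]    # hypothetical (p_i, E_i) pairs
print(backed_up_error(children))          # 0.21
print(should_prune(0.196, children))      # True -> prune this node
```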

More details from the MEP author's replies:

[images: screenshots of the author's replies]

The datasets and the Python implementation of MEP are both included in my GitHub repo:
https://github.com/appleyuchi/Decision_Tree_Prune
*******************************************************
We use the following settings by default:
m = 2
$p_{ae}$ = the a priori probability of each class
*******************************************************
If you prefer Laplace's law of succession, mentioned on page 139 of *On Estimating Probabilities in Tree Pruning*, just set:
m = the number of classes in the dataset
$p_{ae}=\frac{1}{m}$
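With the `static_error` sketch above, these settings amount to:

```python
k = 3                                            # e.g. a 3-class dataset
print(static_error(50, 41, p_a=1.0 / k, m=k))    # Laplace's law of succession
```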
*******************************************************
Attention:
① MEP was proposed on the basis of ID3, but I decided to implement it on C4.5.
② Although *On Estimating Probabilities in Tree Pruning* says you should set m = 0 when building the decision model, the C4.5 model here is built by the original implementation from
http://www.rulequest.com/Personal/c4.5r8.tar.gz
so we do not need to set m at that stage.
③ Although the paper also says you should set m when you test, I did not set m in my testing, because I use the most common testing mechanism of C4.5 instead of the mechanism described in the paper.


Now let's perform our first MEP (minimum error pruning) experiment, on the abalone dataset.

First, to make the tree easier to visualize, I reorder the dataset by the last column, keep the first 200 items, and save them as abalone_parts.data (you can find this file in my GitHub repo), as sketched below.
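A sketch of this preprocessing step with pandas (assuming the raw UCI file abalone.data with no header row; the repo's actual preprocessing may differ):

```python
import pandas as pd

df = pd.read_csv("abalone.data", header=None)   # raw UCI abalone file
df = df.sort_values(by=df.columns[-1])          # reorder by the last column (rings)
df.head(200).to_csv("abalone_parts.data", header=False, index=False)
```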

The C4.5 model before MEP pruning:
model= {'Viscera': {'<=0.0145': {'Shucked': {'>0.007': ' 4 (66.0/31.0)', '<=0.007': {'Shucked': {'<=0.0045': {'Height': {'<=0.025': ' 1 (2.0/1.0)', '>0.025': ' 3 (2.0)'}}, '>0.0045': {'Shucked': {'>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}}, '<=0.005': ' 4 (3.0)'}}}}}}, '>0.0145': {'Shell': {'<=0.0345': {'Viscera': {'<=0.0285': ' 5 (50.0/9.0)', '>0.0285': ' 4 (3.0)'}}, '>0.0345': {'Sex': {'=M': ' 6 (6.0/3.0)', '=F': ' 5 (3.0)', '=I': ' 5 (59.0/12.0)'}}}}}}

The C4.5 model after MEP pruning:
model_pruned= {'Viscera': {'>0.0145': {'Shell': {'<=0.0345': {'Viscera': {'<=0.0285': ' 5 (50.0/9.0)', '>0.0285': ' 4 (3.0)'}}, '>0.0345': '5 (68/16)'}}, '<=0.0145': {'Shucked': {'>0.007': ' 4 (66.0/31.0)', '<=0.007': {'Shucked': {'>0.0045': {'Shucked': {'>0.005': {'Height': {'<=0.02': ' 4 (2.0)', '>0.02': ' 3 (4.0)'}}, '<=0.005': ' 4 (3.0)'}}, '<=0.0045': {'Height': {'<=0.025': ' 1 (2.0/1.0)', '>0.025': ' 3 (2.0)'}}}}}}}}


Visualization of the unpruned model:

[image]

Visualization of the MEP-pruned model:

[image]

The accuracy of the MEP algorithm on the 200 items of the abalone dataset:
accuracy_unprune= 0.72
accuracy_prune= 0.715
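These accuracies come from walking the nested-dict models shown above. A minimal sketch of such a classifier, assuming the condition-string formats ('<=0.0145', '>3', '=M') and the leaf format (' 4 (66.0/31.0)') seen in the dumps:

```python
def classify(model, row):
    """Walk a nested-dict C4.5 model; `row` maps feature names to values."""
    if isinstance(model, str):                 # leaf like ' 4 (66.0/31.0)'
        return model.strip().split()[0]        # -> predicted class label '4'
    feature = next(iter(model))                # the single splitting attribute
    for cond, subtree in model[feature].items():
        v = row[feature]
        if cond.startswith("<=") and float(v) <= float(cond[2:]):
            return classify(subtree, row)
        if cond.startswith(">") and float(v) > float(cond[1:]):
            return classify(subtree, row)
        if cond.startswith("=") and str(v) == cond[1:]:
            return classify(subtree, row)
    return None                                # no branch matched

def accuracy(model, rows, labels):
    hits = sum(classify(model, r) == str(l) for r, l in zip(rows, labels))
    return hits / len(rows)
```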

The accuracy of the EBP algorithm on the same 200 items:
Evaluation on training data (200 items):

 Before Pruning           After Pruning
----------------   ---------------------------
Size      Errors   Size      Errors   Estimate

  20   56(28.0%)     17   57(28.5%)    (36.1%)   <<

Let's put the above results in the following table:

| pruning method | unpruned accuracy | pruned accuracy |
| --- | --- | --- |
| MEP | 0.72 | 0.715 |
| EBP | 0.72 | 0.715 |

Now let's do the second MEP experiment, on the Credit-a dataset.

Credit-a is from UCI (you can also find it via the GitHub link above):

The unpruned C4.5 model generated by http://www.rulequest.com/Personal/c4.5r8.tar.gz is:

model= {'A9': {'=t': {'A15': {'>228': ' + (106.0/2.0)', '<=228': {'A11': {'>3': {'A15': {'<=4': ' + (25.0)', '>4': {'A15': {'<=5': ' - (2.0)', '>5': {'A7': {'=v': ' + (5.0)', '=bb': ' + (1.0)', '=ff': ' + (0.0)', '=j': ' + (0.0)', '=o': ' + (0.0)', '=n': ' + (0.0)', '=h': ' + (3.0)', '=dd': ' + (0.0)', '=z': ' - (1.0)'}}}}}}, '<=3': {'A4': {'=u': {'A7': {'=v': {'A14': {'<=110': ' + (18.0/1.0)', '>110': {'A15': {'>8': ' + (4.0)', '<=8': {'A6': {'=w': {'A12': {'=t': ' - (2.0)', '=f': ' + (3.0)'}}, '=q': {'A12': {'=t': ' + (4.0)', '=f': ' - (2.0)'}}, '=ff': ' - (0.0)', '=r': ' - (0.0)', '=x': ' - (0.0)', '=e': ' - (0.0)', '=d': ' - (2.0)', '=c': ' - (4.0/1.0)', '=m': {'A13': {'=g': ' + (2.0)', '=p': ' - (0.0)', '=s': ' - (5.0)'}}, '=i': ' - (0.0)', '=k': ' - (2.0)', '=j': ' - (0.0)', '=aa': {'A2': {'<=41': ' - (3.0)', '>41': ' + (2.0)'}}, '=cc': ' + (2.0/1.0)'}}}}}}, '=dd': ' + (0.0)', '=ff': ' - (1.0)', '=j': ' - (1.0)', '=o': ' + (0.0)', '=n': ' + (0.0)', '=h': ' + (18.0)', '=bb': {'A14': {'<=164': ' + (3.4/0.4)', '>164': ' - (5.6)'}}, '=z': ' + (1.0)'}}, '=l': ' + (0.0)', '=y': {'A13': {'=g': {'A14': {'<=204': ' - (16.0/1.0)', '>204': ' + (5.0/1.0)'}}, '=p': ' - (0.0)', '=s': ' + (2.0)'}}, '=t': ' + (0.0)'}}}}}}, '=f': {'A13': {'=g': ' - (204.0/10.0)', '=p': {'A2': {'<=36': ' - (4.0/1.0)', '>36': ' + (2.0)'}}, '=s': {'A4': {'=u': {'A6': {'=w': ' - (0.0)', '=q': ' - (1.0)', '=ff': ' - (2.0)', '=r': ' - (0.0)', '=x': ' + (1.0)', '=e': ' - (0.0)', '=d': ' - (2.0)', '=c': ' - (3.0)', '=m': ' - (3.0)', '=i': ' - (3.0)', '=k': ' - (4.0)', '=j': ' - (0.0)', '=aa': ' - (0.0)', '=cc': ' - (1.0)'}}, '=l': ' + (1.0)', '=y': ' - (8.0/1.0)', '=t': ' - (0.0)'}}}}}}

After EBP (invented by Quinlan) pruning, the model is:
{'A9': {'=t': {'A15': {'>228': ' + (106.0/3.8)', '<=228': {'A11': {'>3': {'A15': {'>4': {'A15': {'<=5': ' - (2.0/1.0)', '>5': ' + (10.0/2.4)'}}, '<=4': ' + (25.0/1.3)'}}, '<=3': {'A4': {'=u': {'A7': {'=v': {'A14': {'<=110': ' + (18.0/2.5)', '>110': {'A15': {'>8': ' + (4.0/1.2)', '<=8': {'A6': {'=aa': {'A2': {'<=41': ' - (3.0/1.1)', '>41': ' + (2.0/1.0)'}}, '=w': {'A12': {'=t': ' - (2.0/1.0)', '=f': ' + (3.0/1.1)'}}, '=q': {'A12': {'=t': ' + (4.0/1.2)', '=f': ' - (2.0/1.0)'}}, '=ff': ' - (0.0)', '=r': ' - (0.0)', '=i': ' - (0.0)', '=x': ' - (0.0)', '=e': ' - (0.0)', '=d': ' - (2.0/1.0)', '=c': ' - (4.0/2.2)', '=m': {'A13': {'=g': ' + (2.0/1.0)', '=p': ' - (0.0)', '=s': ' - (5.0/1.2)'}}, '=cc': ' + (2.0/1.8)', '=k': ' - (2.0/1.0)', '=j': ' - (0.0)'}}}}}}, '=z': ' + (1.0/0.8)', '=bb': {'A14': {'<=164': ' + (3.4/1.5)', '>164': ' - (5.6/1.2)'}}, '=ff': ' - (1.0/0.8)', '=o': ' + (0.0)', '=n': ' + (0.0)', '=h': ' + (18.0/1.3)', '=dd': ' + (0.0)', '=j': ' - (1.0/0.8)'}}, '=l': ' + (0.0)', '=y': {'A13': {'=g': {'A14': {'<=204': ' - (16.0/2.5)', '>204': ' + (5.0/2.3)'}}, '=p': ' - (0.0)', '=s': ' + (2.0/1.0)'}}, '=t': ' + (0.0)'}}}}}}, '=f': ' - (239.0/19.4)'}}

After MEP pruning, the model is:
model_pruned= {'A9': {'=t': {'A15': {'>228': ' + (106.0/2.0)', '<=228': {'A11': {'>3': {'A15': {'>4': {'A15': {'<=5': ' - (2.0)', '>5': '+ (10/1)'}}, '<=4': ' + (25.0)'}}, '<=3': {'A4': {'=u': {'A7': {'=v': {'A14': {'<=110': ' + (18.0/1.0)', '>110': {'A15': {'>8': ' + (4.0)', '<=8': {'A6': {'=aa': '+ (8/3)', '=w': {'A12': {'=t': ' - (2.0)', '=f': ' + (3.0)'}}, '=q': '+ (12/2)', '=c': ' - (4.0/1.0)', '=r': ' - (0.0)', '=cc': ' + (2.0/1.0)', '=x': ' - (0.0)', '=e': ' - (0.0)', '=d': ' - (2.0)', '=ff': ' - (0.0)', '=m': {'A13': {'=g': ' + (2.0)', '=p': ' - (0.0)', '=s': ' - (5.0)'}}, '=i': ' - (0.0)', '=k': ' - (2.0)', '=j': ' - (0.0)'}}}}}}, '=z': ' + (1.0)', '=bb': '- (9/3)', '=ff': ' - (1.0)', '=o': ' + (0.0)', '=n': ' + (0.0)', '=h': ' + (18.0)', '=dd': ' + (0.0)', '=j': ' - (1.0)'}}, '=l': ' + (0.0)', '=y': {'A13': {'=g': '- (21/5)', '=p': ' - (0.0)', '=s': ' + (2.0)'}}, '=t': ' + (0.0)'}}}}}}, '=f': '- (239/16)'}}


Visualization of the above unpruned model:

[image]

Visualization of the above MEP-pruned model:

[image]
accuracy_unprune= 0.961224489796
accuracy_prune= 0.928571428571

The EBP-pruned result is:
Evaluation on training data (490 items):

 Before Pruning           After Pruning
----------------   ---------------------------
Size      Errors   Size      Errors   Estimate

  90   19( 3.9%)     58   24( 4.9%)    (11.9%)   <<

Let's put the above results in the following table:

| pruning method | unpruned accuracy | pruned accuracy | simplicity (size of the pruned model) |
| --- | --- | --- | --- |
| MEP | 0.961224489796 | 0.928571428571 | 10.5 lines long |
| EBP | 0.961 | 0.951 | 14 lines long |

Note: different editors have different line widths, so the 10.5 lines and 14 lines above are as counted in the CSDN blog Markdown editor.
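An editor-independent alternative is to count the nodes of the nested-dict model directly; a minimal sketch reusing the `model`/`model_pruned` dumps above (leaf format as assumed earlier):

```python
def tree_size(model):
    # Count decision nodes and leaves of a nested-dict C4.5 model.
    if isinstance(model, str):          # a leaf such as ' 4 (66.0/31.0)'
        return 1
    feature = next(iter(model))
    return 1 + sum(tree_size(sub) for sub in model[feature].values())

print(tree_size(model), tree_size(model_pruned))
```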

We can see that:
in terms of simplicity, MEP wins and EBP loses;
in terms of accuracy, EBP wins and MEP loses.

Summary:

MEP aims to simplify your decision trees without losing too much accuracy.
