According to "Estimating Probabilities: A Crucial Task in Machine Learning":

$$\Theta=p(C)\,\frac{p(C|V_1)}{p(C)}\,\frac{p(C|V_1V_2)}{p(C|V_1)}\,\frac{p(C|V_1V_2V_3)}{p(C|V_1V_2)}\cdots\quad ①$$

where
$$h(i)=\frac{p(C|V_i)}{p(C)}$$
$$p(C|V_i)=\frac{n(CV_i)}{n(V_i)}$$
If $n(V_i)=0$, then $h(i)=1$.
If $n(V_i)>0$ and $n(CV_i)=0$, then $h(i)=0$,
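The two degenerate cases above can be sketched in a few lines of Python (a minimal illustration; the function and argument names are mine, not from the paper):

```python
def h(n_cv, n_v, p_c):
    """Evidence factor h(i) = p(C|V_i) / p(C), computed from raw counts.

    n_cv: n(C V_i), examples with value V_i that belong to class C
    n_v:  n(V_i),   examples with value V_i
    p_c:  prior probability p(C)
    """
    if n_v == 0:       # no evidence for this value at all -> neutral factor
        return 1.0
    if n_cv == 0:      # value observed, but never together with class C
        return 0.0     # this single zero forces the whole product to zero
    return (n_cv / n_v) / p_c

# One zero factor wipes out Theta, no matter what the other factors are:
theta = 0.3 * h(5, 10, 0.3) * h(0, 4, 0.3) * h(9, 10, 0.3)   # -> 0.0
```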
which makes the estimate unreliable: once $h(i)=0$, then $\Theta=0$ no matter how the other factors in ① vary.
To solve this problem, the Beta distribution is used.
According to "On Estimating Probabilities in Tree Pruning":
There are three stages:
① tree-construction stage: m = 0
② tree-pruning stage: m > 0
③ classification phase: a new, different m
In summary, each stage needs a different value of m.
$$E_s=1-\frac{n_e+p_{ae}\cdot m}{N+m}=\frac{N-n_e+(1-p_{ae})\cdot m}{N+m}$$
$N$: total number of examples that reach the node.
$n_e$: number of examples in the class $c$ that minimises $E_s$ for the given $m$.
$p_{ae}$: a-priori probability of class $c$.
$m$: the parameter of the estimation method.
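With these definitions in place, the static error is a one-liner (a sketch; the function name is my own):

```python
def static_error(N, n_e, p_ae, m):
    """m-estimate of the static error at a node:
    E_s = 1 - (n_e + p_ae * m) / (N + m)
    """
    return 1.0 - (n_e + p_ae * m) / (N + m)

# e.g. a node with 10 examples, 7 in the majority class, prior 0.5, m = 2:
e = static_error(N=10, n_e=7, p_ae=0.5, m=2)   # 1 - 8/12 = 1/3
```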
The backed-up error is:

$$E_b=\sum_{i=1}^{\text{count of all sub-trees}}p_i\cdot E_i$$
$E_i$ refers to the $i$-th sub-tree's static error.
The criterion to prune a tree is:

$$E_b\ge E_s$$
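Putting the static error and the backed-up error together, the pruning decision can be sketched recursively. This is my own illustration, not the repo's actual code: I assume a node is a dict with hypothetical keys `counts` (per-class example counts) and `children` (a list of (weight, subtree) pairs):

```python
def mep_error(node, priors, m=2):
    """Return the node's error after MEP, pruning in place when E_b >= E_s."""
    N = sum(node["counts"].values())
    # static error E_s: take the class that minimises it
    e_s = min(1.0 - (n_c + priors[c] * m) / (N + m)
              for c, n_c in node["counts"].items())
    if not node.get("children"):          # leaf: only the static error exists
        return e_s
    # backed-up error E_b: weighted sum of the sub-trees' errors
    e_b = sum(p_i * mep_error(child, priors, m)
              for p_i, child in node["children"])
    if e_b >= e_s:                        # pruning criterion: E_b >= E_s
        node["children"] = []             # replace the sub-tree by a leaf
        return e_s
    return e_b
```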
More details from the MEP author's replies:
Datasets and a Python implementation of MEP are both included in my GitHub:
https://github.com/appleyuchi/Decision_Tree_Prune
*******************************************************
We use the following settings by default:
m = 2
$p_{ae}$ = a-priori probability of each class
*******************************************************
If you prefer Laplace's Law of Succession, mentioned on page 139 of "On Estimating Probabilities in Tree Pruning", just set:
m = number of classes in the dataset
$$p_{ae}=\frac{1}{m}$$
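Both parameter choices plug into the same m-estimate formula. A quick check (my own illustration, with a hypothetical 3-class dataset) that m = number of classes with $p_{ae}=1/m$ reproduces Laplace's $(n_c+1)/(N+k)$:

```python
def m_estimate(n_c, N, p_ae, m):
    # generic m-estimate of the class probability at a node
    return (n_c + p_ae * m) / (N + m)

k = 3  # number of classes in the dataset (assumed for this example)
# Laplace's Law of Succession as a special case of the m-estimate:
lap = m_estimate(n_c=5, N=10, p_ae=1.0 / k, m=k)
assert abs(lap - (5 + 1) / (10 + k)) < 1e-12   # (n_c + 1) / (N + k)
```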
*******************************************************
Attention:
① MEP was proposed on the basis of ID3, but I decided to implement it on C4.5.
② Although "On Estimating Probabilities in Tree Pruning" says you need to set m = 0 when building your decision model, the original C4.5 model here is created with http://www.rulequest.com/Personal/c4.5r8.tar.gz, so we do NOT need to set m.
③ Although the paper says you need to set m when testing, I did not set m in my testing, because I use the most common testing mechanism of C4.5 instead of the mechanism described in the paper.
Now let's perform our first MEP (Minimum Error Pruning) experiment with the abalone dataset.
First, to make visualization easier, I reorder the dataset by the last column, choose the first 200 items,
and save them as abalone_parts.data (you can find this file in my GitHub).
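This preprocessing step can be reproduced with a short stdlib script (a sketch; I assume the standard UCI abalone layout, where the integer ring count is the last column):

```python
import csv

def make_parts(src="abalone.data", dst="abalone_parts.data", n=200):
    """Sort rows by the last column (ring count) and keep the first n."""
    with open(src) as f:
        rows = [r for r in csv.reader(f) if r]
    rows.sort(key=lambda r: int(r[-1]))        # reorder by the last column
    with open(dst, "w", newline="") as f:
        csv.writer(f).writerows(rows[:n])
    return len(rows[:n])
```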
The C4.5 model before MEP pruning:
model= {‘Viscera’: {’<=0.0145’: {‘Shucked’: {’>0.007’: ’ 4 (66.0/31.0)’, ‘<=0.007’: {‘Shucked’: {’<=0.0045’: {‘Height’: {’<=0.025’: ’ 1 (2.0/1.0)’, ‘>0.025’: ’ 3 (2.0)’}}, ‘>0.0045’: {‘Shucked’: {’>0.005’: {‘Height’: {’<=0.02’: ’ 4 (2.0)’, ‘>0.02’: ’ 3 (4.0)’}}, ‘<=0.005’: ’ 4 (3.0)’}}}}}}, ‘>0.0145’: {‘Shell’: {’<=0.0345’: {‘Viscera’: {’<=0.0285’: ’ 5 (50.0/9.0)’, ‘>0.0285’: ’ 4 (3.0)’}}, ‘>0.0345’: {‘Sex’: {’=M’: ’ 6 (6.0/3.0)’, ‘=F’: ’ 5 (3.0)’, ‘=I’: ’ 5 (59.0/12.0)’}}}}}}
The C4.5 model after MEP pruning:
model_pruned= {‘Viscera’: {’>0.0145’: {‘Shell’: {’<=0.0345’: {‘Viscera’: {’<=0.0285’: ’ 5 (50.0/9.0)’, ‘>0.0285’: ’ 4 (3.0)’}}, ‘>0.0345’: ‘5 (68/16)’}}, ‘<=0.0145’: {‘Shucked’: {’>0.007’: ’ 4 (66.0/31.0)’, ‘<=0.007’: {‘Shucked’: {’>0.0045’: {‘Shucked’: {’>0.005’: {‘Height’: {’<=0.02’: ’ 4 (2.0)’, ‘>0.02’: ’ 3 (4.0)’}}, ‘<=0.005’: ’ 4 (3.0)’}}, ‘<=0.0045’: {‘Height’: {’<=0.025’: ’ 1 (2.0/1.0)’, ‘>0.025’: ’ 3 (2.0)’}}}}}}}}
visualization of unpruned model:
visualization of MEP_pruned model:
The accuracy of the MEP algorithm on the 200 items of the abalone dataset:
accuracy_unprune= 0.72
accuracy_prune= 0.715
The accuracy of the EBP algorithm on the 200 items of the abalone dataset:
Evaluation on training data (200 items):
Before Pruning After Pruning
---------------- ---------------------------
Size Errors Size Errors Estimate
20 56(28.0%) 17 57(28.5%) (36.1%) <<
Let's put the above results in the following table:

pruning method | unpruned accuracy | pruned accuracy
---|---|---
MEP | 72% | 71.5%
EBP | 72% | 71.5%
Now let's do the second MEP experiment with the credit-a dataset.
Credit-a is from UCI (you can also find it via the GitHub link above).
The unpruned C4.5 model generated with http://www.rulequest.com/Personal/c4.5r8.tar.gz is:
model= {‘A9’: {’=t’: {‘A15’: {’>228’: ’ + (106.0/2.0)’, ‘<=228’: {‘A11’: {’>3’: {‘A15’: {’<=4’: ’ + (25.0)’, ‘>4’: {‘A15’: {’<=5’: ’ - (2.0)’, ‘>5’: {‘A7’: {’=v’: ’ + (5.0)’, ‘=bb’: ’ + (1.0)’, ‘=ff’: ’ + (0.0)’, ‘=j’: ’ + (0.0)’, ‘=o’: ’ + (0.0)’, ‘=n’: ’ + (0.0)’, ‘=h’: ’ + (3.0)’, ‘=dd’: ’ + (0.0)’, ‘=z’: ’ - (1.0)’}}}}}}, ‘<=3’: {‘A4’: {’=u’: {‘A7’: {’=v’: {‘A14’: {’<=110’: ’ + (18.0/1.0)’, ‘>110’: {‘A15’: {’>8’: ’ + (4.0)’, ‘<=8’: {‘A6’: {’=w’: {‘A12’: {’=t’: ’ - (2.0)’, ‘=f’: ’ + (3.0)’}}, ‘=q’: {‘A12’: {’=t’: ’ + (4.0)’, ‘=f’: ’ - (2.0)’}}, ‘=ff’: ’ - (0.0)’, ‘=r’: ’ - (0.0)’, ‘=x’: ’ - (0.0)’, ‘=e’: ’ - (0.0)’, ‘=d’: ’ - (2.0)’, ‘=c’: ’ - (4.0/1.0)’, ‘=m’: {‘A13’: {’=g’: ’ + (2.0)’, ‘=p’: ’ - (0.0)’, ‘=s’: ’ - (5.0)’}}, ‘=i’: ’ - (0.0)’, ‘=k’: ’ - (2.0)’, ‘=j’: ’ - (0.0)’, ‘=aa’: {‘A2’: {’<=41’: ’ - (3.0)’, ‘>41’: ’ + (2.0)’}}, ‘=cc’: ’ + (2.0/1.0)’}}}}}}, ‘=dd’: ’ + (0.0)’, ‘=ff’: ’ - (1.0)’, ‘=j’: ’ - (1.0)’, ‘=o’: ’ + (0.0)’, ‘=n’: ’ + (0.0)’, ‘=h’: ’ + (18.0)’, ‘=bb’: {‘A14’: {’<=164’: ’ + (3.4/0.4)’, ‘>164’: ’ - (5.6)’}}, ‘=z’: ’ + (1.0)’}}, ‘=l’: ’ + (0.0)’, ‘=y’: {‘A13’: {’=g’: {‘A14’: {’<=204’: ’ - (16.0/1.0)’, ‘>204’: ’ + (5.0/1.0)’}}, ‘=p’: ’ - (0.0)’, ‘=s’: ’ + (2.0)’}}, ‘=t’: ’ + (0.0)’}}}}}}, ‘=f’: {‘A13’: {’=g’: ’ - (204.0/10.0)’, ‘=p’: {‘A2’: {’<=36’: ’ - (4.0/1.0)’, ‘>36’: ’ + (2.0)’}}, ‘=s’: {‘A4’: {’=u’: {‘A6’: {’=w’: ’ - (0.0)’, ‘=q’: ’ - (1.0)’, ‘=ff’: ’ - (2.0)’, ‘=r’: ’ - (0.0)’, ‘=x’: ’ + (1.0)’, ‘=e’: ’ - (0.0)’, ‘=d’: ’ - (2.0)’, ‘=c’: ’ - (3.0)’, ‘=m’: ’ - (3.0)’, ‘=i’: ’ - (3.0)’, ‘=k’: ’ - (4.0)’, ‘=j’: ’ - (0.0)’, ‘=aa’: ’ - (0.0)’, ‘=cc’: ’ - (1.0)’}}, ‘=l’: ’ + (1.0)’, ‘=y’: ’ - (8.0/1.0)’, ‘=t’: ’ - (0.0)’}}}}}}
After EBP (invented by Quinlan) pruning, the model is:
{‘A9’: {’=t’: {‘A15’: {’>228’: ’ + (106.0/3.8)’, ‘<=228’: {‘A11’: {’>3’: {‘A15’: {’>4’: {‘A15’: {’<=5’: ’ - (2.0/1.0)’, ‘>5’: ’ + (10.0/2.4)’}}, ‘<=4’: ’ + (25.0/1.3)’}}, ‘<=3’: {‘A4’: {’=u’: {‘A7’: {’=v’: {‘A14’: {’<=110’: ’ + (18.0/2.5)’, ‘>110’: {‘A15’: {’>8’: ’ + (4.0/1.2)’, ‘<=8’: {‘A6’: {’=aa’: {‘A2’: {’<=41’: ’ - (3.0/1.1)’, ‘>41’: ’ + (2.0/1.0)’}}, ‘=w’: {‘A12’: {’=t’: ’ - (2.0/1.0)’, ‘=f’: ’ + (3.0/1.1)’}}, ‘=q’: {‘A12’: {’=t’: ’ + (4.0/1.2)’, ‘=f’: ’ - (2.0/1.0)’}}, ‘=ff’: ’ - (0.0)’, ‘=r’: ’ - (0.0)’, ‘=i’: ’ - (0.0)’, ‘=x’: ’ - (0.0)’, ‘=e’: ’ - (0.0)’, ‘=d’: ’ - (2.0/1.0)’, ‘=c’: ’ - (4.0/2.2)’, ‘=m’: {‘A13’: {’=g’: ’ + (2.0/1.0)’, ‘=p’: ’ - (0.0)’, ‘=s’: ’ - (5.0/1.2)’}}, ‘=cc’: ’ + (2.0/1.8)’, ‘=k’: ’ - (2.0/1.0)’, ‘=j’: ’ - (0.0)’}}}}}}, ‘=z’: ’ + (1.0/0.8)’, ‘=bb’: {‘A14’: {’<=164’: ’ + (3.4/1.5)’, ‘>164’: ’ - (5.6/1.2)’}}, ‘=ff’: ’ - (1.0/0.8)’, ‘=o’: ’ + (0.0)’, ‘=n’: ’ + (0.0)’, ‘=h’: ’ + (18.0/1.3)’, ‘=dd’: ’ + (0.0)’, ‘=j’: ’ - (1.0/0.8)’}}, ‘=l’: ’ + (0.0)’, ‘=y’: {‘A13’: {’=g’: {‘A14’: {’<=204’: ’ - (16.0/2.5)’, ‘>204’: ’ + (5.0/2.3)’}}, ‘=p’: ’ - (0.0)’, ‘=s’: ’ + (2.0/1.0)’}}, ‘=t’: ’ + (0.0)’}}}}}}, ‘=f’: ’ - (239.0/19.4)’}}
After MEP pruning, the model is:
model_pruned= {‘A9’: {’=t’: {‘A15’: {’>228’: ’ + (106.0/2.0)’, ‘<=228’: {‘A11’: {’>3’: {‘A15’: {’>4’: {‘A15’: {’<=5’: ’ - (2.0)’, ‘>5’: ‘+ (10/1)’}}, ‘<=4’: ’ + (25.0)’}}, ‘<=3’: {‘A4’: {’=u’: {‘A7’: {’=v’: {‘A14’: {’<=110’: ’ + (18.0/1.0)’, ‘>110’: {‘A15’: {’>8’: ’ + (4.0)’, ‘<=8’: {‘A6’: {’=aa’: ‘+ (8/3)’, ‘=w’: {‘A12’: {’=t’: ’ - (2.0)’, ‘=f’: ’ + (3.0)’}}, ‘=q’: ‘+ (12/2)’, ‘=c’: ’ - (4.0/1.0)’, ‘=r’: ’ - (0.0)’, ‘=cc’: ’ + (2.0/1.0)’, ‘=x’: ’ - (0.0)’, ‘=e’: ’ - (0.0)’, ‘=d’: ’ - (2.0)’, ‘=ff’: ’ - (0.0)’, ‘=m’: {‘A13’: {’=g’: ’ + (2.0)’, ‘=p’: ’ - (0.0)’, ‘=s’: ’ - (5.0)’}}, ‘=i’: ’ - (0.0)’, ‘=k’: ’ - (2.0)’, ‘=j’: ’ - (0.0)’}}}}}}, ‘=z’: ’ + (1.0)’, ‘=bb’: ‘- (9/3)’, ‘=ff’: ’ - (1.0)’, ‘=o’: ’ + (0.0)’, ‘=n’: ’ + (0.0)’, ‘=h’: ’ + (18.0)’, ‘=dd’: ’ + (0.0)’, ‘=j’: ’ - (1.0)’}}, ‘=l’: ’ + (0.0)’, ‘=y’: {‘A13’: {’=g’: ‘- (21/5)’, ‘=p’: ’ - (0.0)’, ‘=s’: ’ + (2.0)’}}, ‘=t’: ’ + (0.0)’}}}}}}, ‘=f’: ‘- (239/16)’}}
Visualization of the above unpruned model:
Visualization of the above MEP-pruned model:
accuracy_unprune= 0.961224489796
accuracy_prune= 0.928571428571
The EBP-pruned result is:
Evaluation on training data (490 items):
Before Pruning After Pruning
---------------- ---------------------------
Size Errors Size Errors Estimate
90 19( 3.9%) 58 24( 4.9%) (11.9%) <<
Let's put the above results in the following table:

pruning method | unpruned accuracy | pruned accuracy | simplicity (model size after pruning)
---|---|---|---
MEP | 0.961224489796 | 0.928571428571 | 10.5 lines
EBP | 0.961 | 0.951 | 14 lines

Note: different editors use different line lengths, so the 10.5 lines and 14 lines were counted in the CSDN blog Markdown editor.
We can see that:
In terms of simplicity, MEP wins and EBP loses.
In terms of accuracy, EBP wins and MEP loses.
Summary:
MEP aims to simplify your decision trees without losing too much accuracy.