New learning, new confusion: sklearn, a validation journey

Contents

1. A complete sample set

2. sklearn


1. A complete sample set

The decision tree as designed:

If a machine-learning algorithm is run on the corresponding data, will it reproduce this hand-designed tree?

Number of branches at each node:

A, B, C, D, E = (3, 4, 3, 2, 2)

Total number of possible samples: 3 × 4 × 3 × 2 × 2 = 144

Once a decision path is uniquely determined, the features not on that path may take any value, so the number of samples matching each path can be counted:

So the effective sample count is 112.

The remaining 32 samples (144 − 112) do not fall on any decision path.
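The counting argument can be sketched in a few lines of Python. The branch counts are the ones given above; the example path fixing C, A and D is an illustration (it matches the path walked later in the article):

```python
# Branch counts per feature, as stated above: A, B, C, D, E = (3, 4, 3, 2, 2)
branches = {'A': 3, 'B': 4, 'C': 3, 'D': 2, 'E': 2}

# Total number of distinct feature combinations
total = 1
for n in branches.values():
    total *= n
print(total)  # 3*4*3*2*2 = 144

# A decision path that fixes C, A and D leaves B and E free to vary,
# so that single path is matched by 4 * 2 = 8 of the 144 combinations.
covered = 1
for feature, n in branches.items():
    if feature not in {'C', 'A', 'D'}:
        covered *= n
print(covered)  # 8
```

Summing this count over all the designed paths is what yields the 112 effective samples.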

The full sample set is listed below (dataprofull.csv):

Feature A,Feature B,Feature C,Feature D,Feature E,Result RES
A,B,C,D,E,RES
1,5,10,12,3,no
1,8,10,12,4,yes
0,8,9,13,4,yes
2,6,11,12,3,yes
0,6,9,13,3,no
1,5,10,13,3,no
2,8,9,12,3,yes
2,8,9,13,4,no
1,6,11,13,4,no
1,7,9,12,3,yes
1,7,11,12,4,no
1,6,9,12,4,yes
2,8,10,13,3,no
2,5,10,12,3,yes
0,8,10,12,4,yes
2,8,10,12,3,yes
1,5,9,12,3,yes
1,8,10,13,3,yes
1,5,11,12,3,no
2,7,11,12,4,yes
1,8,11,12,4,no
0,5,10,12,4,yes
0,6,10,12,4,yes
2,8,11,12,4,yes
0,7,10,13,3,yes
0,7,9,12,4,yes
2,7,10,12,4,yes
0,7,9,13,4,yes
1,6,11,13,3,no
2,6,10,12,4,yes
1,8,10,12,3,yes
0,8,10,13,4,yes
2,6,9,13,4,yes
0,7,9,12,3,yes
2,6,9,12,3,yes
1,6,11,12,3,no
2,6,11,13,4,yes
2,6,9,12,4,yes
2,8,9,13,3,no
1,5,9,13,3,yes
0,5,10,13,4,yes
2,6,9,13,3,yes
0,6,10,13,3,no
2,5,10,12,4,yes
1,7,11,13,4,no
0,7,11,13,4,yes
1,7,9,13,4,yes
1,5,10,13,4,no
0,5,11,13,4,yes
1,8,9,13,3,yes
1,5,11,13,3,no
0,5,9,13,4,yes
0,6,11,12,4,yes
2,7,11,12,3,yes
1,8,9,12,3,yes
0,7,10,12,3,yes
0,7,10,13,4,yes
0,6,10,12,3,no
1,5,11,13,4,no
1,6,9,13,3,yes
2,8,10,12,4,yes
2,5,9,12,3,yes
1,5,10,12,4,no
0,6,9,13,4,yes
0,7,10,12,4,yes
1,7,11,13,3,no
0,8,9,12,4,yes
1,8,10,13,4,yes
2,5,9,12,4,yes
0,7,11,12,4,yes
0,7,11,13,3,yes
0,8,11,13,4,yes
2,8,11,13,3,no
1,5,11,12,4,no
1,8,11,13,4,no
2,8,9,12,4,yes
2,7,10,12,3,yes
1,7,9,13,3,yes
0,6,9,12,3,no
1,7,9,12,4,yes
1,6,9,12,3,yes
2,6,11,13,3,yes
2,7,9,12,3,yes
0,7,9,13,3,yes
1,8,11,13,3,no
1,6,11,12,4,no
1,8,9,12,4,yes
2,5,11,12,4,yes
0,6,11,13,3,no
1,6,9,13,4,yes
2,8,11,12,3,yes
0,6,9,12,4,yes
2,6,10,13,3,yes
0,8,11,12,4,yes
1,7,11,12,3,no
0,6,11,13,4,yes
2,8,11,13,4,no
1,5,9,12,4,yes
1,8,11,12,3,no
2,7,9,12,4,yes
0,5,9,12,4,yes
0,7,11,12,3,yes
2,6,10,13,4,yes
2,5,11,12,3,yes
0,6,11,12,3,no
2,6,10,12,3,yes
1,5,9,13,4,yes
0,6,10,13,4,yes
1,8,9,13,4,yes
2,8,10,13,4,no
0,5,11,12,4,yes
2,6,11,12,4,yes

The ID3 algorithm I wrote myself, and the decision tree it generated:

Is it correct? A quick check will tell:

Take any record from the samples, for example the last row, "2,6,11,12,4,yes". The values correspond in order to features A, B, C, D, E and the result RES, i.e. A=2, B=6, C=11, D=12, E=4, RES='yes'. Walking the tree, C=11, then A=2, then D=12 reaches the decision node 'yes', which agrees with RES='yes' in the record, so the generated tree handles this record correctly. The other records can be checked the same way, and each reaches the matching result. As for B=6 and E=4, they are redundant for this particular path and could take any value.

Looking back at the originally designed tree, "2,6,11,12,4,yes" belongs to the decision path d10 = (A=2, D=12, 'yes'). Here one can see both the agreement between the internal and external rules, and the difference between them.

The decision code generated from the tree:

def Decision(**d):
    """Walk the generated decision tree; feature values are passed as strings."""
    if d['C'] == '9':
        if d['B'] == '8':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                if d['D'] == '13':
                    return 'no'
                elif d['D'] == '12':
                    return 'yes'
        elif d['B'] == '6':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                if d['E'] == '3':
                    return 'no'
                elif d['E'] == '4':
                    return 'yes'
            elif d['A'] == '2':
                return 'yes'
        elif d['B'] == '5':
            return 'yes'
        elif d['B'] == '7':
            return 'yes'
    elif d['C'] == '11':
        if d['A'] == '1':
            return 'no'
        elif d['A'] == '0':
            if d['E'] == '3':
                if d['B'] == '6':
                    return 'no'
                elif d['B'] == '7':
                    return 'yes'
            elif d['E'] == '4':
                return 'yes'
        elif d['A'] == '2':
            if d['D'] == '13':
                if d['B'] == '8':
                    return 'no'
                elif d['B'] == '6':
                    return 'yes'
            elif d['D'] == '12':
                return 'yes'
    elif d['C'] == '10':
        if d['B'] == '8':
            if d['A'] == '1':
                return 'yes'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                if d['D'] == '13':
                    return 'no'
                elif d['D'] == '12':
                    return 'yes'
        elif d['B'] == '6':
            if d['E'] == '3':
                if d['A'] == '0':
                    return 'no'
                elif d['A'] == '2':
                    return 'yes'
            elif d['E'] == '4':
                return 'yes'
        elif d['B'] == '5':
            if d['A'] == '1':
                return 'no'
            elif d['A'] == '0':
                return 'yes'
            elif d['A'] == '2':
                return 'yes'
        elif d['B'] == '7':
            return 'yes'
    return None  # the record does not fall on any learned path
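To make the tree walk explicit, here is a small sketch that encodes just the C == '11' subtree of the code above as nested dicts and replays the check on the record "2,6,11,12,4,yes". The dict layout (`feature`/`branches` keys, the `walk` helper) is my own illustration, not part of the original program:

```python
def walk(node, record):
    """Follow nested dicts until a leaf string ('yes'/'no') is reached."""
    while isinstance(node, dict):
        feature = node['feature']
        node = node['branches'][record[feature]]
    return node

# Hand transcription of the C == '11' subtree from Decision() above.
c11 = {'feature': 'A', 'branches': {
    '1': 'no',
    '0': {'feature': 'E', 'branches': {
        '3': {'feature': 'B', 'branches': {'6': 'no', '7': 'yes'}},
        '4': 'yes'}},
    '2': {'feature': 'D', 'branches': {
        '13': {'feature': 'B', 'branches': {'8': 'no', '6': 'yes'}},
        '12': 'yes'}},
}}

record = {'A': '2', 'B': '6', 'C': '11', 'D': '12', 'E': '4'}
print(walk(c11, record))  # 'yes' -- matches RES in the last CSV row
```

Note that B=6 and E=4 are never consulted on this path, which is exactly the "redundant information" observation made earlier.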

2. sklearn

Below, sklearn is used to build a decision tree from dataprofull.csv (I have only just started with sklearn and cannot yet use it fluently, so I borrowed someone's published code and patched it up):

import numpy as np
from sklearn import tree
# sklearn.cross_validation was removed in modern sklearn; use model_selection instead
from sklearn.model_selection import train_test_split

def main():
    data = []
    label = []
    with open("dataprofull.csv") as ifile:
        ifile.readline()  # skip the two header lines
        ifile.readline()
        for line in ifile:
            tmp = line.strip('\n').split(',')
            if len(tmp) < 6:
                continue  # ignore blank trailing lines
            # features A..E as floats; RES encoded as 1 ('yes') / 0 ('no')
            data.append([float(v) for v in tmp[:-1]])
            label.append(1.0 if tmp[-1] == 'yes' else 0.0)

    x = np.array(data)  # RES is kept out of x, otherwise the label leaks into the features
    y = np.array(label)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=6)
    clf.fit(x_train, y_train)

    # export_graphviz writes into the handle when out_file is given;
    # render afterwards with: dot -Tpng tree.dot -o tree.png
    with open("tree.dot", 'w') as f:
        tree.export_graphviz(clf, out_file=f, filled=True)

if __name__ == '__main__':
    main()

The tree generated by sklearn is shown below:

For now, I cannot make sense of this tree. It remains to be improved after further study.
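One reason the picture looks unfamiliar: sklearn's DecisionTreeClassifier always builds a *binary* tree, splitting numeric features at thresholds (e.g. "C <= 10.5") rather than one branch per feature value as in the hand-built multiway tree. A minimal sketch on toy data illustrates this; the four rows are copied from dataprofull.csv, everything else is illustrative:

```python
from sklearn import tree

# Four rows from dataprofull.csv: features [A, B, C, D, E], label 1 = 'yes', 0 = 'no'
X = [[1, 5, 10, 12, 3],
     [1, 8, 10, 12, 4],
     [0, 8, 9, 13, 4],
     [1, 5, 10, 13, 3]]
y = [0, 1, 1, 0]

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# export_text prints the binary threshold splits sklearn actually learned
print(tree.export_text(clf, feature_names=['A', 'B', 'C', 'D', 'E']))
```

With consistent labels and no depth limit, the tree fits these rows exactly; on the full 112-row file the same threshold-based binary structure appears, which is likely why the plotted tree does not visually resemble the multiway design even where the decision rules agree.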
