After reading *Statistical Learning Methods* (统计学习方法), I tried writing a simple decision tree that selects features by information gain (ID3) or information gain ratio (C4.5). I haven't gotten pruning right: my pruning pass collapses the tree down to just the root and a single leaf, so for now there is only training and prediction, and the tree overfits easily.
I used the contact-lenses dataset; once the data is read into an np.array, training can begin.
young myope no reduced nolenses
young myope no normal soft
young myope yes reduced nolenses
young myope yes normal hard
young hyper no reduced nolenses
young hyper no normal soft
young hyper yes reduced nolenses
young hyper yes normal hard
pre myope no reduced nolenses
pre myope no normal soft
pre myope yes reduced nolenses
pre myope yes normal hard
pre hyper no reduced nolenses
pre hyper no normal soft
pre hyper yes reduced nolenses
pre hyper yes normal nolenses
presbyopic myope no reduced nolenses
presbyopic myope no normal nolenses
presbyopic myope yes reduced nolenses
presbyopic myope yes normal hard
presbyopic hyper no reduced nolenses
presbyopic hyper no normal soft
presbyopic hyper yes reduced nolenses
presbyopic hyper yes normal nolenses
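The whitespace-separated rows above can be parsed into an np.array straightforwardly. As a sketch (a few rows are embedded as a string here for illustration; in practice you would read the same lines from the data file):

```python
import numpy as np

# a few rows of the lenses dataset, embedded as text for illustration
raw = """young myope no reduced nolenses
young myope no normal soft
young myope yes reduced nolenses
young myope yes normal hard"""

data = np.array([line.split() for line in raw.splitlines()])
features = data[:, :4]  # age, prescription, astigmatic, tear rate
classes = data[:, 4]    # lens type
print(features.shape)   # (4, 4)
```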
First come the functions for computing entropy, conditional entropy, and information gain (mutual information).
Entropy:
def getEnt(x):
    # x is a sequence of observed values of a random variable; entropy in nats
    from math import log
    try:
        l = x.tolist()
    except AttributeError:
        l = x
    total = len(l)
    ent = 0
    for i in set(l):
        p = l.count(i) * 1. / total
        ent += -p * log(p)
    return ent
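As a quick sanity check (the function is restated compactly so the snippet runs on its own): a 50/50 split should have entropy ln 2 ≈ 0.693. Note that natural log is used, so the unit is nats rather than bits.

```python
from math import log

def getEnt(x):
    # same natural-log entropy as above
    l = list(x)
    total = len(l)
    return sum(-(l.count(i) / total) * log(l.count(i) / total) for i in set(l))

print(getEnt(['a', 'a', 'b', 'b']))  # ln 2 ~ 0.6931
print(getEnt(['a', 'a', 'a', 'a']))  # 0.0, a constant variable carries no information
```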
Conditional entropy:
def getConEnt(x, y):
    # x is a feature, y is the class label; computes the conditional entropy H(y|x)
    try:
        lx = x.tolist()
        ly = y.tolist()
    except AttributeError:
        lx = x
        ly = y
    l = list(zip(lx, ly))  # materialize: zip() is a one-shot iterator in Python 3
    total = len(l)
    ent = 0
    for i in set(lx):
        p = lx.count(i) * 1. / total
        ey = [v for k, v in l if k == i]
        ent += p * getEnt(ey)
    return ent
Information gain:
def getMutInfo(x, y):
    # x is a feature, y is the class label; information gain of x with respect to y
    return getEnt(y) - getConEnt(x, y)
Information gain ratio:
def getEntGainRatio(x, y):
    # C4.5 criterion: information gain divided by the entropy of the feature itself
    gda = getMutInfo(x, y)
    had = getEnt(x)
    return gda * 1. / had  # note: had == 0 if the feature takes only one value
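A quick check of how gain and gain ratio behave (functions restated compactly so the snippet is standalone): a feature that mirrors the class has gain equal to H(y), an independent feature has gain 0, and a feature with a distinct value on every row also gets the full gain H(y) even though it generalizes terribly. The gain ratio penalizes that last case by dividing by H(x), which is the classic motivation for C4.5 over ID3.

```python
from math import log

def getEnt(x):
    # entropy in nats
    l = list(x)
    n = len(l)
    return sum(-(l.count(v) / n) * log(l.count(v) / n) for v in set(l))

def getConEnt(x, y):
    # conditional entropy H(y|x)
    lx, ly = list(x), list(y)
    n = len(lx)
    return sum((lx.count(v) / n) * getEnt([c for f, c in zip(lx, ly) if f == v])
               for v in set(lx))

def getMutInfo(x, y):
    return getEnt(y) - getConEnt(x, y)

def getEntGainRatio(x, y):
    return getMutInfo(x, y) / getEnt(x)

y      = ['yes', 'yes', 'no', 'no']
mirror = ['a', 'a', 'b', 'b']  # determines the class exactly
noise  = ['a', 'b', 'a', 'b']  # independent of the class
unique = ['a', 'b', 'c', 'd']  # a distinct value on every row

print(getMutInfo(mirror, y))       # ln 2: the full class entropy
print(getMutInfo(noise, y))        # 0.0
print(getMutInfo(unique, y))       # also ln 2: ID3 overrates many-valued features
print(getEntGainRatio(mirror, y))  # 1.0
print(getEntGainRatio(unique, y))  # ~0.5: the ratio divides by H(x) = ln 4
```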
import numpy as np

def trainDecisionTree(features, classes):
    tree = {}
    f_dim = features.shape[1]
    if f_dim == 0:
        # no features left but the classes are still mixed: majority vote
        vals, counts = np.unique(classes, return_counts=True)
        return vals[np.argmax(counts)]
    # pick the feature with the largest information gain
    emax = None
    maxindex = 0
    for n in range(f_dim):
        e = getMutInfo(features.T[n], classes)
        if emax is None or e > emax:
            emax = e
            maxindex = n
    tree.setdefault(maxindex, {})
    # count the classes under each possible value of the chosen feature
    di = {}
    cls_count = {}
    for i in set(classes):
        cls_count.setdefault(i, 0)
    for k, v in zip(features.T[maxindex], classes):
        di.setdefault(k, cls_count.copy())
        di[k][v] += 1
    # for each value of the feature: make a leaf, or keep splitting
    for i in di.keys():
        flag = 0
        cls = None
        for c in di[i].keys():
            if di[i][c] == sum(di[i].values()):
                flag += 1
                cls = c
        if flag == 1:
            # the subset contains only one class: make a leaf
            tree[maxindex].setdefault(i, cls)
        else:
            subset = np.delete(features[features.T[maxindex] == i], maxindex, axis=1)
            subcls = classes[features.T[maxindex] == i]
            # still mixed: keep splitting
            tree[maxindex].setdefault(i, trainDecisionTree(subset, subcls))
    return tree
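Putting the pieces together on a toy dataset (the functions are restated compactly so the snippet runs on its own; the pure-subset test is simplified to checking whether only one class remains):

```python
import numpy as np
from math import log

def getEnt(x):
    # entropy in nats
    l = list(x)
    n = len(l)
    return sum(-(l.count(v) / n) * log(l.count(v) / n) for v in set(l))

def getConEnt(x, y):
    # conditional entropy H(y|x)
    lx, ly = list(x), list(y)
    n = len(lx)
    return sum((lx.count(v) / n) * getEnt([c for f, c in zip(lx, ly) if f == v])
               for v in set(lx))

def getMutInfo(x, y):
    return getEnt(y) - getConEnt(x, y)

def trainDecisionTree(features, classes):
    # choose the column with the largest information gain
    gains = [getMutInfo(features.T[i], classes) for i in range(features.shape[1])]
    maxindex = int(np.argmax(gains))
    tree = {maxindex: {}}
    for value in set(features.T[maxindex].tolist()):
        mask = features.T[maxindex] == value
        subcls = classes[mask]
        if len(set(subcls)) == 1:
            # pure subset: store a leaf label
            tree[maxindex][value] = str(subcls[0])
        else:
            # drop the used column and keep splitting
            subset = np.delete(features[mask], maxindex, axis=1)
            tree[maxindex][value] = trainDecisionTree(subset, subcls)
    return tree

# toy data: column 0 determines the class, column 1 is noise
features = np.array([['sunny', 'hot'], ['sunny', 'mild'],
                     ['rain', 'hot'], ['rain', 'mild']])
classes = np.array(['no', 'no', 'yes', 'yes'])
tree = trainDecisionTree(features, classes)
print(tree)  # {0: {'sunny': 'no', 'rain': 'yes'}} (value order may vary)
```

Note that the keys of a subtree index into the reduced subset, since the chosen column is deleted before the recursive call; prediction therefore has to drop the same column while descending.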
In the trained decision tree, numbers like 3, 2, 1 are column indices: the input dataset carries no feature names, so columns are identified by position. Indices inside a subtree refer to the reduced subset, because the chosen column is deleted before each recursive call.
Prediction and evaluation
Predicting a single row:
def predictOnce(data, tree):
    for branch in tree.keys():  # every branch under this node
        node = tree[branch]
        # an internal node is a dict, a leaf is a plain label
        if isinstance(node, dict):  # internal node
            if data[branch] in node.keys():
                child = node[data[branch]]
                if isinstance(child, dict):
                    # the subtree was trained with this column deleted,
                    # so drop it from the row before descending
                    return predictOnce(np.delete(data, branch), child)
                else:
                    return child
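Here is the traversal on a hand-written tree in the same {column: {value: leaf-or-subtree}} format (the tree below is made up for illustration, not the actual trained result). Because subtrees are trained on column-deleted subsets, the row has the used column removed before descending:

```python
import numpy as np

def predictOnce(data, tree):
    # walk the {column: {value: leaf-or-subtree}} structure
    for branch in tree:
        node = tree[branch]
        if isinstance(node, dict) and data[branch] in node:
            child = node[data[branch]]
            if isinstance(child, dict):
                # subtree indices refer to the subset with this column removed
                return predictOnce(np.delete(data, branch), child)
            return child
    return None  # unseen feature value: no prediction

# hand-written tree for illustration: column 3 is tear rate, column 2 astigmatic
mytree = {3: {'reduced': 'nolenses',
              'normal': {2: {'yes': 'hard', 'no': 'soft'}}}}

row = np.array(['young', 'myope', 'yes', 'normal'])
print(predictOnce(row, mytree))  # hard
```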
Predicting a whole dataset:
def predictSet(test_data, mytree):
    result = []
    for i in test_data:
        result.append(predictOnce(i, mytree))
    return np.array(result)
The evaluation function, accuracy:
def validation(test, real):
    if len(test) != len(real):
        print("Length error!")
        return
    good = 0.
    for i in range(len(test)):
        if test[i] == real[i]:
            good += 1
    return good / len(test)
# predict (test_data here is just the training features again)
result = predictSet(test_data, mytree)
# evaluate: measured on the training set itself, so this overstates accuracy (overfitting)
validation(result, classes)