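Reference code (by a much better programmer than me): https://blog.csdn.net/Snoopy_Yuan/article/details/69223240
Yesterday I couldn't get the tree to draw, which was a bit frustrating. I searched Baidu for a while and still couldn't get it to draw.
Today's exercise is essentially the same thing with information gain swapped for the Gini index; the tree-building logic is otherwise identical.
The original code does have a small bug, which I have already pointed out in a comment at the link above; take a look if you are curious.
Strangely, though, pre-pruning and post-pruning give exactly the same accuracy, so there is probably still a bug somewhere in the program. I'll dig into it later...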
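To make the "swap information gain for the Gini index" step concrete, here is a minimal sketch of the selection criterion. It is not the code from the linked post: the function names (`gini`, `gini_index`, `best_attribute`) and the toy samples are made up for illustration, and it uses plain lists of dicts instead of a pandas DataFrame. The Gini value of a label set is 1 minus the sum of squared class proportions, and the attribute chosen for a split is the one with the smallest weighted Gini value over its branches.

```python
from collections import Counter


def gini(labels):
    """Gini value of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(samples, attr, label_key):
    """Weighted Gini value after splitting `samples` (a list of dicts) on `attr`."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample[attr], []).append(sample[label_key])
    n = len(samples)
    return sum(len(group) / n * gini(group) for group in groups.values())


def best_attribute(samples, attrs, label_key):
    """Pick the attribute with the smallest Gini index (hypothetical helper)."""
    return min(attrs, key=lambda a: gini_index(samples, a, label_key))


# Hypothetical toy data, not the watermelon data set.
samples = [
    {"texture": "clear", "color": "green", "good": "yes"},
    {"texture": "clear", "color": "black", "good": "yes"},
    {"texture": "blurry", "color": "green", "good": "no"},
    {"texture": "blurry", "color": "black", "good": "no"},
]
chosen = best_attribute(samples, ["texture", "color"], "good")
```

In this toy data `texture` separates the classes perfectly (Gini index 0), while `color` leaves every branch evenly mixed (Gini index 0.5), so `texture` is chosen. With information gain the only difference would be the impurity function and picking the maximum instead of the minimum.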
Main program: gini_decision_tree.py
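# https://blog.csdn.net/Snoopy_Yuan/article/details/69223240
import pandas as pd
import decision_tree

# data_file_encode = "gb18030"  # gb18030 covers Chinese and minority-script characters using a variable-length 1/2/4-byte encoding. Using it means passing encoding= to open(), but that raised a "gb18030 cannot decode" error here
# `with open(...)` opens the file and yields a file object; the file is closed when the block exits, even if an error occurs. mode="r" means read-only
with open("/Users/huatong/PycharmProjects/Data/watermelon_33.csv", mode="r") as data_file:
    df = pd.read_csv(data_file)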
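# Take out the training set: iloc selects rows by integer position, drop returns the table that is left after removing those rows
index_train = [0, 1, 2, 5, 6, 9, 13, 14, 15, 16]  # same training samples as on p. 80 of the book
df_train = df.iloc[index_train]
df_test = df.drop(index_train)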
# generate a full tree
root = decision_tree.TreeGenerate(df_train)
# decision_tree.DrawPNG(root, "decision_tree_full.png")  # drawing does not work yet, commented out for now
print("accuracy of full tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))

# pre-pruning
root = decision_tree.PrePurn(df_train, df_test)
# decision_tree.DrawPNG(root, "decision_tree_pre.png")
print("accuracy of pre-pruning tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))

# post-pruning: generate the full tree first, then examine nodes starting from the bottom
root = decision_tree.TreeGenerate(df_train)
decision_tree.PostPurn(root, df_test)
# decision_tree.DrawPNG(root, "decision_tree_post.png")
print("accuracy of post-pruning tree: %.3f" % decision_tree.PredictAccuracy(root, df_test))
# 5-fold cross-validation
accuracy_scores = []
n = len(df.index)
k = 5
m = n // k  # fold size; if k does not divide n, the last n % k rows are never used for testing
for i in range(k):
    test = list(range(i * m, i * m + m))
    df_train = df.drop(test)
    df_test = df.iloc[test]
    root = decision_tree.TreeGenerate(df_train)  # generate the tree
    decision_tree.PostPurn(root, df_test)  # post-pruning
    # test the accuracy
    pred_true = 0
    for idx in df_test.index:  # renamed from i so it no longer shadows the fold counter
        label = decision_tree.Predict(root, df[df.index == idx])
        if label == df_test[df_test.columns[-1]][idx]:
            pred_true += 1
    accuracy = pred_true / len(df_test.index)
    accuracy_scores.append(accuracy)

# print the prediction accuracy result
accuracy_sum = 0
print("accuracy: ", end="")
for i in range(k):
    print("%.3f " % accuracy_scores[i], end="")
    accuracy_sum += accuracy_scores[i]
print("\naverage accuracy: %.3f" % (accuracy_sum / k))
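One thing worth noting about the cross-validation loop above: each fold has `int(n / k)` rows, so when `k` does not divide `n` the trailing rows never end up in a test fold. The sketch below shows one possible way to build contiguous folds that cover every row, by giving the first `n % k` folds one extra sample. `kfold_indices` is a hypothetical helper, not part of `decision_tree.py`, and the 17 rows are an assumption based on the training indices above going up to 16.

```python
def kfold_indices(n, k):
    """Split positions 0..n-1 into k contiguous folds whose sizes differ by at most 1.

    Returns a list of (train_positions, test_positions) pairs.
    """
    base, extra = divmod(n, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds absorb the remainder, one sample each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds


# Assumed data set size: 17 rows.
folds = kfold_indices(17, 5)
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [4, 4, 3, 3, 3]
```

Each `test` list could then be passed to `df.iloc[...]` and `df.drop(...)` the same way the loop above does, provided the DataFrame keeps its default 0..n-1 index.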
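decision_tree.py
# Called by the main program when it runs TreeGenerate; `def` is used to define functions
# Node class, holding: (1) the attribute tested at this node, e.g. "texture = clear?"; (2) the class label of the node, meaningful only for leaf nodes; (3) the attribute values that branch downward, e.g. color: jet black, dark green, pale white
class Node(object):  # new-style class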
def __i