Python math library: analyzing real data with the CART decision tree algorithm

Latest recommended article published on 2020-12-17 22:31:54

weixin_34290096

Article tags: data structures and algorithms, python, artificial intelligence

Original link: https://my.oschina.net/wangzonghui/blog/1618708

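One push a week to improve myself.

This post wraps up the algorithm built on the math library. The first task is to get the CART algorithm from the previous post running.

Address: https://my.oschina.net/wangzonghui/blog/1618690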

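Step 1: Data analysis

The prepared data is a comma-separated txt file, and the last column of each row is the class.

Look at the data first: decide which fields are useful and which are not, and clean the data before going any further.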
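The cleaning step can be sketched as a small filter over the raw lines. This is only an illustration: the function name clean_lines and the rules (drop blank lines, lines with the wrong number of fields, and lines with an empty field) are assumptions, not the rules used for the real file.

```python
def clean_lines(lines, expected_fields):
    """Keep only lines with exactly expected_fields non-empty, comma-separated fields."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line
        cols = line.split(',')
        if len(cols) != expected_fields:
            continue  # wrong number of columns
        if any(c.strip() == '' for c in cols):
            continue  # a value is missing
        kept.append(line)
    return kept


# Hypothetical sample: three fields per row, the last one being the class.
sample = ["1,2.5,0", "", "3,,1", "4,5.0", "6,7.5,1"]
print(clean_lines(sample, 3))  # ['1,2.5,0', '6,7.5,1']
```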

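Step 2: Reading the data

I did not know Python well when writing the reader and could not find a good library for the job, so I followed my Java habits and wrote one myself.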

def createSet(trainDataFile):
	dataSet=[]
	labels=[]
	try:
		# "with" closes the file even if a row fails to parse
		with open(trainDataFile,'r') as fin:
			for num,line in enumerate(fin):
				cols=line.strip('\n').split(',')
				if num==0:
					# the first line holds the column label names
					labels=cols
				else:
					# cols[0] to cols[3] are unused; convert the rest to their real types
					row =[int(cols[4]),float(cols[5]),float(cols[6]),int(cols[7]),int(cols[8]),float(cols[9]),
						int(cols[10]),float(cols[11]),int(cols[12]),float(cols[13]),int(cols[14]),int(cols[15]),int(cols[16]),int(cols[17]),float(cols[18]),
						int(cols[19]),float(cols[20]),int(cols[21]),int(cols[22]),int(cols[23]),float(cols[24]),int(cols[25]),int(cols[26]),int(cols[27]),float(cols[28]),
						int(cols[29]),int(cols[30]),int(cols[31]),int(cols[32]),float(cols[33]),int(cols[34]),float(cols[35]),float(cols[36]),float(cols[37]),float(cols[38]),
						int(cols[39]),int(cols[40])]
					dataSet.append(row)
	except Exception as e:
		print('Usage xxx.py trainDataFilePath')
		print(e)

	# drop the labels of the four unused leading columns;
	# deleting indexes 0,1,2,3 one at a time would remove the wrong labels
	del labels[0:4]
	return dataSet,labels

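Each value in row is converted to the data type its column actually holds, which makes later processing easier. The first line of my text file holds the column label names.

Note that the label line has one field fewer than the data rows: it lacks the last column, the class.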
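For comparison, the same reading step can be sketched with the standard csv module and a list of converters, one per kept column. The function name read_typed_rows, its parameters, and the tiny sample are hypothetical; the sample mirrors the layout described above, where the header has no name for the class column.

```python
import csv


def read_typed_rows(lines, converters, skip=0):
    """Parse a header line plus typed rows from an iterable of text lines.

    converters: one callable (int, float, ...) per kept column.
    skip: number of leading columns to ignore.
    """
    reader = csv.reader(lines)
    header = next(reader)
    labels = header[skip:]
    rows = []
    for cols in reader:
        if not cols:
            continue  # skip empty lines
        values = cols[skip:]
        rows.append([conv(v) for conv, v in zip(converters, values)])
    return rows, labels


# Hypothetical miniature file: one ignored id column, two features, one class column.
text = "id,count,ratio\n7,3,0.5,1\n8,4,1.25,0\n"
rows, labels = read_typed_rows(text.splitlines(), [int, float, int], skip=1)
print(labels)  # ['count', 'ratio']
print(rows)    # [[3, 0.5, 1], [4, 1.25, 0]]
```

An open file object can be passed as lines as well, since csv.reader accepts any iterable of strings.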

Step 3: Create the decision tree

Generate the data first, then call the CART algorithm to build the decision tree.

# build the training data
dataSet,labels=createSet(url)
# copy the labels: the original label list cannot be reused when testing the tree
labels_tmp=labels[:]
# build the decision tree
desicionTree = createTree(dataSet,labels_tmp)
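createTree comes from the previous post and is not reproduced here. As background only: CART classification trees commonly pick splits by Gini impurity. The sketch below assumes rows keep the class in the last element, as createSet produces them; the names gini and split_gini are illustrative and not taken from the previous post.

```python
from collections import Counter


def gini(dataSet):
    """Gini impurity of rows whose last element is the class label."""
    total = len(dataSet)
    if total == 0:
        return 0.0
    counts = Counter(row[-1] for row in dataSet)
    return 1.0 - sum((n / float(total)) ** 2 for n in counts.values())


def split_gini(dataSet, axis, threshold):
    """Weighted Gini impurity after splitting a non-empty dataSet on row[axis] <= threshold."""
    left = [r for r in dataSet if r[axis] <= threshold]
    right = [r for r in dataSet if r[axis] > threshold]
    total = float(len(dataSet))
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Hypothetical toy rows: two features, then the class.
toy = [[1, 0.2, 0], [2, 0.4, 0], [8, 0.9, 1], [9, 0.7, 1]]
print(gini(toy))              # 0.5
print(split_gini(toy, 0, 2))  # 0.0
```

A lower weighted impurity means a cleaner split; here the threshold 2 on the first feature separates the two classes completely.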

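Step 4: Store the decision tree

Save the generated tree so that it does not have to be rebuilt from the training data next time. When the service is deployed, the stored tree is loaded and used directly.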

# save the decision tree
def storeTree(inputTree,filename):
	import pickle
	# "with" closes the file even if pickling fails
	with open(filename,'wb') as fw:
		pickle.dump(inputTree,fw)

Step 5: Load the stored decision tree

# load the decision tree
def grabTree(filename):
	import pickle
	# "with" closes the file once the tree has been read
	with open(filename,'rb') as fr:
		return pickle.load(fr)
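A quick way to check that storing and loading preserve the tree is a round trip through a temporary file. The nested-dict tree below is a made-up stand-in (the real structure is whatever createTree returns), and store_tree and grab_tree are local helpers for this sketch only.

```python
import os
import pickle
import tempfile


def store_tree(tree, filename):
    """Serialize the tree to a binary file."""
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)


def grab_tree(filename):
    """Load a previously stored tree."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)


# Hypothetical tree, for illustration only.
tree = {'ratio': {'<=0.5': 0, '>0.5': {'count': {'<=3': 0, '>3': 1}}}}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)  # True
```

Since pickle.load can execute code embedded in the file, the stored tree should only be loaded from a location the service controls.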

Step 6: Validate with test data

Use test data to validate the decision tree.

def createTestSet():
	testSet=[[0,0.0,0.0,13,3,15,3,88.89,15,100.0,0,0,0,0,0.0,0,0.0,0,6,0,0.0,0,0,0,0.0,2215,5818,0,0,0.0,0,4.4,4.4,4.4,0.01,14,0]]
	return testSet
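With more than one test row, the check can be summarized as an accuracy figure. The sketch assumes each test row carries its known class as the last element, as the training rows do; accuracy and toy_classifier are hypothetical names, and toy_classifier merely stands in for a call into the real tree.

```python
def accuracy(classify_row, testSet):
    """Fraction of rows whose predicted class equals the last element of the row."""
    if not testSet:
        return 0.0
    hits = 0
    for row in testSet:
        features, expected = row[:-1], row[-1]
        if classify_row(features) == expected:
            hits += 1
    return hits / float(len(testSet))


# Stand-in classifier: predicts class 1 when the first feature exceeds 5.
def toy_classifier(features):
    return 1 if features[0] > 5 else 0


testSet = [[1, 0.2, 0], [9, 0.7, 1], [7, 0.1, 0], [2, 0.9, 0]]
print(accuracy(toy_classifier, testSet))  # 0.75
```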

The complete calling code is as follows:

def main():
	url="F:\\input\\eciMath.txt"
	treeUrl="F:\\input\\tree\\tree-math.txt"
	# dataSet,labels=test(url)

	dataSet,labels=createSet(url)

	# print("start creating the decision tree")
	labels_tmp=labels[:] # copy the labels
	desicionTree = createTree(dataSet,labels_tmp)
	# save the decision tree
	storeTree(desicionTree,treeUrl)
	print(desicionTree)
	print("finished creating the decision tree")
	import showTree as show
	show.createPlot(desicionTree)
	print('desicionTree: %s' % (desicionTree,))

	# reload the stored tree and classify the test data with it
	desicionTree=grabTree(treeUrl)
	testSet=createTestSet()
	print('classifyResult: %s' % (calssifyAll(desicionTree,labels,testSet),))


if __name__ == '__main__':
	main()

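The math-based implementation feels slow: with 10,000 rows of test data, a machine with 8 GB of RAM nearly froze. It is fine for study but not very practical.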
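To put a number on the slowdown, the tree construction can be timed on growing subsets of the data. This is a rough sketch: time_build is a hypothetical helper, and the lambda is a placeholder for a call such as createTree(rows, labels[:]).

```python
from timeit import default_timer


def time_build(build, dataSet, sizes):
    """Return (row_count, seconds) pairs for running build on the first `size` rows."""
    results = []
    for size in sizes:
        subset = dataSet[:size]
        start = default_timer()
        build(subset)
        results.append((len(subset), default_timer() - start))
    return results


# Placeholder workload; swap in the real tree construction to measure it.
rows = [[i % 7, i * 0.5, i % 2] for i in range(1000)]
for count, seconds in time_build(lambda r: sorted(r), rows, [100, 500, 1000]):
    print('%d rows: %.6f s' % (count, seconds))
```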

Reposted from: https://my.oschina.net/wangzonghui/blog/1618708