Decision trees are best suited to analyzing discrete data.
Continuous data should first be discretized before analysis.
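As a minimal sketch of that discretization step (the values and the choice of 3 equal-width bins are made up for illustration):

```python
# Minimal sketch: equal-width binning of a continuous feature into 3 categories
ages = [18, 22, 25, 31, 38, 44, 52, 60]
lo, hi, k = min(ages), max(ages), 3
width = (hi - lo) / k
# bin index 0..k-1; clamp so the maximum value falls into the last bin
bins = [min(int((a - lo) // width), k - 1) for a in ages]
print(bins)  # [0, 0, 0, 0, 1, 1, 2, 2]
```

scikit-learn also ships a ready-made transformer for this (`sklearn.preprocessing.KBinsDiscretizer`) if you prefer not to bin by hand.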
The concept of entropy
The amount of information in a message is directly related to its uncertainty: to pin down something highly uncertain, or something we know nothing about, we need a large amount of information. The measure of information is therefore the amount of uncertainty.
The larger the entropy, the greater the uncertainty; the smaller the entropy, the smaller the uncertainty.
Computing information entropy
The information-entropy formula: H(X) = -Σx p(x) · log2 p(x)
Suppose we have an ordinary die A: each face 1-6 comes up with probability 1/6.
Die B comes up 6 with probability 50%, and each of 1-5 with probability 10%.
Die C comes up 6 with probability 100%.
Here p(x) is the probability of outcome x; summing -p(x) · log2 p(x) over all outcomes gives the entropy, so each outcome's information content is weighted by how often it occurs.
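The three dice above can be checked directly with the formula:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Die A: fair, each face with probability 1/6
print(entropy([1/6] * 6))          # ~2.585 bits (most uncertain)
# Die B: 6 with probability 0.5, faces 1-5 with 0.1 each
print(entropy([0.5] + [0.1] * 5))  # ~2.161 bits
# Die C: always 6
print(entropy([1.0]))              # 0.0 bits (no uncertainty at all)
```

As expected, the fair die A has the largest entropy and the deterministic die C has zero.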
The ID3 algorithm
The decision tree splits a node on the attribute that maximizes information gain. Information gain is computed as Gain(A) = H(D) - H(D|A): the entropy of the dataset minus the conditional entropy after partitioning on attribute A.
Information gain: choose the attribute with the largest value.
Example
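A sketch of the computation, using the `age` attribute and the class column of the 14-row buys_computer dataset listed later in these notes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c/n * math.log2(c/n) for c in Counter(labels).values())

# class_buys_computer column (rows 1-14 of the example dataset)
labels = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
# age column of the same rows
ages = ['youth', 'youth', 'middle_age', 'senior', 'senior', 'senior', 'middle_age',
        'youth', 'youth', 'senior', 'youth', 'middle_age', 'middle_age', 'senior']

def info_gain(attr, labels):
    """Gain(A) = H(D) - H(D|A) for one attribute column."""
    n = len(labels)
    cond = 0.0
    for v in set(attr):
        subset = [l for a, l in zip(attr, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

print(round(entropy(labels), 3))          # H(D)      ~ 0.940 bits
print(round(info_gain(ages, labels), 3))  # Gain(age) = 0.247 bits
```

ID3 would compute this gain for every attribute (age, income, student, credit_rati) and split on the largest.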
The C4.5 algorithm
The information-gain criterion tends to favor attributes with many distinct values.
C4.5 improves on information gain with the gain ratio: GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo(A) = -Σv (|Dv|/|D|) · log2(|Dv|/|D|) penalizes attributes that split the data into many small partitions.
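A self-contained sketch of the gain-ratio computation for the `age` attribute of the 14-row example dataset (the per-value class counts are read off the table below: youth has 2 yes / 3 no, middle_age 4 / 0, senior 3 / 2):

```python
import math

def H(counts):
    """Entropy (bits) from a list of class counts."""
    n = sum(counts)
    return -sum(c/n * math.log2(c/n) for c in counts if c)

n = 14
partitions = [(2, 3), (4, 0), (3, 2)]   # (yes, no) for youth / middle_age / senior
gain = H([9, 5]) - sum(sum(p)/n * H(list(p)) for p in partitions)
split_info = H([sum(p) for p in partitions])  # -sum(|Dv|/|D| * log2(|Dv|/|D|))
print(round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
# 0.247 1.577 0.156
```

The split information (1.577 bits) divides down the raw gain, which is what keeps C4.5 from always preferring many-valued attributes.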
Decision-tree prediction
Example data:
RID,age,income,student,credit_rati,class_buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_age,high,no,fair,yes
4,senior,medium,no,fair,yes
5,senior,low,yes,fair,yes
6,senior,low,yes,excellent,no
7,middle_age,low,yes,excellent,yes
8,youth,medium,no,fair,no
9,youth,low,yes,fair,yes
10,senior,medium,yes,fair,yes
11,youth,medium,yes,fair,yes
12,middle_age,medium,no,excellent,yes
13,middle_age,high,yes,fair,yes
14,senior,medium,no,excellent,no
Example code:
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree
from sklearn import preprocessing
import csv
# Read the data
reader = csv.reader(open(r'datapre.csv', 'r'))
# Read the header row
headers = reader.__next__()
# Two lists: one for the feature dicts, one for the labels
featureList = []  # training data
labelList = []    # labels, i.e. the outcome column
for row in reader:
    # store the label
    labelList.append(row[-1])
    rowDict = {}
    for i in range(1, len(row) - 1):
        # build a dict mapping attribute name -> value
        rowDict[headers[i]] = row[i]
    # store the dict
    featureList.append(rowDict)
print(labelList)
# ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
print(featureList)
# [{'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rati': 'fair'},
#  {'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rati': 'excellent'},
#  {'age': 'middle_age', 'income': 'high', 'student': 'no', 'credit_rati': 'fair'},
#  {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rati': 'fair'},
#  {'age': 'senior', 'income': 'low', 'student': 'yes', 'credit_rati': 'fair'},
#  {'age': 'senior', 'income': 'low', 'student': 'yes', 'credit_rati': 'excellent'},
#  {'age': 'middle_age', 'income': 'low', 'student': 'yes', 'credit_rati': 'excellent'},
#  {'age': 'youth', 'income': 'medium', 'student': 'no', 'credit_rati': 'fair'},
#  {'age': 'youth', 'income': 'low', 'student': 'yes', 'credit_rati': 'fair'},
#  {'age': 'senior', 'income': 'medium', 'student': 'yes', 'credit_rati': 'fair'},
#  {'age': 'youth', 'income': 'medium', 'student': 'yes', 'credit_rati': 'fair'},
#  {'age': 'middle_age', 'income': 'medium', 'student': 'no', 'credit_rati': 'excellent'},
#  {'age': 'middle_age', 'income': 'high', 'student': 'yes', 'credit_rati': 'fair'},
#  {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rati': 'excellent'}]
# Convert the feature dicts to a 0/1 (one-hot) representation
vec = DictVectorizer()  # turns string-valued dicts into numeric vectors
x_data = vec.fit_transform(featureList).toarray()
print('x_data:' + str(x_data))
# x_data:[[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
#  [0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
#  [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
#  [0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
#  [0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
#  [0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
#  [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
#  [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
#  [0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
#  [0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
#  [0. 0. 1. 0. 1. 0. 0. 1. 0. 1.]
#  [1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
#  [1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
#  [0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
#########################################
# Note how the 0/1 encoding works: the row
# [0. 0. 1. 0. 1. 1. 0. 0. 1. 0.] lines up position-by-position with
# ['age=middle_age', 'age=senior', 'age=youth', 'credit_rati=excellent', 'credit_rati=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
# so for the sample youth,high,no,fair: a feature that holds is 1, otherwise 0.
#####################

# Print the feature names
print(vec.get_feature_names_out())  # use vec.get_feature_names() on scikit-learn < 1.0
# ['age=middle_age', 'age=senior', 'age=youth', 'credit_rati=excellent', 'credit_rati=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
print('labelList:' + str(labelList))
# labelList:['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

# Convert the labels to a 0/1 representation
lb = preprocessing.LabelBinarizer()
y_data = lb.fit_transform(labelList)
print('y_data:' + str(y_data))
# y_data:[[0]
#  [0]
#  [1]
#  [1]
#  [1]
#  [0]
#  [1]
#  [0]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [0]]
# Build the decision-tree model: DecisionTreeClassifier with criterion='entropy'
# (i.e. use entropy / information gain to choose splits)
model = tree.DecisionTreeClassifier(criterion='entropy')
# Fit the model on the data
model.fit(x_data, y_data)
# Test: predict on the first training sample
x_test = x_data[0]
print('x_test:' + str(x_test))
predict = model.predict(x_test.reshape(1, -1))
print('predict:' + str(predict))
# Export the decision tree as a graph (i.e. draw it)
import graphviz  # http://www.graphviz.org
dot_data = tree.export_graphviz(model,
                                out_file=None,
                                feature_names=vec.get_feature_names_out(),  # get_feature_names() on scikit-learn < 1.0
                                class_names=lb.classes_,
                                filled=True,
                                rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('computer')