A decision tree is a supervised classification model. Taking binary classification as our example, it learns a pattern from complex discrete-valued data.
We use the watermelon dataset from the 西瓜书 (Zhou Zhihua, *Machine Learning*):
ID,color,root,knock,texture,navel,touch,density,sugar,good
1,green,curled,dull,clear,sunken,hard_smooth,0.697,0.460,yes
2,dark,curled,muffled,clear,sunken,hard_smooth,0.774,0.376,yes
3,dark,curled,dull,clear,sunken,hard_smooth,0.634,0.264,yes
4,green,curled,muffled,clear,sunken,hard_smooth,0.608,0.318,yes
5,light,curled,dull,clear,sunken,hard_smooth,0.556,0.215,yes
6,green,slightly_curled,dull,clear,slightly_sunken,soft_sticky,0.403,0.237,yes
7,dark,slightly_curled,dull,slightly_blurry,slightly_sunken,soft_sticky,0.481,0.149,yes
8,dark,slightly_curled,dull,clear,slightly_sunken,hard_smooth,0.437,0.211,yes
9,dark,slightly_curled,muffled,slightly_blurry,slightly_sunken,hard_smooth,0.666,0.091,no
10,green,stiff,crisp,clear,flat,soft_sticky,0.243,0.267,no
11,light,stiff,crisp,blurry,flat,hard_smooth,0.245,0.057,no
12,light,curled,dull,blurry,flat,soft_sticky,0.343,0.099,no
13,green,slightly_curled,dull,slightly_blurry,sunken,hard_smooth,0.639,0.161,no
14,light,slightly_curled,muffled,slightly_blurry,sunken,hard_smooth,0.657,0.198,no
15,dark,slightly_curled,dull,clear,slightly_sunken,soft_sticky,0.360,0.370,no
16,light,curled,dull,blurry,flat,hard_smooth,0.593,0.042,no
17,green,curled,muffled,slightly_blurry,slightly_sunken,hard_smooth,0.719,0.103,no
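The dataset above can be loaded into a pandas DataFrame. A minimal sketch — the English column names are my translation of the original Chinese headers, and only the first few rows are inlined here for brevity:

```python
import io
import pandas as pd

# First rows of the watermelon dataset, inlined as CSV text for illustration.
csv_text = """ID,color,root,knock,texture,navel,touch,density,sugar,good
1,green,curled,dull,clear,sunken,hard_smooth,0.697,0.460,yes
2,dark,curled,muffled,clear,sunken,hard_smooth,0.774,0.376,yes
9,dark,slightly_curled,muffled,slightly_blurry,slightly_sunken,hard_smooth,0.666,0.091,no
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="ID")
print(df["good"].value_counts())  # class counts for the label column
```

In practice the full 17-row table would be read from a file with `pd.read_csv(path, index_col="ID")`.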
Information entropy:
The information entropy of a dataset D is defined as

    Ent(D) = - Σ_{k=1}^{|Y|} p_k · log2(p_k)

where p_k is the proportion of samples belonging to class k.
Taking our data as an example: for the label column ("good melon"), y takes only two values, yes (1) and no (0), with p1 = 8/17 and p0 = 9/17, so the entropy of this "attribute" is

    Ent(D) = -(8/17)·log2(8/17) - (9/17)·log2(9/17) ≈ 0.998.
To compute information gain, however, we also need the entropy of each subset a feature induces: a feature may take many values, and each value's subset gets its own entropy. Take the knock sound (敲声) as an example:
The knock feature takes three values: dull (D1), muffled (D2), and crisp (D3).
For D1, the samples whose knock value is "dull", 6 of 10 fall in the positive class: p1 = 6/10, p0 = 4/10.
Its entropy is therefore -(6/10)·log2(6/10) - (4/10)·log2(4/10) ≈ 0.971.
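Both numbers are easy to verify from the class counts alone. A small check, using a helper function of my own:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(round(entropy([8, 9]), 3))   # Ent(D): 8 positive / 9 negative -> 0.998
print(round(entropy([6, 4]), 3))   # Ent(D1) for the 'dull' subset   -> 0.971
```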
# Input: a pandas DataFrame of the form [fea1, fea2, ..., feaN, label]
# Output: the information entropy of the label column
# (requires `import math` at module level)
def calcutaleEnt(self, data_set):
    num_samples = len(data_set)
    ent = 0.0
    # value_counts() gives the number of samples per label value;
    # accumulate -p * log2(p) over those proportions.
    for count in data_set[self.lable].value_counts():
        prob = count / num_samples
        ent -= prob * math.log(prob, 2)
    return ent
A tree has different branch nodes, and each branch — one value of the feature being split on — contains a different number of samples. To give branches with more samples more weight when scoring a feature, and thus to choose better splits, we introduce information gain.
Information gain

    Gain(D, a) = Ent(D) - Σ_{v=1}^{V} (|D^v| / |D|) · Ent(D^v)

Here Ent(D) is the original entropy, i.e. that of the full dataset's label column, and D^v is the subset where feature a takes its v-th value; for the knock feature above, D^v ranges over dull (D1), muffled (D2), and crisp (D3).
After computing the information gain of every feature, we can pick the best feature to split on: the one with the largest information gain.
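As a concrete check, the gain of the knock feature follows from the subset class counts in the table (D1 dull: 6 yes / 4 no, D2 muffled: 2 yes / 3 no, D3 crisp: 0 yes / 2 no):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# (yes, no) counts per knock value: dull (D1), muffled (D2), crisp (D3).
subsets = [(6, 4), (2, 3), (0, 2)]
n = sum(sum(s) for s in subsets)     # 17 samples in total

ent_d = entropy([8, 9])              # entropy of the full label column
weighted = sum(sum(s) / n * entropy(s) for s in subsets)
gain = ent_d - weighted
print(round(gain, 3))                # 0.141
```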
def inforGain(self, data_set):
    base_ent = self.calcutaleEnt(data_set)
    best_feature = ''
    best_inforgain = 0.0
    num_samples = len(data_set)
    feature_list = list(data_set.columns)
    feature_list.remove(self.lable)
    for feature in feature_list:
        fea_val_list = set(data_set[feature])
        feature_ent = 0.0
        for fea_val in fea_val_list:
            # Subset D^v: the samples where this feature takes this value.
            extract_data = self.splitData(data_set, feature, fea_val)
            # Weight the subset's entropy by |D^v| / |D|.
            feature_ent += (len(extract_data) / num_samples) * self.calcutaleEnt(extract_data)
        tmp_infor_gain = base_ent - feature_ent
        print("feature =", feature, "inforGain =", tmp_infor_gain)
        if tmp_infor_gain > best_inforgain:
            best_inforgain = tmp_infor_gain
            best_feature = feature
    return best_feature
def splitData(self, data_set, feature, value):
    # Keep only the rows where `feature` equals `value`,
    # then drop the feature column itself.
    columns = list(data_set.columns)
    columns.remove(feature)
    restData = data_set[data_set[feature] == value]
    return restData[columns]
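The split can be checked directly with plain pandas; a tiny made-up frame is enough:

```python
import pandas as pd

df = pd.DataFrame({"knock": ["dull", "crisp", "dull"],
                   "touch": ["hard_smooth", "hard_smooth", "soft_sticky"],
                   "good":  ["yes", "no", "yes"]})

# Same logic as splitData: select rows where knock == 'dull', drop the column.
sub = df[df["knock"] == "dull"].drop(columns=["knock"])
print(sub)
```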
Finally, we build the tree and use it to predict.
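The tree-construction code itself is not shown here; what follows is a minimal standalone ID3 sketch — plain functions rather than the class methods above, and all names are my own — that builds the same kind of nested dict the predicter walks:

```python
import math
import pandas as pd

def entropy(labels):
    # Entropy of a Series of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in labels.value_counts())

def best_feature(df, label):
    # Feature with the largest information gain.
    base = entropy(df[label])
    gains = {}
    for fea in df.columns.drop(label):
        weighted = sum(len(sub) / len(df) * entropy(sub[label])
                       for _, sub in df.groupby(fea))
        gains[fea] = base - weighted
    return max(gains, key=gains.get)

def create_tree(df, label):
    labels = df[label]
    if labels.nunique() == 1:      # pure node: return its class
        return labels.iloc[0]
    if df.shape[1] == 1:           # no features left: majority vote
        return labels.mode()[0]
    fea = best_feature(df, label)
    tree = {fea: {}}
    for val, sub in df.groupby(fea):
        # One subtree per feature value, built on the remaining features.
        tree[fea][val] = create_tree(sub.drop(columns=[fea]), label)
    return tree
```

On the watermelon data this yields a nested dict such as `{'texture': {'clear': {...}, 'blurry': 'no', ...}}`, which is exactly the structure the prediction code below expects.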
def predicter(self, tree, features, testX):
    # The tree is a nested dict: {feature: {value: subtree or class label}}.
    first_key = list(tree.keys())[0]
    second_dict = tree[first_key]
    class_lable = None  # stays None if the test value matches no branch
    for key in second_dict.keys():
        if testX[first_key][0] == key:
            if isinstance(second_dict[key], dict):
                # Internal node: recurse into the matching subtree.
                class_lable = self.predicter(second_dict[key], features, testX)
            else:
                # Leaf node: the stored value is the predicted class.
                class_lable = second_dict[key]
    return class_lable
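The same traversal logic can be checked standalone with a plain function and a dict-based sample; the tiny tree here is made up for illustration:

```python
def predict(tree, sample):
    # Walk the nested-dict tree until a leaf (non-dict) is reached.
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    branch = tree[feature].get(sample.get(feature))
    return predict(branch, sample) if branch is not None else None

tree = {"texture": {"clear": {"knock": {"dull": "yes", "muffled": "no"}},
                    "blurry": "no"}}
print(predict(tree, {"texture": "clear", "knock": "dull"}))   # yes
print(predict(tree, {"texture": "blurry"}))                   # no
```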