1. About the decision tree algorithm:
The kNN algorithm in chapter 2 did a great job of classifying, but it didn't lead to any major insights about the data. One of the best things about decision trees is that humans can easily understand the data. In other words, kNN cannot explain the intrinsic structure of the data, whereas a decision tree is easy for people to understand (good interpretability).
2. The general process of constructing a decision tree
When constructing a decision tree, the first question is which feature of the current dataset plays the decisive role in splitting the data into classes. To find that decisive feature and produce the best split, we must evaluate every feature.
The general workflow — the process of building decision trees from nothing but a pile of data — is as follows:
(1) Collect the data.
(2) Prepare the data: the tree-construction algorithm works only on nominal data, so numeric values must be discretized.
(3) Analyze the data: after the tree is built, check that it matches expectations.
(4) Train the algorithm: construct the tree's data structure.
(5) Test the algorithm: use the learned tree to compute the error rate.
(6) Use the algorithm.
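Step (2) above can be sketched concretely. The bin edges, labels, and the `discretize` helper below are hypothetical, chosen only to illustrate turning a numeric feature into a nominal one before tree construction:

```python
def discretize(value, edges, labels):
    """Map a numeric value to a nominal label using sorted bin edges.

    edges has one fewer element than labels; a value below edges[i]
    gets labels[i], and anything past the last edge gets labels[-1].
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

# Example: discretize an 'age' feature into four nominal categories.
ages = [12, 25, 41, 67]
nominal = [discretize(a, edges=[18, 40, 65],
                      labels=['minor', 'young', 'middle', 'senior'])
           for a in ages]
# nominal == ['minor', 'young', 'middle', 'senior']
```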
3. Tree Construction
Information theory: We'll first discuss the mathematics that decide how to split a dataset, using something called information theory.
(1) Which feature is used to split the data?
To determine this, you try every feature and measure which split will give you the best results; that is, we pick the feature whose split produces the best outcome.
Then what does "best result" mean? We choose to split our dataset in a way that makes our unorganized data more organized; in information-theoretic terms, the split that lowers the entropy and reduces the disorder the most.
(2) Information Gain
Definition: the change in information before and after the split is known as the information gain, i.e., the entropy before the split minus the entropy after it.
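The definition can be checked on a toy example. The records, feature values, and the `entropy` helper below are illustrative assumptions, not the book's code; the post-split entropy is the average of the subset entropies weighted by subset size:

```python
from math import log
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

# Toy records: (feature value, class label). The classes split 50/50,
# so the entropy before any split is exactly 1 bit.
records = [('sunny', 'no'), ('sunny', 'no'), ('rainy', 'yes'), ('rainy', 'yes')]
labels = [lab for _, lab in records]
base = entropy(labels)

# Entropy after splitting on the feature: weighted average over subsets.
after = 0.0
for value in set(f for f, _ in records):
    subset = [lab for f, lab in records if f == value]
    after += len(subset) / len(records) * entropy(subset)

gain = base - after  # both subsets are pure, so the gain is the full 1 bit
```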
A. Entropy
Entropy H is defined as the expected value of the information: H = -sum_i p(x_i) * log2(p(x_i)), where p(x_i) is the probability of class i in the dataset.
The code to compute the entropy is shown below:
```python
from math import log

'''
@abstract: calculate the shannonEntropy of the input dataSet
@input: dataSet ==> type (list of lists; the last column is the class label)
@output: the shannonEntropy of the input dataSet ==> type (float)
'''
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}                       # class label -> count
    for featVec in dataSet:
        label = featVec[-1]                # class label is the last element
        labelCounts[label] = labelCounts.get(label, 0) + 1
    shannonEnt = 0.0
    for count in labelCounts.values():
        prob = float(count) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2: entropy in bits
    return shannonEnt
```