Reading Notes on "Machine Learning" (T. M. Mitchell, CMU, 1997): Chapter 3

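This chapter introduces a classic family of ML classification algorithms: decision tree learning. Many versions of decision-tree-based classifiers exist (they differ in implementation details, and new versions may still be proposed), but fortunately the underlying logic and the optimization principle behind them are largely shared across versions. That is why the chapter can concentrate on the underlying logic and optimization principle of decision tree learning rather than on overly specific implementation details (although those details matter too).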


A. What is "Decision Tree Learning"?

(1). "Decision tree learning is one of the most widely used and practical methods for inductive inference.

It is a method for approximating discrete-valued functions, in which the learned function is represented by a decision tree, that is robust to noisy data and capable of learning disjunctive expressions."

(2). "These decision tree learning methods search a completely expressive hypothesis space and thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over large trees."

(3). "Learned trees can also be re-represented as sets of if-then rules to improve human readability. In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions."

(4). "Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees (such as the ID3 algorithm [Quinlan 1986] and its successor C4.5 [Quinlan 1993])."


B. Which kinds of practical problems is "Decision Tree Learning" suited to?

“Decision tree learning is generally best suited to problems with the following characteristics:

(1). Instances are represented by attribute-value pairs. The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values.

(2). The target function has discrete output values.

(3). Disjunctive descriptions may be required.

(4). The training data may contain errors.

(5). The training data may contain missing attribute values.”


C. "Which Attribute Is the Best Classifier?"

(1). "The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree."

(2). "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."

(3). "One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability)."

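Entropy is one of the fundamental concepts of information theory, and decision tree classifiers borrow it to compute information gain. Fortunately, it takes only a little time to master the formulas and the procedure for computing entropy and information gain. For example, the R code corresponding to the book's calculation of Gain(S, Temperature) is: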

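    # Entropy of the full PlayTennis sample S: 9 positive and 5 negative examples
    e_sum <- -((5 / 14) * log(5 / 14, 2) + (9 / 14) * log(9 / 14, 2))
    # Weighted entropy after splitting on Temperature (Hot 2+/2-, Mild 4+/2-, Cool 3+/1-).
    # Each line ends with an operator so that R parses the whole sum as one expression.
    e_sep <- -(4 / 14) * ((1 / 2) * log(1 / 2, 2) + (1 / 2) * log(1 / 2, 2)) -
      (6 / 14) * ((4 / 6) * log(4 / 6, 2) + (2 / 6) * log(2 / 6, 2)) -
      (4 / 14) * ((3 / 4) * log(3 / 4, 2) + (1 / 4) * log(1 / 4, 2))
    # Gain(S, Temperature), about 0.029
    e_sum - e_sep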
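
As a complement to the hand calculation above, the following minimal Python sketch computes the gain of every attribute from the raw examples, using Entropy(S) = -sum_i p_i log2 p_i and Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v). The function names `entropy` and `information_gain` and the tuple layout are assumptions for illustration; the data are the 14 PlayTennis examples from the book's Table 3.2.

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for examples D1..D14
ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")
EXAMPLES = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attr_index):
    """Expected reduction in entropy from splitting on one attribute."""
    labels = [ex[-1] for ex in examples]
    subsets = {}
    for ex in examples:  # group the class labels by attribute value
        subsets.setdefault(ex[attr_index], []).append(ex[-1])
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


gains = {name: information_gain(EXAMPLES, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in gains.items():
    print(f"Gain(S, {name}) = {value:.3f}")
```

On this data Temperature comes out at roughly 0.029, in line with the calculation above, and Outlook has the largest gain, which is why ID3 places it at the root of the tree.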


D. "Hypothesis Space Search in Decision Tree Learning"

(1). "ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The evaluation function that guides this hill-climbing search is the information gain measure."

(2). "ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes."

(3). "ID3 in its pure form performs no backtracking in its search. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal."

(4). "ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples."
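
The greedy, no-backtracking search described in these quotes can be sketched in a few lines of Python. This is a simplified illustration rather than the book's exact pseudocode: it only creates branches for attribute values that actually occur at a node, and the names `entropy`, `gain`, and `id3` as well as the dictionary encoding of the tree are assumptions. The data are the PlayTennis examples from the book's Table 3.2.

```python
from collections import Counter
from math import log2

TARGET = "PlayTennis"
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
EXAMPLES = [dict(zip(ATTRIBUTES + [TARGET], line.split()))
            for line in RAW.strip().splitlines()]


def entropy(examples):
    counts = Counter(ex[TARGET] for ex in examples)
    return -sum(n / len(examples) * log2(n / len(examples)) for n in counts.values())


def gain(examples, attribute):
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder


def id3(examples, attributes):
    labels = [ex[TARGET] for ex in examples]
    if len(set(labels)) == 1:  # pure node: stop with a leaf
        return labels[0]
    if not attributes:  # nothing left to test: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: commit to the highest-gain attribute and never revisit it.
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, remaining)
    return {best: branches}


tree = id3(EXAMPLES, ATTRIBUTES)
print(tree)
```

Because every split uses all examples at the node and is never undone, the sketch shows both properties from the quotes: the choice is statistically based, and the result is only a local optimum of the search.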


E. The "Inductive Bias" of the ID3 Decision Tree Algorithm

(1). "A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not."

ID3's preference for shorter trees is an instance of the "Occam's razor" principle (but why should such a preference exist in the first place?).

F. “Restriction Bias vs. Preference Bias”

(1). "A preference bias (or a search bias) is a preference for certain hypotheses over others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. A restriction bias (or a language bias) is in the form of a categorical restriction on the set of hypotheses considered."

(2). "Typically, a preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function."

The restriction/preference bias distinction has much in common with the familiar ML notion of the bias-variance trade-off, but it is framed from the perspective of search: a restriction bias corresponds to searching an incomplete hypothesis space completely, whereas a preference bias corresponds to searching a complete hypothesis space incompletely.


G. "Why Prefer Short Hypotheses?"

(1). "Occam's razor: Prefer the simplest hypothesis that fits the data."

(2). "The size of a hypothesis is determined by the particular representation used internally by the learner."

(3). "The question of which internal representations might arise from a process of evolution and natural selection."

(4). "Evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation easier than it can alter the learning algorithm."

(5). "Minimum Description Length principle, a version of Occam's razor that can be interpreted within a Bayesian framework."



H. "Issues in Decision Tree Learning"

(1). "avoiding over-fitting the data"

i. "Given a hypothesis space H, a hypothesis h is said to overfit the training data if there exists some alternative hypothesis h', such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances."

ii. "There are several approaches to avoiding over-fitting in decision tree learning. These can be grouped into two classes: A). approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data, B). approaches that allow the tree to overfit the data, and then post-prune the tree. Although the first of these approaches might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree."

iii. "Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size. The most common is the training and validation set approach. The motivation is this: Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. Therefore, the validation set can be expected to provide a safety check against over-fitting the spurious characteristics of the training set. Of course, it is important that the validation set be large enough to itself provide a statistically significant sample of the instances. One common heuristic is to withhold one-third of the available examples for the validation set, using the other two-thirds for training."

iv. "Reduced-error pruning" + "rule post-pruning" ("Although this heuristic method is not statistically valid, it has nevertheless been found useful in practice.")
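
To illustrate how a validation set drives post-pruning, here is a minimal Python sketch of a bottom-up variant of reduced-error pruning. It is a simplification under stated assumptions: the book's procedure repeatedly picks the single node whose removal helps validation accuracy most, whereas this version visits each node once from the leaves upward. The overfit tree (with a spurious Temperature test), the stored `majority` labels, the validation examples, and all helper names are hypothetical.

```python
def split(attribute, majority, branches):
    """Internal node; `majority` is the most common training label at the node."""
    return {"attribute": attribute, "majority": majority, "branches": branches}


def example(outlook, temperature, humidity, wind, label):
    features = {"Outlook": outlook, "Temperature": temperature,
                "Humidity": humidity, "Wind": wind}
    return features, label


def classify(node, features):
    """Follow branches until a leaf label (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["branches"].get(features[node["attribute"]], node["majority"])
    return node


def errors(node, examples):
    return sum(classify(node, x) != y for x, y in examples)


def prune(node, validation):
    """Replace a subtree by its majority leaf when that is no worse on validation."""
    if not isinstance(node, dict):
        return node
    for value, child in list(node["branches"].items()):
        reaching = [(x, y) for x, y in validation if x[node["attribute"]] == value]
        node["branches"][value] = prune(child, reaching)
    leaf = node["majority"]
    if errors(leaf, validation) <= errors(node, validation):
        return leaf
    return node


# Hypothetical overfit tree: the Temperature test under Sunny/Normal is spurious.
TREE = split("Outlook", "Yes", {
    "Sunny": split("Humidity", "No", {
        "High": "No",
        "Normal": split("Temperature", "Yes",
                        {"Hot": "No", "Mild": "Yes", "Cool": "Yes"}),
    }),
    "Overcast": "Yes",
    "Rain": split("Wind", "Yes", {"Strong": "No", "Weak": "Yes"}),
})

# Hypothetical held-out validation examples.
VALIDATION = [
    example("Sunny", "Hot", "Normal", "Weak", "Yes"),
    example("Sunny", "Mild", "High", "Weak", "No"),
    example("Sunny", "Cool", "Normal", "Strong", "Yes"),
    example("Rain", "Mild", "High", "Strong", "No"),
    example("Rain", "Cool", "Normal", "Weak", "Yes"),
    example("Overcast", "Hot", "High", "Weak", "Yes"),
]

errors_before = errors(TREE, VALIDATION)
pruned = prune(TREE, VALIDATION)  # note: prunes the tree in place
errors_after = errors(pruned, VALIDATION)
print(errors_before, errors_after)
```

Only the spurious Temperature node is collapsed into a leaf; the Humidity and Wind tests survive because replacing them by a majority leaf would increase the validation error.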

(2). "Incorporating continuous-valued attributes"

i. "For an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute Ac that is true if A < c and false otherwise. The only question is how to select the best value for the threshold c."

ii. "By sorting the examples according to the continuous attribute A, then identifying adjacent examples that differ in their target classification, we can generate a set of candidate thresholds midway between the corresponding values of A. It can be shown that the value of c that maximizes information gain must always lie at such a boundary (Fayyad 1991)."
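
The threshold selection described above can be sketched as follows. The helper names `candidate_thresholds` and `gain_for_threshold` are assumptions for illustration; the six Temperature readings and their PlayTennis labels are the small example used in the book for this topic.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b and a != b]


def gain_for_threshold(values, labels, c):
    """Information gain of the boolean attribute A_c, which is true when A < c."""
    below = [label for v, label in zip(values, labels) if v < c]
    above = [label for v, label in zip(values, labels) if v >= c]
    n = len(labels)
    return (entropy(labels)
            - len(below) / n * entropy(below)
            - len(above) / n * entropy(above))


temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

candidates = candidate_thresholds(temperature, play_tennis)
best = max(candidates, key=lambda c: gain_for_threshold(temperature, play_tennis, c))
print(candidates, best)
```

Only two of the five possible midpoints need to be evaluated here (54 and 85), and the threshold at 54 yields the higher information gain.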

(3). "Alternative Measures for Selecting Attributes"

i. "There is a natural bias in the information gain measure that favors attributes with many values over those with few values."

ii. "One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986)."
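
A minimal Python sketch of the gain ratio, assuming the standard definitions SplitInformation(S, A) = -sum_i (|S_i| / |S|) log2(|S_i| / |S|) and GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A). The function names, the zero-denominator guard, and the identifier-like `DAY` attribute (one distinct value per example) are assumptions for illustration; the Outlook values and labels are those of the 14 PlayTennis examples.

```python
from collections import Counter
from math import log2


def entropy(items):
    total = len(items)
    return -sum(n / total * log2(n / total) for n in Counter(items).values())


def gain(values, labels):
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    # Split information is the entropy of S with respect to the attribute's values.
    split_information = entropy(values)
    if split_information == 0:  # assumption: a one-valued attribute scores zero
        return 0.0
    return gain(values, labels) / split_information


LABELS = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
OUTLOOK = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
DAY = [f"D{i}" for i in range(1, 15)]  # hypothetical many-valued attribute

for name, values in (("Outlook", OUTLOOK), ("Day", DAY)):
    print(f"{name}: gain = {gain(values, LABELS):.3f}, "
          f"gain ratio = {gain_ratio(values, LABELS):.3f}")
```

On these 14 examples the score of `DAY` falls from about 0.940 (plain gain) to about 0.247 (gain ratio), while Outlook goes from about 0.247 to about 0.156. The penalty therefore narrows the advantage of the many-valued attribute considerably, although in such a small sample it does not remove it.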

(4). "Handling training examples with missing attribute values"

i. "It is common to estimate the missing attribute value based on other examples for which this attribute has a known value."

ii. "One strategy for dealing with the missing attribute value is to assign it the value that is most common among training examples at node n. Alternatively, we might assign it the most common value among examples at node n that have the classification c(x)."

iii. "A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). "
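
The strategies above can be sketched in Python as follows. The helper names and the class labels are hypothetical; the counts (six known examples with A = 1 and four with A = 0 at the node) follow the numerical illustration given in the book, and `None` is an assumed marker for a missing value.

```python
from collections import Counter


def most_common_value(examples, attribute, label=None):
    """Most common known value at the node, optionally within one class only."""
    known = [x[attribute] for x, y in examples
             if x[attribute] is not None and (label is None or y == label)]
    return Counter(known).most_common(1)[0][0]


def value_probabilities(examples, attribute):
    """Estimated probability of each value, from the examples where it is known."""
    known = [x[attribute] for x, _ in examples if x[attribute] is not None]
    return {value: n / len(known) for value, n in Counter(known).items()}


# Hypothetical examples at node n: (features, class label).
NODE_EXAMPLES = (
    [({"A": 1}, "Yes")] * 5 + [({"A": 1}, "No")] * 1 +
    [({"A": 0}, "Yes")] * 1 + [({"A": 0}, "No")] * 3
)
incomplete = ({"A": None}, "No")  # the example x whose value A(x) is missing

# Strategy 1: most common value among all examples at the node.
fill_overall = most_common_value(NODE_EXAMPLES, "A")
# Strategy 1b: most common value among examples sharing the classification c(x).
fill_same_class = most_common_value(NODE_EXAMPLES, "A", label=incomplete[1])
# Strategy 2: send fractional examples down each branch, weighted by probability.
fractions = value_probabilities(NODE_EXAMPLES, "A")
print(fill_overall, fill_same_class, fractions)
```

With these counts the first strategy fills in A = 1, the class-conditional variant fills in A = 0 for a "No" example, and the probabilistic strategy passes 0.6 of the example down the A = 1 branch and 0.4 down the A = 0 branch.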

Among the earliest work on decision tree learning is Hunt's Concept Learning System (CLS) [Hunt et al. 1966] and Friedman and Breiman's work resulting in the CART system [Friedman 1977; Breiman et al. 1984]. For further details on decision tree induction, an excellent book by Quinlan (1993) discusses many practical issues and provides executable code for C4.5.


