Translated from the Dataquest website. Since the site is paywalled and I found this material quite good, I've translated it here for others' reference and for my own review. Some additional material comes from Zhou Zhihua's book Machine Learning (《机器学习》).
Decision trees are a common family of machine learning methods. As a binary-classification example, consider the question "should I fight this bear?":

Given data on people who survived face-to-face encounters with bears, we can use a decision tree to choose the action that maximizes our odds of survival.

A decision tree is a supervised learning algorithm: we first build the tree from historical data, then use it to predict an outcome. The main advantage of decision trees over linear regression is that they can pick up nonlinear interactions in the data. In the bear example, a decision tree can learn that when the bear is a grizzly, running away never leads to survival, whereas a linear regression would have to weigh both factors even though one combination never occurs in practice.
First, read in a CSV file about whether a person's income is high. The file can be downloaded from http://archive.ics.uci.edu/ml/datasets/Adult.
import pandas
# Set index_col to False to avoid pandas thinking that the first column is row indexes (it's age)
income = pandas.read_csv("income.csv", index_col=False)
print(income.head(5))
Next, convert the string-valued columns to categorical codes:
# Convert a single column from text categories to numbers
col = pandas.Categorical.from_array(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))

col = pandas.Categorical.from_array(income["education"])
income["education"] = col.codes
col = pandas.Categorical.from_array(income["marital_status"])
income["marital_status"] = col.codes

for i in ["occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical.from_array(income[i])
    income[i] = col.codes

print(income.head(5))
It's worth pointing out how handy Categorical.from_array is here: col.codes turns a column of strings into a column of integer category codes.
col = pandas.Categorical.from_array(income["workclass"])
income["workclass"] = col.codes
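One caveat: in newer pandas releases Categorical.from_array has been removed; the plain pandas.Categorical constructor does the same job:

# Equivalent in modern pandas
col = pandas.Categorical(income["workclass"])
income["workclass"] = col.codes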
A decision tree is made up of a series of nodes and branches, like this:

As shown above, a node splits the data into two branches, N and Y, based on whether the person works in the private sector (the workclass column). We've already mapped "Private" to the code 4, so N corresponds to workclass != 4 and Y corresponds to workclass == 4.

Going one level deeper:

And so on.
private_incomes = income[income["workclass"] == 4]
public_incomes = income[income["workclass"] != 4]
The nodes at the bottom of the tree, where we stop splitting, are called terminal nodes, or leaves. We don't split at random; the goal is to end up with something we can use for prediction, and to get there, every leaf must contain only a single value of the target column we want to predict.

In this example we use the high_income column as the target: high_income is 1 when a person earns more than 50k per year, and 0 otherwise.

So we keep splitting until every node contains only a single high_income value.
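In code, a node's purity can be checked by looking at the unique values of its target column; the private-sector subset from the split above, for instance, is not yet pure:

# A pure node would have a single unique value here; this one still has both 0 and 1
print(pandas.unique(private_incomes["high_income"]))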
Here we'll learn how to split. Before splitting, we need a splitting criterion; a common one is entropy, which measures how "pure" a node is.
Entropy is defined as:

H(T) = -∑ᵢ p(i) · log₂ p(i)

where p(i) is the proportion of rows that take the i-th value of the target T.
If you have any background in information theory, you'll recognize that the higher the entropy, the more disorder, and the lower the purity.
import math

# We'll do the same calculation we did above, but in Python
# Passing in 2 as the second parameter to math.log will take a base 2 log
entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))
print(entropy)
# Two equivalent ways to compute the proportion of low-income rows
low_income_len = len(income[income["high_income"] == 0]) / len(income)
low_income_len1 = income[income["high_income"] == 0].shape[0] / income.shape[0]
print(low_income_len, low_income_len1)
high_income_len = 1 - low_income_len

income_entropy = -(high_income_len * math.log(high_income_len, 2) + low_income_len * math.log(low_income_len, 2))
Next we define information gain:

IG(T, A) = H(T) - ∑ᵥ (|Tᵥ| / |T|) · H(Tᵥ)

where the sum runs over the branches v produced by splitting on A, and |Tᵥ|/|T| is the fraction of rows that fall into branch v.
To understand it simply: IG is computed for a particular target T (here, high_income) given a candidate split A. We compute the entropy of T, then the entropy of T within branch v1 of A, then within v2, and so on. We weight each branch's entropy by the fraction of rows in that branch and add them up, giving the entropy after splitting on A; subtracting this from the original entropy of T gives the information gain.
import numpy
def calc_entropy(column):
    """
    Calculate entropy given a pandas series, list, or numpy array.
    """
    # Compute the counts of each unique value in the column
    counts = numpy.bincount(column)
    # Divide by the total column length to get a probability
    probabilities = counts / len(column)

    # Initialize the entropy to 0
    entropy = 0
    # Loop through the probabilities, and add each one to the total entropy
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)

    return -entropy

# Verify that our function matches our answer from earlier
entropy = calc_entropy([1, 1, 0, 0, 1])
print(entropy)

information_gain = entropy - ((.8 * calc_entropy([1, 1, 0, 0])) + (.2 * calc_entropy([1])))
print(information_gain)
median_age = income["age"].median()
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

age_information_gain = income_entropy - (
    (left_split.shape[0] / income.shape[0]) * calc_entropy(left_split["high_income"]) +
    (right_split.shape[0] / income.shape[0]) * calc_entropy(right_split["high_income"])
)
Now we need to find which variable gives the best information gain when we split on it (i.e., the split that leaves the purest branches, with the lowest entropy).
def calc_information_gain(data, split_name, target_name):
    """
    Calculate information gain given a data set, column to split on, and target
    """
    # Calculate the original entropy
    original_entropy = calc_entropy(data[target_name])

    # Find the median of the column we're splitting
    column = data[split_name]
    median = column.median()

    # Make two subsets of the data, based on the median
    left_split = data[column <= median]
    right_split = data[column > median]

    # Loop through the splits and calculate the subset entropies
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0])
        to_subtract += prob * calc_entropy(subset[target_name])

    # Return information gain
    return original_entropy - to_subtract

# Verify that our answer is the same as on the last screen
print(calc_information_gain(income, "age", "high_income"))

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

information_gains = []
for col in columns:
    gain = calc_information_gain(income, col, "high_income")
    information_gains.append(gain)

highest_gain = columns[information_gains.index(max(information_gains))]
print(highest_gain)
Note that splitting on a single continuous variable usually requires several splits, not just one.
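For instance, a rough sketch of re-splitting a continuous column (illustration only; the variable names are mine): each half can be split again at its own median.

# First split at the overall median of the continuous column
age_median = income["age"].median()
left = income[income["age"] <= age_median]
right = income[income["age"] > age_median]

# Each half can then be split again at its own median, and so on
left_left = left[left["age"] <= left["age"].median()]
left_right = left[left["age"] > left["age"].median()]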
ID3 is a common algorithm for building decision trees; understanding it calls for some familiarity with recursion and its time complexity. Loosely speaking, recursion means breaking a big problem into many small steps: a recursive function calls itself, then combines the results into the final answer.

Building a decision tree is a textbook use of recursion: at each node we call a recursive function; the node splits into two branches, each branch produces a new node, and each new node calls the recursive function again, gradually building out the full tree.
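As a minimal illustration of recursion (not part of the original tutorial): every recursive function needs a base case that stops the recursion, just as id3 stops once a node is pure.

def factorial(n):
    # Base case: stop recursing
    if n <= 1:
        return 1
    # Recursive case: solve a smaller problem and combine the result
    return n * factorial(n - 1)

print(factorial(5))  # 120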
The pseudocode looks like this:
def id3(data, target, columns)
    1  Create a node for the tree
    2  If all values of the target attribute are 1, Return the node, with label = 1
    3  If all values of the target attribute are 0, Return the node, with label = 0
    4  Using information gain, find A, the column that splits the data best
    5  Find the median value in column A
    6  Split column A into values below or equal to the median (0), and values above the median (1)
    7  For each possible value (0 or 1), vi, of A
    8      Add a new tree branch below Root that corresponds to rows of data where A = vi
    9      Let Examples(vi) be the subset of examples that have the value vi for A
    10     Below this new branch add the subtree id3(data[A==vi], target, columns)
    11 Return Root
Here's a worked example: suppose we want to predict high_income from age and marital_status.
high_income    age    marital_status
0              20     0
0              60     2
0              40     1
1              25     1
1              35     2
1              55     1
Walking through the pseudocode: lines 2 and 3 don't apply, since the target has both values. At line 4, the information-gain calculation picks age as the best split. At line 5, the median age is 37.5. At line 6, rows with age less than or equal to 37.5 go to branch 0 and the rest to branch 1. We then enter the loop at line 7 and recurse into id3() at line 10. Call the node we started from node 1.
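We can check that arithmetic directly. Here we also build the example table above as a dataframe called data, which the id3 code below reuses:

# The example table above as a dataframe
data = pandas.DataFrame([
    [0, 20, 0],
    [0, 60, 2],
    [0, 40, 1],
    [1, 25, 1],
    [1, 35, 2],
    [1, 55, 1]
], columns=["high_income", "age", "marital_status"])

example_median = data["age"].median()
print(example_median)                       # 37.5
print(data[data["age"] <= example_median])  # the rows that flow into branch 0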
Entering id3 again, we're at node 2, which holds these rows:
high_income    age    marital_status
0              20     0
1              25     1
1              35     2
Splitting node 2 again at its median age (25) gives node 3:

high_income    age    marital_status
0              20     0
1              25     1
At node 4 (the left branch of node 3, containing only the age-20 row), we stop at line 3 of the pseudocode, since every target value is 0. We then return back up the tree, and the remaining subsets attach as leaves.
Finding the best column to split on:
def find_best_column(data, target_name, columns):
    # Automatically find the column in columns to split on
    # data is a dataframe
    # target_name is the name of the target variable
    # columns is a list of potential columns to split on
    gain_feature = []
    for col in columns:
        information_gain = calc_information_gain(data, col, target_name)
        gain_feature.append(information_gain)
    best_gain = gain_feature.index(max(gain_feature))
    return columns[best_gain]

# A list of columns to potentially split income with
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

income_split = find_best_column(income, "high_income", columns)
Now we can store the whole tree rather than just the leaf labels. We'll use nested dictionaries to do this: a dictionary represents each node, with left and right keys for its branches. We store the column we split on under the key column and the median under the key median. Leaves store their class under the key label. We also number every node under the key number.
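For instance, a miniature tree in this format might look like the following (the values here are made up purely for illustration):

tree_example = {
    "number": 1,
    "column": "age",
    "median": 37.5,
    "left":  {"number": 2, "label": 0},
    "right": {"number": 3, "label": 1},
}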
The updated pseudocode:
def id3(data, target, columns, tree)
    1  Create a node for the tree
    2  Number the node
    3  If all of the values of the target attribute are 1, assign 1 to the label key in tree
    4  If all of the values of the target attribute are 0, assign 0 to the label key in tree
    5  Using information gain, find A, the column that splits the data best
    6  Find the median value in column A
    7  Assign the column and median keys in tree
    8  Split A into values less than or equal to the median (0), and values above the median (1)
    9  For each possible value (0 or 1), vi, of A
    10     Add a new tree branch below Root that corresponds to rows of data where A = vi
    11     Let Examples(vi) be the subset of examples that have the value vi for A
    12     Create a new key with the name corresponding to the side of the split (0=left, 1=right). The value of this key should be an empty dictionary.
    13     Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])
    14 Return Root
The changes from the earlier version (shown in red in the original article) are the node numbering and the keys stored in the tree dictionary. The Python code:
# Create a dictionary to hold the tree
# It has to be outside of the function so we can access it later
tree = {}

# This list will let us number the nodes
# It has to be a list so we can access it inside the function
nodes = []

def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])

    # Assign the number key to the node dictionary
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        # This is a leaf, so assign the "label" field to the node dictionary
        tree["label"] = unique_targets[0]
        return

    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()

    # Assign the "column" and "median" fields to the node dictionary
    tree["column"] = best_column
    tree["median"] = column_median

    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]

    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

# Call the function on the example data set defined earlier to build the tree
id3(data, "high_income", ["age", "marital_status"], tree)
Now let's print the decision tree out:
def print_with_depth(string, depth):
    # Add space before a string
    prefix = " " * depth
    # Print a string, and indent it appropriately
    print("{0}{1}".format(prefix, string))

def print_node(tree, depth):
    # Check for the presence of "label" in the tree
    if "label" in tree:
        # If found, then this is a leaf, so print it and return
        print_with_depth("Leaf: Label {0}".format(tree["label"]), depth)
        # This is critical -- without it, you'll get infinite recursion
        return
    # Print information about what the node is splitting on
    print_with_depth("{0} > {1}".format(tree["column"], tree["median"]), depth)

    # Create a list of tree branches
    branches = [tree["left"], tree["right"]]

    # Recursively call print_node on each branch, incrementing depth as we go
    for b in branches:
        print_node(b, depth + 1)

print_node(tree, 0)
Run on the example data above, this should print something like the following (the output is reconstructed here, since the original screenshot isn't included):
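age > 37.5
 age > 25.0
  age > 22.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 1
 age > 55.0
  age > 47.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 0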
Making predictions with the decision tree. The pseudocode:
def predict(tree, row):
    1 Check for the presence of "label" in the tree dictionary
    2 If found, return tree["label"]
    3 Extract tree["column"] and tree["median"]
    4 Check whether row[tree["column"]] is less than or equal to tree["median"]
    5 If it's less than or equal, call predict(tree["left"], row) and return the result
    6 If it's greater, call predict(tree["right"], row) and return the result
The Python code:
def predict(tree, row):
    if "label" in tree:
        return tree["label"]

    column = tree["column"]
    median = tree["median"]

    # If the row's value is less than or equal to the median, predict using the
    # left branch; otherwise use the right branch
    if row[column] <= median:
        return predict(tree["left"], row)
    else:
        return predict(tree["right"], row)

# Print the prediction for the first row in our data
print(predict(tree, data.iloc[0]))
Predicting on a batch of rows:
new_data = pandas.DataFrame([
    [40, 0],
    [20, 2],
    [80, 1],
    [15, 1],
    [27, 2],
    [38, 1]
])
# Assign column names to the data
new_data.columns = ["age", "marital_status"]

def batch_predict(tree, df):
    # Apply the predict function across the rows
    return df.apply(lambda x: predict(tree, x), axis=1)

predictions = batch_predict(tree, new_data)
Of course, ID3 is a simple version of decision tree learning. The more sophisticated C4.5 selects the splitting attribute with the gain ratio, and CART selects it with the Gini index, but the underlying idea is the same.
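To make those two criteria concrete, here is a rough sketch of both (my own illustration, not from the original tutorial; it reuses the calc_information_gain helper defined above and the same median split):

# Gain ratio (C4.5): information gain divided by the "intrinsic value" of the
# split, i.e. the entropy of the split proportions themselves
def calc_gain_ratio(data, split_name, target_name):
    column = data[split_name]
    median = column.median()
    intrinsic_value = 0
    for subset in [data[column <= median], data[column > median]]:
        prob = subset.shape[0] / data.shape[0]
        if prob > 0:
            intrinsic_value -= prob * math.log(prob, 2)
    if intrinsic_value == 0:
        return 0
    return calc_information_gain(data, split_name, target_name) / intrinsic_value

# Gini index (CART): 1 minus the sum of squared class probabilities
def calc_gini(column):
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    return 1 - sum(p ** 2 for p in probabilities)

print(calc_gain_ratio(income, "age", "high_income"))
print(calc_gini(income["high_income"]))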
The data set comes from http://archive.ics.uci.edu/ml/datasets/Adult. We can use the scikit-learn package to fit decision trees for us. The interface is very similar to the other algorithms we've used before: the DecisionTreeClassifier class handles classification problems and DecisionTreeRegressor handles regression problems, and both live in the sklearn.tree package. Since we're predicting a binary outcome here, we'll use the classifier. The first step is to train the classifier on the data, using its fit method.
from sklearn.tree import DecisionTreeClassifier
# A list of columns to train with
# We've already converted all columns to numeric
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

# Instantiate the classifier
# Set random_state to 1 to make sure the results are consistent
clf = DecisionTreeClassifier(random_state=1)

# We've already loaded the variable "income," which contains all of the income data
clf.fit(income[columns], income["high_income"])
The output shows clf's parameters. Note criterion='gini': by default scikit-learn uses the Gini index mentioned above rather than entropy.
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')
Create a training set from 80% of the data, and a test set from the rest:
import numpy
import math

# Set a random seed so the shuffle is the same every time
numpy.random.seed(1)

# Shuffle the rows
# This permutes the index randomly using numpy.random.permutation
# Then, it reindexes the dataframe with the result
# The net effect is to put the rows into random order
income = income.reindex(numpy.random.permutation(income.index))

# Use 80% of the rows for training
train_max_row = math.floor(income.shape[0] * .8)

train = income[0:train_max_row]
test = income[train_max_row:]
We can use AUC (area under the ROC curve) to judge whether the model is overfitting: if the AUC on the training set is much higher than on the test set, the tree has overfit.
from sklearn.metrics import roc_auc_score
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
print(test_auc)
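To actually judge overfitting we need a comparison point, so let's also compute the AUC on the training set with the same model (this step is my addition; the depth-limited example below makes the same comparison):

# If the training AUC is much higher than the test AUC, the tree has overfit
train_predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], train_predictions))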
There are three main ways to combat overfitting:

1) Prune the tree.
2) Use an ensemble that blends the predictions of multiple trees.
3) Restrict the depth of the tree while building it.

Restricting the depth is a common approach:
# A decision tree with a limited depth and a minimum number of rows per split
clf = DecisionTreeClassifier(random_state=1, max_depth=7, min_samples_split=13)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)