Translated from the Dataquest website. Since the site is paywalled and I found this material quite good, I've translated it here for others' reference and for my own review. Some additional material comes from Zhou Zhihua's book Machine Learning (《机器学习》).
Decision trees are a common family of machine learning methods. As a binary-classification example, consider the question "should I fight this bear?":

Given data on people who survived face-to-face encounters with bears, we can use a decision tree to choose the action that maximizes our odds of survival.

A decision tree is a supervised learning algorithm: we first build the tree from historical data, then use it to predict an outcome. The main advantage of decision trees over linear regression is that they can pick up nonlinear interactions in the data. In the bear example, a decision tree can learn that when the bear is a grizzly, running away never leads to survival, whereas a linear regression would have to weigh both factors even though one combination never occurs in practice.
First, read in a CSV file about whether a person's income is high. The file can be downloaded from http://archive.ics.uci.edu/ml/datasets/Adult.
import pandas
# Set index_col to False to avoid pandas thinking that the first column is row indexes (it's age)
income = pandas.read_csv("income.csv", index_col=False)
print(income.head(5))
Next, convert the string-valued columns to categorical codes:
# Convert a single column from text categories to numbers
col = pandas.Categorical.from_array(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))

col = pandas.Categorical.from_array(income["education"])
income["education"] = col.codes
col = pandas.Categorical.from_array(income["marital_status"])
income["marital_status"] = col.codes

for i in ["occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical.from_array(income[i])
    income[i] = col.codes

print(income.head(5))
It's worth pointing out how handy Categorical.from_array is here: col.codes turns a column of strings into a column of integer category codes.
col = pandas.Categorical.from_array(income["workclass"])
income["workclass"] = col.codes
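One caveat: in newer pandas releases Categorical.from_array has been removed; the plain pandas.Categorical constructor does the same job:

# Equivalent in modern pandas
col = pandas.Categorical(income["workclass"])
income["workclass"] = col.codes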
A decision tree is made up of a series of nodes and branches, like this:

As shown above, a node splits the data into two branches, N and Y, based on whether the person works in the private sector (the workclass column). We've already mapped "Private" to the code 4, so N corresponds to workclass != 4 and Y corresponds to workclass == 4.

Going one level deeper:

And so on.
private_incomes = income[income["workclass"] == 4]
public_incomes = income[income["workclass"] != 4]
The nodes at the bottom of the tree, where we stop splitting, are called terminal nodes, or leaves. We don't split at random; the goal is to end up with something we can use for prediction, and to get there, every leaf must contain only a single value of the target column we want to predict.

In this example we use the high_income column as the target: high_income is 1 when a person earns more than 50k per year, and 0 otherwise.

So we keep splitting until every node contains only a single high_income value.
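In code, a node's purity can be checked by looking at the unique values of its target column; the private-sector subset from the split above, for instance, is not yet pure:

# A pure node would have a single unique value here; this one still has both 0 and 1
print(pandas.unique(private_incomes["high_income"]))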
Here we'll learn how to split. Before splitting, we need a splitting criterion; a common one is entropy, which measures how "pure" a node is.
Entropy is defined as:

H(T) = -∑ᵢ p(i) · log₂ p(i)

where p(i) is the proportion of rows that take the i-th value of the target T.
If you have any background in information theory, you'll recognize that the higher the entropy, the more disorder, and the lower the purity.
import math

# We'll do the same calculation we did above, but in Python
# Passing in 2 as the second parameter to math.log will take a base 2 log
entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))
print(entropy)
# Two equivalent ways to compute the proportion of low-income rows
low_income_len = len(income[income["high_income"] == 0]) / len(income)
low_income_len1 = income[income["high_income"] == 0].shape[0] / income.shape[0]
print(low_income_len, low_income_len1)
high_income_len = 1 - low_income_len

income_entropy = -(high_income_len * math.log(high_income_len, 2) + low_income_len * math.log(low_income_len, 2))
Next we define information gain:

IG(T, A) = H(T) - ∑ᵥ (|Tᵥ| / |T|) · H(Tᵥ)

where the sum runs over the branches v produced by splitting on A, and |Tᵥ|/|T| is the fraction of rows that fall into branch v.
To understand it simply: IG is computed for a particular target T (here, high_income) given a candidate split A. We compute the entropy of T, then the entropy of T within branch v1 of A, then within v2, and so on. We weight each branch's entropy by the fraction of rows in that branch and add them up, giving the entropy after splitting on A; subtracting this from the original entropy of T gives the information gain.
import numpy
def calc_entropy(column):
    """
    Calculate entropy given a pandas series, list, or numpy array.
    """
    # Compute the counts of each unique value in the column
    counts = numpy.bincount(column)
    # Divide by the total column length to get a probability
    probabilities = counts / len(column)

    # Initialize the entropy to 0
    entropy = 0
    # Loop through the probabilities, and add each one to the total entropy
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)

    return -entropy

# Verify that our function matches our answer from earlier
entropy = calc_entropy([1, 1, 0, 0, 1])
print(entropy)

information_gain = entropy - ((.8 * calc_entropy([1, 1, 0, 0])) + (.2 * calc_entropy([1])))
print(information_gain)
median_age = income["age"].median()
left_split = income[income["age"] <= median_age]
right_split = income[income["age"] > median_age]

age_information_gain = income_entropy - (
    (left_split.shape[0] / income.shape[0]) * calc_entropy(left_split["high_income"]) +
    (right_split.shape[0] / income.shape[0]) * calc_entropy(right_split["high_income"])
)
Now we need to find which variable gives the best information gain when we split on it (i.e., the split that leaves the purest branches, with the lowest entropy).
def calc_information_gain(data, split_name, target_name):
    """
    Calculate information gain given a data set, column to split on, and target
    """
    # Calculate the original entropy
    original_entropy = calc_entropy(data[target_name])

    # Find the median of the column we're splitting
    column = data[split_name]
    median = column.median()

    # Make two subsets of the data, based on the median
    left_split = data[column <= median]
    right_split = data[column > median]

    # Loop through the splits and calculate the subset entropies
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0])
        to_subtract += prob * calc_entropy(subset[target_name])

    # Return information gain
    return original_entropy - to_subtract

# Verify that our answer is the same as on the last screen
print(calc_information_gain(income, "age", "high_income"))

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

information_gains = []
for col in columns:
    gain = calc_information_gain(income, col, "high_income")
    information_gains.append(gain)

highest_gain = columns[information_gains.index(max(information_gains))]
print(highest_gain)
Note that splitting on a single continuous variable usually requires several splits, not just one.
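For instance, a rough sketch of re-splitting a continuous column (illustration only; the variable names are mine): each half can be split again at its own median.

# First split at the overall median of the continuous column
age_median = income["age"].median()
left = income[income["age"] <= age_median]
right = income[income["age"] > age_median]

# Each half can then be split again at its own median, and so on
left_left = left[left["age"] <= left["age"].median()]
left_right = left[left["age"] > left["age"].median()]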
ID3 is a common algorithm for building decision trees; understanding it calls for some familiarity with recursion and its time complexity. Loosely speaking, recursion means breaking a big problem into many small steps: a recursive function calls itself, then combines the results into the final answer.

Building a decision tree is a textbook use of recursion: at each node we call a recursive function; the node splits into two branches, each branch produces a new node, and each new node calls the recursive function again, gradually building out the full tree.
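As a minimal illustration of recursion (not part of the original tutorial): every recursive function needs a base case that stops the recursion, just as id3 stops once a node is pure.

def factorial(n):
    # Base case: stop recursing
    if n <= 1:
        return 1
    # Recursive case: solve a smaller problem and combine the result
    return n * factorial(n - 1)

print(factorial(5))  # 120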
The pseudocode looks like this:
def id3(data, target, columns)
    1  Create a node for the tree
    2  If all values of the target attribute are 1, Return the node, with label = 1
    3  If all values of the target attribute are 0, Return the node, with label = 0
    4  Using information gain, find A, the column that splits the data best
    5  Find the median value in column A
    6  Split column A into values below or equal to the median (0), and values above the median (1)
    7  For each possible value (0 or 1), vi, of A
    8      Add a new tree branch below Root that corresponds to rows of data where A = vi
    9      Let Examples(vi) be the subset of examples that have the value vi for A
    10     Below this new branch add the subtree id3(data[A==vi], target, columns)
    11 Return Root
Here's a worked example: suppose we want to predict high_income from age and marital_status.
high_income    age    marital_status
0              20     0
0              60     2
0              40     1
1              25     1
1              35     2
1              55     1
Walking through the pseudocode: lines 2 and 3 don't apply, since the target has both values. At line 4, the information-gain calculation picks age as the best split. At line 5, the median age is 37.5. At line 6, rows with age less than or equal to 37.5 go to branch 0 and the rest to branch 1. We then enter the loop at line 7 and recurse into id3() at line 10. Call the node we started from node 1.
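We can check that arithmetic directly. Here we also build the example table above as a dataframe called data, which the id3 code below reuses:

# The example table above as a dataframe
data = pandas.DataFrame([
    [0, 20, 0],
    [0, 60, 2],
    [0, 40, 1],
    [1, 25, 1],
    [1, 35, 2],
    [1, 55, 1]
], columns=["high_income", "age", "marital_status"])

example_median = data["age"].median()
print(example_median)                       # 37.5
print(data[data["age"] <= example_median])  # the rows that flow into branch 0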
Entering id3 again, we're at node 2, which holds these rows:
high_income    age    marital_status
0              20     0
1              25     1
1              35     2
Splitting node 2 again at its median age (25) gives node 3:

high_income    age    marital_status
0              20     0
1              25     1
At node 4 (the left branch of node 3, containing only the age-20 row), we stop at line 3 of the pseudocode, since every target value is 0. We then return back up the tree, and the remaining subsets attach as leaves.
Finding the best column to split on:
def find_best_column(data, target_name, columns):
    # Automatically find the column in columns to split on
    # data is a dataframe
    # target_name is the name of the target variable
    # columns is a list of potential columns to split on
    gain_feature = []
    for col in columns:
        information_gain = calc_information_gain(data, col, target_name)
        gain_feature.append(information_gain)
    best_gain = gain_feature.index(max(gain_feature))
    return columns[best_gain]

# A list of columns to potentially split income with
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

income_split = find_best_column(income, "high_income", columns)
Now we can store the whole tree rather than just the leaf labels. We'll use nested dictionaries to do this: a dictionary represents each node, with left and right keys for its branches. We store the column we split on under the key column and the median under the key median. Leaves store their class under the key label. We also number every node under the key number.
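For instance, a miniature tree in this format might look like the following (the values here are made up purely for illustration):

tree_example = {
    "number": 1,
    "column": "age",
    "median": 37.5,
    "left":  {"number": 2, "label": 0},
    "right": {"number": 3, "label": 1},
}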
The updated pseudocode:
def id3(data, target, columns, tree)
    1  Create a node for the tree
    2  Number the node
    3  If all of the values of the target attribute are 1, assign 1 to the label key in tree
    4  If all of the values of the target attribute are 0, assign 0 to the label key in tree
    5  Using information gain, find A, the column that splits the data best
    6  Find the median value in column A
    7  Assign the column and median keys in tree
    8  Split A into values less than or equal to the median (0), and values above the median (1)
    9  For each possible value (0 or 1), vi, of A
    10     Add a new tree branch below Root that corresponds to rows of data where A = vi
    11     Let Examples(vi) be the subset of examples that have the value vi for A
    12     Create a new key with the name corresponding to the side of the split (0=left, 1=right). The value of this key should be an empty dictionary.
    13     Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])
    14 Return Root
The changes from the earlier version (shown in red in the original article) are the node numbering and the keys stored in the tree dictionary. The Python code:
# Create a dictionary to hold the tree
# It has to be outside of the function so we can access it later
tree = {}

# This list will let us number the nodes
# It has to be a list so we can access it inside the function
nodes = []

def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])

    # Assign the number key to the node dictionary
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        # This is a leaf, so assign the "label" field to the node dictionary
        tree["label"] = unique_targets[0]
        return

    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()

    # Assign the "column" and "median" fields to the node dictionary
    tree["column"] = best_column
    tree["median"] = column_median

    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]

    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])

# Call the function on the example data set defined earlier to build the tree
id3(data, "high_income", ["age", "marital_status"], tree)
Now let's print the decision tree out:
def print_with_depth(string, depth):
    # Add space before a string
    prefix = " " * depth
    # Print a string, and indent it appropriately
    print("{0}{1}".format(prefix, string))

def print_node(tree, depth):
    # Check for the presence of "label" in the tree
    if "label" in tree:
        # If found, then this is a leaf, so print it and return
        print_with_depth("Leaf: Label {0}".format(tree["label"]), depth)
        # This is critical -- without it, you'll get infinite recursion
        return
    # Print information about what the node is splitting on
    print_with_depth("{0} > {1}".format(tree["column"], tree["median"]), depth)

    # Create a list of tree branches
    branches = [tree["left"], tree["right"]]

    # Recursively call print_node on each branch, incrementing depth as we go
    for b in branches:
        print_node(b, depth + 1)

print_node(tree, 0)
Run on the example data above, this should print something like the following (the output is reconstructed here, since the original screenshot isn't included):
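age > 37.5
 age > 25.0
  age > 22.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 1
 age > 55.0
  age > 47.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 0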
Making predictions with the decision tree. The pseudocode:
def predict(tree, row):
    1 Check for the presence of "label" in the tree dictionary
    2 If found, return tree["label"]
    3 Extract tree["column"] and tree["median"]
    4 Check whether row[tree["column"]] is less than or equal to tree["median"]
    5 If it's less than or equal, call predict(tree["left"], row) and return the result
    6 If it's greater, call predict(tree["right"], row) and return the result
The Python code:
def predict(tree, row):
    if "label" in tree:
        return tree["label"]

    column = tree["column"]
    median = tree["median"]

    # If the row's value is less than or equal to the median, predict using the
    # left branch; otherwise use the right branch
    if row[column] <= median:
        return predict(tree["left"], row)
    else:
        return predict(tree["right"], row)

# Print the prediction for the first row in our data
print(predict(tree, data.iloc[0]))
Predicting on a batch of rows:
new_data = pandas.DataFrame([
    [40, 0],
    [20, 2],
    [80, 1],
    [15, 1],
    [27, 2],
    [38, 1]
])
# Assign column names to the data
new_data.columns = ["age", "marital_status"]

def batch_predict(tree, df):
    # Apply the predict function across the rows
    return df.apply(lambda x: predict(tree, x), axis=1)

predictions = batch_predict(tree, new_data)
Of course, ID3 is a simple version of decision tree learning. The more sophisticated C4.5 selects the splitting attribute with the gain ratio, and CART selects it with the Gini index, but the underlying idea is the same.
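To make those two criteria concrete, here is a rough sketch of both (my own illustration, not from the original tutorial; it reuses the calc_information_gain helper defined above and the same median split):

# Gain ratio (C4.5): information gain divided by the "intrinsic value" of the
# split, i.e. the entropy of the split proportions themselves
def calc_gain_ratio(data, split_name, target_name):
    column = data[split_name]
    median = column.median()
    intrinsic_value = 0
    for subset in [data[column <= median], data[column > median]]:
        prob = subset.shape[0] / data.shape[0]
        if prob > 0:
            intrinsic_value -= prob * math.log(prob, 2)
    if intrinsic_value == 0:
        return 0
    return calc_information_gain(data, split_name, target_name) / intrinsic_value

# Gini index (CART): 1 minus the sum of squared class probabilities
def calc_gini(column):
    counts = numpy.bincount(column)
    probabilities = counts / len(column)
    return 1 - sum(p ** 2 for p in probabilities)

print(calc_gain_ratio(income, "age", "high_income"))
print(calc_gini(income["high_income"]))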
The data set comes from http://archive.ics.uci.edu/ml/datasets/Adult. We can use the scikit-learn package to fit decision trees for us. The interface is very similar to the other algorithms we've used before: the DecisionTreeClassifier class handles classification problems and DecisionTreeRegressor handles regression problems, and both live in the sklearn.tree package. Since we're predicting a binary outcome here, we'll use the classifier. The first step is to train the classifier on the data, using its fit method.
from sklearn.tree import DecisionTreeClassifier
# A list of columns to train with
# We've already converted all columns to numeric
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

# Instantiate the classifier
# Set random_state to 1 to make sure the results are consistent
clf = DecisionTreeClassifier(random_state=1)

# We've already loaded the variable "income," which contains all of the income data
clf.fit(income[columns], income["high_income"])
The output shows clf's parameters. Note criterion='gini': by default scikit-learn uses the Gini index mentioned above rather than entropy.
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')
Create a training set from 80% of the data, and a test set from the rest:
import numpy
import math

# Set a random seed so the shuffle is the same every time
numpy.random.seed(1)

# Shuffle the rows
# This permutes the index randomly using numpy.random.permutation
# Then, it reindexes the dataframe with the result
# The net effect is to put the rows into random order
income = income.reindex(numpy.random.permutation(income.index))

# Use 80% of the rows for training
train_max_row = math.floor(income.shape[0] * .8)

train = income[0:train_max_row]
test = income[train_max_row:]
We can use AUC (area under the ROC curve) to judge whether the model is overfitting: if the AUC on the training set is much higher than on the test set, the tree has overfit.
from sklearn.metrics import roc_auc_score
clf = DecisionTreeClassifier(random_state=1)
clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)
print(test_auc)
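To actually judge overfitting we need a comparison point, so let's also compute the AUC on the training set with the same model (this step is my addition; the depth-limited example below makes the same comparison):

# If the training AUC is much higher than the test AUC, the tree has overfit
train_predictions = clf.predict(train[columns])
print(roc_auc_score(train["high_income"], train_predictions))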
There are three main ways to combat overfitting:

1) Prune the tree.
2) Use an ensemble that blends the predictions of multiple trees.
3) Restrict the depth of the tree while building it.

Restricting the depth is a common approach:
# A decision tree with a limited depth and a minimum number of rows per split
clf = DecisionTreeClassifier(random_state=1, max_depth=7, min_samples_split=13)
clf.fit(train[columns], train["high_income"])
predictions = clf.predict(test[columns])
test_auc = roc_auc_score(test["high_income"], predictions)

train_predictions = clf.predict(train[columns])
train_auc = roc_auc_score(train["high_income"], train_predictions)

print(test_auc)
print(train_auc)