Decision Tree Principles and Code

This article, translated from the dataquest website, introduces the basic principles of decision trees, including the concepts of entropy and information gain and how information gain is used to choose the best feature to split on. Python code shows how to build a decision tree, and the article explains how to prevent overfitting, for example by limiting the tree's depth. It also covers the ID3 algorithm and the steps for applying decision trees to a real problem, such as training and prediction with the scikit-learn library.

Translated from the dataquest website. Since the site is paywalled and I found this material quite good, I translated it for everyone's reference and for my own review. Some additional material comes from Zhou Zhihua's Machine Learning (《机器学习》).

Decision trees are a common family of machine learning methods. Taking binary classification as an example, consider the question "should I fight this bear?":

When we have data on people who survived a face-to-face encounter with a bear, we can use a decision tree to decide how to act so as to maximize our chance of surviving such an encounter.

A decision tree is a supervised learning algorithm: we first build the tree from historical data, then use it to predict an output. Its main advantage over linear regression is that it can discover nonlinear relationships in the data. In the bear example, a decision tree can learn that when the bear is a grizzly, running away never leads to survival, whereas a linear regression has to weigh both factors even though one of them does not actually matter in practice.

First we read in a CSV file about whether a person's income is high. The file can be downloaded from http://archive.ics.uci.edu/ml/datasets/Adult.

import pandas

 

# Set index_col to False to avoid pandas thinking that the first column is row indexes (it's age)

income = pandas.read_csv("income.csv", index_col=False)

 

print(income.head(5))

Next we convert the string-typed columns into numeric category codes:

# Convert a single column from text categories to numbers
# Note: the original code used pandas.Categorical.from_array, which newer pandas versions
# have removed; pandas.Categorical(...) behaves the same way here.
col = pandas.Categorical(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))

col = pandas.Categorical(income["education"])
income["education"] = col.codes

col = pandas.Categorical(income["marital_status"])
income["marital_status"] = col.codes

# Convert the remaining text columns in a loop
for i in ["occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical(income[i])
    income[i] = col.codes

 

print (income.head(5))

It is worth emphasizing how handy pandas.Categorical is here: col.codes turns a column that used to contain strings into numeric category codes.

col = pandas.Categorical(income["workclass"])

income["workclass"] = col.codes

A decision tree is made up of a series of nodes and branches. For example:

As shown above, the node splits the data into two branches, N and Y, based on whether the person works in the private sector (the workclass column). We have already mapped "private sector" to code 4, so N corresponds to workclass != 4 and Y corresponds to workclass == 4.

Going one level deeper:

and so on.

private_incomes = income[income["workclass"] == 4]

public_incomes = income[income["workclass"] != 4]

 

 

 

The nodes at the bottom that are not split any further are called terminal nodes, or leaves. When we decide where to split, we do not do it at random; the goal is to make sure the result can be used for prediction. To achieve that, every leaf must contain only a single value of the target column we are predicting.

In this example, the target column is high_income: it equals 1 when income exceeds 50k per year and 0 otherwise.

So we keep splitting until every node contains only one high_income value.

Now we look at how to choose a split. First we need a splitting criterion; a common one is entropy, which measures how "pure" a node is.

Entropy is defined as follows:
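The original shows the formula as an image; in text form it is the standard Shannon entropy,

$$H(T) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the fraction of rows in the node belonging to class $i$ and $c$ is the number of classes (the code below uses base-2 logarithms, matching this definition).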

Anyone with a bit of information-theory background knows that the higher the entropy, the more disordered, and the less pure, the node.

import math

# Compute the entropy of a node containing two 0s and three 1s, e.g. the list [1,1,0,0,1]

# Passing in 2 as the second parameter to math.log will take a base 2 log

entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))

print(entropy)

 

# Proportion of rows with high_income == 0, computed two equivalent ways
prob_low_income = len(income[income["high_income"] == 0]) / len(income)
prob_low_income_alt = income[income["high_income"] == 0].shape[0] / income.shape[0]
print(prob_low_income, prob_low_income_alt)

prob_high_income = 1 - prob_low_income

# Entropy of the high_income column
income_entropy = -(prob_high_income * math.log(prob_high_income, 2) + prob_low_income * math.log(prob_low_income, 2))

Now we introduce a new quantity, information gain:
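The original shows the formula as an image; in text form it is the standard definition that the code below implements,

$$IG(T, A) = \operatorname{Entropy}(T) - \sum_{v \in \operatorname{values}(A)} \frac{|T_v|}{|T|}\,\operatorname{Entropy}(T_v)$$

where $T$ is the target column, $A$ is the variable we split on, and $T_v$ is the subset of rows for which $A$ takes the value $v$.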

How to understand this intuitively: the information gain IG is computed with respect to a particular target T, here high_income, for a given candidate split A. We first compute the entropy of T; then for each branch v1, v2, ... of A we compute the entropy of T within that branch, weight each branch's entropy by the fraction of rows it contains, and sum them to get the post-split entropy. Subtracting this from the original entropy of T gives the information gain.

import numpy

 

def calc_entropy(column):

    """

    Calculate entropy given a pandas series, list, or numpy array.

    """

    # Compute the counts of each unique value in the column

    counts = numpy.bincount(column)

    # Divide by the total column length to get a probability

    probabilities = counts / len(column)

   

    # Initialize the entropy to 0

    entropy = 0

    # Loop through the probabilities, and add each one to the total entropy

    for prob in probabilities:

        if prob > 0:

            entropy += prob * math.log(prob, 2)

   

    return -entropy

 

# Verify that our function matches our answer from earlier

entropy = calc_entropy([1,1,0,0,1])

print(entropy)

 

# Information gain of a toy split that sends 4 of the 5 rows to one branch and 1 to the other
information_gain = entropy - ((.8 * calc_entropy([1,1,0,0])) + (.2 * calc_entropy([1])))

print(information_gain)

 

median_age = income["age"].median()

 

left_split = income[income["age"] <= median_age]

right_split = income[income["age"] > median_age]

 

age_information_gain = income_entropy - ((left_split.shape[0] / income.shape[0]) * calc_entropy(left_split["high_income"]) + ((right_split.shape[0] / income.shape[0]) * calc_entropy(right_split["high_income"])))

Now we need to decide which variable gives the best information gain when we split on it (that is, which split leaves the purest nodes with the lowest entropy).

def calc_information_gain(data, split_name, target_name):

    """

    Calculate information gain given a data set, column to split on, and target

    """

    # Calculate the original entropy

    original_entropy = calc_entropy(data[target_name])

   

    # Find the median of the column we're splitting

    column = data[split_name]

    median = column.median()

   

    # Make two subsets of the data, based on the median

    left_split = data[column <= median]

    right_split = data[column > median]

   

    # Loop through the splits and calculate the subset entropies

    to_subtract = 0

    for subset in [left_split, right_split]:

        prob = (subset.shape[0] / data.shape[0])

        to_subtract += prob * calc_entropy(subset[target_name])

   

    # Return information gain

    return original_entropy - to_subtract

 

# Verify that our answer is the same as on the last screen

print(calc_information_gain(income, "age", "high_income"))

 

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

 

information_gains = []

 

for col in columns:

    gain = calc_information_gain(income,col,"high_income")  

    information_gains.append(gain)

#print (information_gains,max(information_gains),information_gains == max(information_gains),)

highest_gain = columns[information_gains.index(max(information_gains))]

print(highest_gain)

To handle a single continuous column, we may need to split on it more than once.
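As a minimal sketch of what this means (my own illustration, not from the original), the same continuous column can be split at the overall median and then split again within a subset at that subset's own median:

# First split income on the overall median age
median_age = income["age"].median()
left = income[income["age"] <= median_age]
right = income[income["age"] > median_age]

# The left subset can be split on age again, this time at its own (smaller) median
left_median_age = left["age"].median()
left_left = left[left["age"] <= left_median_age]
left_right = left[left["age"] > left_median_age]
print(median_age, left_median_age)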

 

ID3 is a common algorithm for building decision trees; understanding it involves ideas such as recursion and time complexity. Put simply, recursion breaks a big problem down into many smaller steps: a recursive function calls itself and then combines the pieces into the final result.

Building a decision tree is a good use case for recursion: at each node we call a recursive function, the node splits into two branches, each branch produces a new node, and each new node calls the recursive function again, gradually building out the complete tree.
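As a minimal, generic illustration of recursion (my own example, not from the course), here is a function that sums a nested list by calling itself on each sub-list and combining the results:

def nested_sum(items):
    # Sum a possibly nested list of numbers by recursing into each sub-list
    total = 0
    for item in items:
        if isinstance(item, list):
            total += nested_sum(item)  # the function calls itself on the smaller problem
        else:
            total += item
    return total

print(nested_sum([1, [2, 3], [4, [5]]]))  # prints 15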

The pseudocode:

def id3(data, target, columns)

    1 Create a node for the tree

    2 If all values of the target attribute are 1, Return the node, with label = 1

    3 If all values of the target attribute are 0, Return the node, with label = 0

    4 Using information gain, find A, the column that splits the data best

    5 Find the median value in column A

    6 Split column A into values below or equal to the median (0), and values above the median (1)

    7 For each possible value (0 or 1), vi, of A,

    8    Add a new tree branch below Root that corresponds to rows of data where A = vi

    9    Let Examples(vi) be the subset of examples that have the value vi for A

   10    Below this new branch add the subtree id3(data[A==vi], target, columns)

   11 Return Root

 

Here is an example. Suppose we want to predict high_income from age and marital_status:

high_income    age    marital_status

0              20     0

0              60     2

0              40     1

1              25     1

1              35     2

1              55     1

 

Walking through the pseudocode: lines 2 and 3 do not apply, line 4 computes information gain and picks age as the split column, line 5 finds the median (37.5), and line 6 labels values at or below the median as 0 and values above it as 1. We then enter the loop at line 7 and recurse into id3() at line 10. Call the node we were at before recursing node 1.

Recursing into id3 again gives node 2, whose subset of the data is:

 

high_income    age    marital_status

0              20     0

1              25     1

1              35     2

Recursing one level further (node 3), the best split is again on age, this time at its median of 25, and the left subset is:

high_income    age    marital_status

0              20     0

1              25     1

 

At node 4 we stop at the check near the top of the pseudocode (all target values are the same), then return back up step by step, and the remaining rows get attached to the tree as leaves.
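The calls id3(data, ...) and predict(tree, data.iloc[0]) further below assume a DataFrame named data holding this six-row sample. It is not constructed in the excerpted code, but a minimal sketch would be:

# The six-row sample used in the walkthrough above
data = pandas.DataFrame([
    [0, 20, 0],
    [0, 60, 2],
    [0, 40, 1],
    [1, 25, 1],
    [1, 35, 2],
    [1, 55, 1]
], columns=["high_income", "age", "marital_status"])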

Finding the best column to split on

def find_best_column(data, target_name, columns):

    # Fill in the logic here to automatically find the column in columns to split on

    # data is a dataframe

    # target_name is the name of the target variable

    # columns is a list of potential columns to split on

    gain_feature = []

    for col in columns:

        information_gain = calc_information_gain(data, col, target_name)

        gain_feature.append(information_gain)

    best_gain = gain_feature.index(max(gain_feature))

    return columns[best_gain]

 

# A list of columns to potentially split income with

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

 

income_split = find_best_column(income,"high_income",columns)

 

Now we can store the whole tree rather than just the leaf labels. We will do this with nested dictionaries: a dictionary represents the root node, with "left" and "right" keys for the two branches, a "column" key for the column we split on, a "median" key for the split point, and a "label" key for a leaf's label. We will also use a "number" key to number each node.
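For illustration (a hand-written example, not produced by the code below), a tiny tree that splits once on age at 37.5 could be stored like this:

example_tree = {
    "number": 1,
    "column": "age",
    "median": 37.5,
    "left": {"number": 2, "label": 0},   # rows with age <= 37.5
    "right": {"number": 3, "label": 1}   # rows with age > 37.5
}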

Updated pseudocode:

def id3(data, target, columns, tree)

    1 Create a node for the tree

    2 Number the node

    3 If all of the values of the target attribute are 1, assign 1 to the label key in tree

    4 If all of the values of the target attribute are 0, assign 0 to the label key in tree

    5 Using information gain, find A, the column that splits the data best

    6 Find the median value in column A

    7 Assign the column and median keys in tree

    8 Split A into values less than or equal to the median (0), and values above the median (1)

    9 For each possible value (0 or 1), vi, of A,

   10    Add a new tree branch below Root that corresponds to rows of data where A = vi

   11    Let Examples(vi) be the subset of examples that have the value vi for A

   12    Create a new key with the name corresponding to the side of the split (0=left, 1=right).  The value of this key should be an empty dictionary.

   13    Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])

   14 Return Root

The changed steps relative to the earlier pseudocode (highlighted in red in the original) are the node numbering, the assignment of the column and median keys, the creation of the left/right sub-dictionaries, and passing the sub-dictionary into the recursive call. The Python code:

# Create a dictionary to hold the tree 

# It has to be outside of the function so we can access it later

tree = {}

 

# This list will let us number the nodes 

# It has to be a list so we can access it inside the function

nodes = []

 

def id3(data, target, columns, tree):

    unique_targets = pandas.unique(data[target])

   

    # Assign the number key to the node dictionary

    nodes.append(len(nodes) + 1)

    tree["number"] = nodes[-1]

 

    if len(unique_targets) == 1:

        # Insert code here that assigns the "label" field to the node dictionary

        tree["label"] = unique_targets

        return

   

    best_column = find_best_column(data, target, columns)

    column_median = data[best_column].median()

   

    # Insert code here that assigns the "column" and "median" fields to the node dictionary

    tree["column"] = best_column

    tree["median"] = column_median

    left_split = data[data[best_column] <= column_median]

    right_split = data[data[best_column] > column_median]

    split_dict = [["left", left_split], ["right", right_split]]

   

    for name, split in split_dict:

        tree[name] = {}

        id3(split, target, columns, tree[name])

 

# Call the function on our data to set the counters properly

id3(data, "high_income", ["age", "marital_status"], tree)

Printing the decision tree

def print_with_depth(string, depth):

    # Add space before a string

    prefix = "    " * depth

    # Print a string, and indent it appropriately

    print("{0}{1}".format(prefix, string))

   

   

def print_node(tree, depth):

    # Check for the presence of "label" in the tree

    if "label" in tree:

        # If found, then this is a leaf, so print it and return

        print_with_depth("Leaf: Label {0}".format(tree["label"]), depth)

        # This is critical -- without it, you'll get infinite recursion

        return

    # Print information about what the node is splitting on

    print_with_depth("{0} > {1}".format(tree["column"], tree["median"]), depth)

   

    # Create a list of tree branches

    branches = [tree["left"], tree["right"]]

       

    # Insert code here to recursively call print_node on each branch

    # Don't forget to increment depth when you pass it in

    for b in branches:

        print_node(b,depth+1)

print_node(tree, 0)

The output is as follows:
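(The original shows the printed tree as an image. For the six-row sample DataFrame above it should look roughly like the following; the exact thresholds depend on the data:)

age > 37.5
    age > 25.0
        age > 22.5
            Leaf: Label 0
            Leaf: Label 1
        Leaf: Label 1
    age > 55.0
        age > 47.5
            Leaf: Label 0
            Leaf: Label 1
        Leaf: Label 0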

Making predictions with the decision tree

def predict(tree, row):

    1 Check for the presence of "label" in the tree dictionary

    2    If found, return tree["label"]

    3 Extract tree["column"] and tree["median"]

    4 Check whether row[tree["column"]] is less than or equal to tree["median"]

    5    If it's less than or equal, call predict(tree["left"], row) and return the result

    6    If it's greater, call predict(tree["right"], row) and return the result

The Python code:

def predict(tree, row):

    if "label" in tree:

        return tree["label"]

   

    column = tree["column"]

    median = tree["median"]

   

    # Insert code here to check whether row[column] is less than or equal to median

    # If it's less than or equal, return the result of predicting on the left branch of the tree

    # If it's greater, return the result of predicting on the right branch of the tree

    # Remember to use the return statement to return the result!

    if row[column] <= median:

        return predict(tree["left"],row)

    else:

        return predict(tree["right"],row)

 

# Print the prediction for the first row in our data

print(predict(tree, data.iloc[0]))

Predicting on a batch of new rows

new_data = pandas.DataFrame([

    [40,0],

    [20,2],

    [80,1],

    [15,1],

    [27,2],

    [38,1]

    ])

# Assign column names to the data

new_data.columns = ["age", "marital_status"]

 

def batch_predict(tree, df):

    # Insert your code here

    return df.apply(lambda x: predict(tree, x), axis=1)

 

predictions = batch_predict(tree, new_data)

 

Of course, ID3 is a simple version of a decision tree. The more sophisticated C4.5 uses the gain ratio to choose the splitting attribute, and CART uses the Gini index to choose the best split, but the basic idea is the same.
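For reference, the standard definitions (following Zhou Zhihua's Machine Learning) are

$$\operatorname{Gain\_ratio}(D, a) = \frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}, \qquad \operatorname{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

$$\operatorname{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

where $D^v$ is the subset of $D$ taking value $v$ on attribute $a$ and $p_k$ is the proportion of class $k$ in $D$; CART chooses the split that minimizes the weighted Gini index of the resulting subsets.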

The data set comes from http://archive.ics.uci.edu/ml/datasets/Adult. We can fit a decision tree with the scikit-learn package; its interface is very similar to the other algorithms we have used before. We use the DecisionTreeClassifier class for classification problems and DecisionTreeRegressor for regression problems; both live in the sklearn.tree package. Here we are predicting a binary outcome, so we use the classifier. The first step is to train the classifier on the data, which we do with its fit method.

from sklearn.tree import DecisionTreeClassifier

 

# A list of columns to train with

# We've already converted all columns to numeric

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

 

# Instantiate the classifier

# Set random_state to 1 to make sure the results are consistent

clf = DecisionTreeClassifier(random_state=1)

 

# We've already loaded the variable "income," which contains all of the income data

clf.fit(income[columns],income["high_income"])

The output shows the classifier's parameters:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1, splitter='best')

Create a training set from 80% of the data and a test set from the rest:

import numpy

import math

 

# Set a random seed so the shuffle is the same every time

numpy.random.seed(1)

 

# Shuffle the rows 

# This permutes the index randomly using numpy.random.permutation

# Then, it reindexes the dataframe with the result

# The net effect is to put the rows into random order

income = income.reindex(numpy.random.permutation(income.index))

 

train_max_row = math.floor(income.shape[0] * .8)

 

train = income[0:train_max_row]

test = income[train_max_row:]
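As an aside (not in the original), scikit-learn's train_test_split does the same shuffle-and-split in one call:

from sklearn.model_selection import train_test_split

# Equivalent 80/20 split with a fixed random seed
train, test = train_test_split(income, train_size=0.8, random_state=1)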

Use AUC (area under the ROC curve) to check for overfitting; a large gap between the training AUC and the test AUC indicates overfitting:

from sklearn.metrics import roc_auc_score

 

clf = DecisionTreeClassifier(random_state=1)

clf.fit(train[columns], train["high_income"])

 

predictions = clf.predict(test[columns])

 

test_auc = roc_auc_score(test["high_income"], predictions)

print(test_auc)

 

There are three main ways to deal with overfitting:

1) Pruning the tree.

2) Blending the predictions of multiple trees (see the random-forest sketch after the code block below).

3) Restricting the depth of the tree while building it.

 

Restricting the tree depth is a common approach:

# A depth-limited decision tree to reduce overfitting
clf = DecisionTreeClassifier(random_state=1, max_depth=7, min_samples_split=13)

clf.fit(train[columns], train["high_income"])

predictions = clf.predict(test[columns])

test_auc = roc_auc_score(test["high_income"], predictions)

 

train_predictions = clf.predict(train[columns])

train_auc = roc_auc_score(train["high_income"], train_predictions)

 

print(test_auc)

print(train_auc)
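As a sketch of option 2 above (blending the predictions of multiple trees; not part of the original excerpt), scikit-learn's RandomForestClassifier trains many randomized trees and averages their predictions, which usually reduces overfitting further:

from sklearn.ensemble import RandomForestClassifier

# Parameters here are illustrative, not tuned
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(train[columns], train["high_income"])
rf_predictions = rf.predict(test[columns])
print(roc_auc_score(test["high_income"], rf_predictions))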

 
