Machine Learning Notes: Decision Trees

Classification

  • Decision trees are widely used models for classification and regression tasks. They learn a hierarchy of if/else questions, leading to a decision.
    (In short, the model separates the data through a chain of yes/no questions; for numeric data, each question asks whether a feature is greater or less than some threshold.)

  • A leaf of the tree that contains data points that all share the same target value is called pure. (But splitting too finely makes the tree fixate on individual outliers; oops, that is overfitting.)

  • How to control the complexity of decision trees: two common strategies are stopping the creation of the tree early (also known as pre-pruning, e.g. limiting the maximum depth or the number of leaves), and building the full tree but then removing or collapsing nodes that contain little information (also known as post-pruning, or just pruning). See the sketch after this list.
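In scikit-learn, the pre-pruning limits map directly onto constructor arguments of the tree estimators. A minimal sketch, assuming nothing beyond the standard API; the values shown are arbitrary illustrations, not recommendations:

from sklearn.tree import DecisionTreeClassifier

'''Each argument below is a pre-pruning limit: growth stops once it is hit'''
tree = DecisionTreeClassifier(max_depth=4,          # no root-to-leaf path longer than 4 splits
                              max_leaf_nodes=20,    # at most 20 leaves in total
                              min_samples_leaf=5)   # each leaf must keep at least 5 samples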

In Python these are implemented mainly via scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import export_graphviz
from sklearn.datasets import load_breast_cancer, make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import LinearSVC
'''graphviz renders the exported .dot file, which makes the tree model visualizable'''
import graphviz
import matplotlib.pyplot as plt
'''matplotlib's color utilities'''
from matplotlib.colors import ListedColormap
import numpy as np
import mglearn
import pandas as pd
import os

First, an example on a randomly generated dataset (compare it with the KNN and linear-model classifiers):

'''Generate two interleaved half-moon clusters of labeled points'''
data = make_moons(noise=0.35, random_state=42)
x, y = data

'''Fit the classifier passed in and plot its decision boundary; the other three
methods keep their default hyperparameters (n_neighbors, alpha, C, ...)'''
def Classifier(func, df_x, df_y, ax, title):
    '''Same basic pattern as the earlier methods'''
    model = func().fit(df_x, df_y)
    '''As the name suggests, draws the 2D decision boundary'''
    mglearn.plots.plot_2d_separator(model, df_x, fill=True, cm=ListedColormap(['#EE82EE', '#87CEFA']), ax=ax)
    mglearn.discrete_scatter(df_x[:, 0], df_x[:, 1], df_y, ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Feature0')
    ax.set_ylabel('Feature1')
    return model

fig, axes = plt.subplots(1, 4, figsize=(64/3, 3))
for func, ax, title in zip([DecisionTreeClassifier, KNeighborsClassifier, LogisticRegression, LinearSVC],
                           axes,
                           ['DecisionTree', 'K-NearestNeighbors', 'LogisticRegression', 'LinearSVC']):
    model = Classifier(func, x, y, ax, title)
axes[0].legend(['Type1', 'Type2'])
plt.show()

You can see that the decision tree partitions the space quite finely; the KNN partition is roughly right apart from a few stray points; logistic regression and the linear SVM do not fit as well. The decision tree's test score is also clearly higher than those of the other three.
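To back up the score comparison numerically, you can hold out part of the moons data. A minimal sketch reusing x, y and the imports above; the default split ratio and random_state=0 are my own choices:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
for func, name in zip([DecisionTreeClassifier, KNeighborsClassifier, LogisticRegression, LinearSVC],
                      ['DecisionTree', 'K-NearestNeighbors', 'LogisticRegression', 'LinearSVC']):
    model = func().fit(x_train, y_train)
    print('%s: train %.2f, test %.2f' % (name, model.score(x_train, y_train), model.score(x_test, y_test)))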

Next, the breast cancer dataset (a more complex example):

cancer = load_breast_cancer()
'''Split into training and test sets'''
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
print('Training Data Accuracy:%s' % (tree.score(x_train, y_train)), 'Testing Data Accuracy:%s' % (tree.score(x_test, y_test)))
'''
The accuracy on the training set is 100% because the leaves are pure; test accuracy is slightly worse than that of linear models.
An unpruned tree like this is likely overfitting, so we need to restrict max_depth.
'''
'''Save the fitted tree as a .dot file and render it'''
export_graphviz(tree, out_file='tree.dot', class_names=['malignant', 'benign'], feature_names=cancer.feature_names,
                impurity=False, filled=True)
with open('tree.dot') as file:
    dot_graph = file.read()
'''Open the rendered graph in a viewer (printing the Source object would only show the raw dot text)'''
graphviz.Source(dot_graph).view()


The training accuracy reached 100% because every final leaf contains data of a single class. But an unpruned tree like this (meaning: letting it grow wild is no good, its growing space has to be restricted by hand) is very likely to overfit, since splitting only stops once every leaf holds a single class, as the rendered tree shows.
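Just how large the unpruned tree has grown can be queried directly from the fitted model. A quick check using the tree object above; get_depth and get_n_leaves require scikit-learn 0.21 or later:

'''Report the depth and the number of leaves of the unpruned tree'''
print('Depth:', tree.get_depth(), 'Leaves:', tree.get_n_leaves())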

Generally speaking, pruning falls into pre-pruning and post-pruning. scikit-learn only provides pre-pruning (limiting the height of the tree, limiting the number of leaves).

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(x_train, y_train)
print('Training Data Accuracy:%s' % (tree.score(x_train, y_train)), 'Testing Data Accuracy:%s' % (tree.score(x_test, y_test)))

export_graphviz(tree, out_file='tree.dot', class_names=['malignant', 'benign'], feature_names=cancer.feature_names,
                impurity=False, filled=True)
with open('tree.dot') as file:
    dot_graph = file.read()
graphviz.Source(dot_graph).view()


After pruning, the performance on the training set drops somewhat, but the performance on the test set improves.
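Rather than guessing max_depth = 4, you can sweep the depth and watch both scores move. A minimal sketch reusing the split above; the range 1 to 10 is an arbitrary choice:

'''Train one pruned tree per depth and compare training vs. test accuracy'''
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_train, y_train)
    print('max_depth=%d: train %.3f, test %.3f' % (depth, t.score(x_train, y_train), t.score(x_test, y_test)))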


So how are the tree's split nodes chosen? Roughly speaking, by how much each feature helps separate the targets, which the fitted model exposes as feature importances (my own loose understanding).

'''Feature importance'''
print('Feature Importance:%s' % (tree.feature_importances_))

'''Plot each feature's importance as a horizontal bar chart'''
n_features = cancer.data.shape[1]
plt.barh(range(n_features), tree.feature_importances_)
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.ylim(-1, n_features)
plt.show()

You can see that worst radius carries the largest weight (you can read it as one of the features in this dataset that most clearly signals breast cancer?), which is why the root node splits on worst radius first.
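The same ranking can be read off numerically instead of from the bar chart. A small sketch using the fitted tree above; showing the top 5 is an arbitrary choice:

'''Sort the features by importance, largest first, and print the top 5'''
order = np.argsort(tree.feature_importances_)[::-1]
for i in order[:5]:
    print('%-25s %.3f' % (cancer.feature_names[i], tree.feature_importances_[i]))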


Regression

  • Not able to extrapolate, i.e. to make predictions outside the range of the training data; the sketch below demonstrates this.
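A toy illustration of this limitation before the real example; a minimal sketch in which the upward-trending data and the query points are entirely made up:

'''Fit a regression tree on x in [0, 10]; any query beyond 10 falls past every
learned threshold into the rightmost leaf, so the prediction stays constant'''
x_toy = np.linspace(0, 10, 100)[:, np.newaxis]
y_toy = x_toy.ravel() + np.sin(x_toy).ravel()
reg = DecisionTreeRegressor().fit(x_toy, y_toy)
print(reg.predict([[9.9], [15.0], [100.0]]))  # the last two return the same value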

The following analysis uses the history of RAM prices.

ram_prices = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, 'ram_price.csv'))
plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel('Year')
plt.ylabel('Price in $/Mbyte')
plt.show()


Now fit the data with both a decision tree and LinearRegression:

'''Use prices before 2000 as the training set and from 2000 onward as the test set'''
data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]
'''Reshape the dates into a column vector, as scikit-learn expects 2D inputs'''
x_train = data_train.date.to_numpy()[:, np.newaxis]
'''Log-transform the prices so their relationship with the date is roughly linear'''
y_train = np.log(data_train.price)

tree = DecisionTreeRegressor().fit(x_train, y_train)
lr = LinearRegression().fit(x_train, y_train)

'''Predict over the full date range, then undo the log transform'''
X = ram_prices.date.to_numpy()[:, np.newaxis]
pred_tree = tree.predict(X)
pred_lr = lr.predict(X)

price_tree = np.exp(pred_tree)
price_lr = np.exp(pred_lr)

plt.semilogy(data_train.date, data_train.price, 'b-', label='Training data')
plt.semilogy(data_test.date, data_test.price, 'c-', label='Test data')
plt.semilogy(ram_prices.date, price_tree, 'g-.', label='Tree prediction')
plt.semilogy(ram_prices.date, price_lr, 'r--', label='Linear prediction')
plt.legend()
plt.show()

You can see that the tree's predictions fit the training set almost perfectly, but beyond the range of the training data it is useless (champion at fitting, hopeless at forecasting). LinearRegression provides a good prediction across the whole range, albeit with some error.
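The flat extrapolation can be confirmed directly; a quick check reusing price_tree and ram_prices from above (not part of the original analysis):

'''Every date from 2000 on lies past all split thresholds learned before 2000,
so all of them land in the same leaf and receive one identical prediction'''
print(np.unique(price_tree[ram_prices.date >= 2000]))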

Strengths and Weaknesses:

  • Strengths: the resulting models can easily be visualized and understood
    by non-experts, and the algorithm is completely invariant to the scaling
    of the data.

  • Weaknesses: even with the use of pre-pruning, trees tend to
    overfit and provide poor generalization performance.
