Machine Learning Notes: Decision Trees

Classification

  • Decision trees are widely used models for classification and regression tasks. They learn a hierarchy of if/else questions, leading to a decision.
    (In short, the model separates the data through a chain of yes/no questions; for numeric data, each question asks whether a feature is greater or less than some threshold.)

  • A leaf of the tree that contains data points that all share the same target value is called pure. (But splitting too finely makes the tree fixate on individual outliers; oops, that is overfitting.)

  • How to control the complexity of decision trees: two common strategies are stopping the creation of the tree early (also known as pre-pruning, e.g. limiting the maximum depth or the number of leaves), and building the full tree but then removing or collapsing nodes that contain little information (also known as post-pruning, or just pruning). See the sketch after this list.
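In scikit-learn, the pre-pruning limits map directly onto constructor arguments of the tree estimators. A minimal sketch, assuming nothing beyond the standard API; the values shown are arbitrary illustrations, not recommendations:

from sklearn.tree import DecisionTreeClassifier

'''Each argument below is a pre-pruning limit: growth stops once it is hit'''
tree = DecisionTreeClassifier(max_depth=4,          # no root-to-leaf path longer than 4 splits
                              max_leaf_nodes=20,    # at most 20 leaves in total
                              min_samples_leaf=5)   # each leaf must keep at least 5 samples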

In Python these are implemented mainly via scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import export_graphviz
from sklearn.datasets import load_breast_cancer, make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import LinearSVC
'''graphviz renders the exported .dot file, which makes the tree model visualizable'''
import graphviz
import matplotlib.pyplot as plt
'''matplotlib's color utilities'''
from matplotlib.colors import ListedColormap
import numpy as np
import mglearn
import pandas as pd
import os

First, an example on a randomly generated dataset (compare it with the KNN and linear-model classifiers):

'''Generate two interleaved half-moon clusters of labeled points'''
data = make_moons(noise=0.35, random_state=42)
x, y = data

'''Fit the classifier passed in and plot its decision boundary; the other three
methods keep their default hyperparameters (n_neighbors, alpha, C, ...)'''
def Classifier(func, df_x, df_y, ax, title):
    '''Same basic pattern as the earlier methods'''
    model = func().fit(df_x, df_y)
    '''As the name suggests, draws the 2D decision boundary'''
    mglearn.plots.plot_2d_separator(model, df_x, fill=True, cm=ListedColormap(['#EE82EE', '#87CEFA']), ax=ax)
    mglearn.discrete_scatter(df_x[:, 0], df_x[:, 1], df_y, ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Feature0')
    ax.set_ylabel('Feature1')
    return model

fig, axes = plt.subplots(1, 4, figsize=(64/3, 3))
for func, ax, title in zip([DecisionTreeClassifier, KNeighborsClassifier, LogisticRegression, LinearSVC],
                           axes,
                           ['DecisionTree', 'K-NearestNeighbors', 'LogisticRegression', 'LinearSVC']):
    model = Classifier(func, x, y, ax, title)
axes[0].legend(['Type1', 'Type2'])
plt.show()

You can see that the decision tree partitions the space quite finely; the KNN partition is roughly right apart from a few stray points; logistic regression and the linear SVM do not fit as well. The decision tree's test score is also clearly higher than those of the other three.
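To back up the score comparison numerically, you can hold out part of the moons data. A minimal sketch reusing x, y and the imports above; the default split ratio and random_state=0 are my own choices:

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
for func, name in zip([DecisionTreeClassifier, KNeighborsClassifier, LogisticRegression, LinearSVC],
                      ['DecisionTree', 'K-NearestNeighbors', 'LogisticRegression', 'LinearSVC']):
    model = func().fit(x_train, y_train)
    print('%s: train %.2f, test %.2f' % (name, model.score(x_train, y_train), model.score(x_test, y_test)))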

Next, the breast cancer dataset (a more complex example):

cancer = load_breast_cancer()
'''Split into training and test sets'''
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
print('Training Data Accuracy:%s' % (tree.score(x_train, y_train)), 'Testing Data Accuracy:%s' % (tree.score(x_test, y_test)))
'''
The accuracy on the training set is 100% because the leaves are pure; test accuracy is slightly worse than that of linear models.
An unpruned tree like this is likely overfitting, so we need to restrict max_depth.
'''
'''Save the fitted tree as a .dot file and render it'''
export_graphviz(tree, out_file='tree.dot', class_names=['malignant', 'benign'], feature_names=cancer.feature_names,
                impurity=False, filled=True)
with open('tree.dot') as file:
    dot_graph = file.read()
'''Open the rendered graph in a viewer (printing the Source object would only show the raw dot text)'''
graphviz.Source(dot_graph).view()


The training accuracy reached 100% because every final leaf contains data of a single class. But an unpruned tree like this (meaning: letting it grow wild is no good, its growing space has to be restricted by hand) is very likely to overfit, since splitting only stops once every leaf holds a single class, as the rendered tree shows.
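Just how large the unpruned tree has grown can be queried directly from the fitted model. A quick check using the tree object above; get_depth and get_n_leaves require scikit-learn 0.21 or later:

'''Report the depth and the number of leaves of the unpruned tree'''
print('Depth:', tree.get_depth(), 'Leaves:', tree.get_n_leaves())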

Generally speaking, pruning falls into pre-pruning and post-pruning. scikit-learn only provides pre-pruning (limiting the height of the tree, limiting the number of leaves).

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(x_train, y_train)
print('Training Data Accuracy:%s' % (tree.score(x_train, y_train)), 'Testing Data Accuracy:%s' % (tree.score(x_test, y_test)))

export_graphviz(tree, out_file='tree.dot', class_names=['malignant', 'benign'], feature_names=cancer.feature_names,
                impurity=False, filled=True)
with open('tree.dot') as file:
    dot_graph = file.read()
graphviz.Source(dot_graph).view()


After pruning, the performance on the training set drops somewhat, but the performance on the test set improves.
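Rather than guessing max_depth = 4, you can sweep the depth and watch both scores move. A minimal sketch reusing the split above; the range 1 to 10 is an arbitrary choice:

'''Train one pruned tree per depth and compare training vs. test accuracy'''
for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_train, y_train)
    print('max_depth=%d: train %.3f, test %.3f' % (depth, t.score(x_train, y_train), t.score(x_test, y_test)))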


So how are the tree's split nodes chosen? Roughly speaking, by how much each feature helps separate the targets, which the fitted model exposes as feature importances (my own loose understanding).

'''Feature importance'''
print('Feature Importance:%s' % (tree.feature_importances_))

'''Plot each feature's importance as a horizontal bar chart'''
n_features = cancer.data.shape[1]
plt.barh(range(n_features), tree.feature_importances_)
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.ylim(-1, n_features)
plt.show()

You can see that worst radius carries the largest weight (you can read it as one of the features in this dataset that most clearly signals breast cancer?), which is why the root node splits on worst radius first.
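The same ranking can be read off numerically instead of from the bar chart. A small sketch using the fitted tree above; showing the top 5 is an arbitrary choice:

'''Sort the features by importance, largest first, and print the top 5'''
order = np.argsort(tree.feature_importances_)[::-1]
for i in order[:5]:
    print('%-25s %.3f' % (cancer.feature_names[i], tree.feature_importances_[i]))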


Regression

  • Not able to extrapolate, i.e. to make predictions outside the range of the training data; the sketch below demonstrates this.
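A toy illustration of this limitation before the real example; a minimal sketch in which the upward-trending data and the query points are entirely made up:

'''Fit a regression tree on x in [0, 10]; any query beyond 10 falls past every
learned threshold into the rightmost leaf, so the prediction stays constant'''
x_toy = np.linspace(0, 10, 100)[:, np.newaxis]
y_toy = x_toy.ravel() + np.sin(x_toy).ravel()
reg = DecisionTreeRegressor().fit(x_toy, y_toy)
print(reg.predict([[9.9], [15.0], [100.0]]))  # the last two return the same value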

The following analysis uses the history of RAM prices.

ram_prices = pd.read_csv(os.path.join(mglearn.datasets.DATA_PATH, 'ram_price.csv'))
plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel('Year')
plt.ylabel('Price in $/Mbyte')
plt.show()


Now fit the data with both a decision tree and LinearRegression:

'''Use prices before 2000 as the training set and from 2000 onward as the test set'''
data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]
'''Reshape the dates into a column vector, as scikit-learn expects 2D inputs'''
x_train = data_train.date.to_numpy()[:, np.newaxis]
'''Log-transform the prices so their relationship with the date is roughly linear'''
y_train = np.log(data_train.price)

tree = DecisionTreeRegressor().fit(x_train, y_train)
lr = LinearRegression().fit(x_train, y_train)

'''Predict over the full date range, then undo the log transform'''
X = ram_prices.date.to_numpy()[:, np.newaxis]
pred_tree = tree.predict(X)
pred_lr = lr.predict(X)

price_tree = np.exp(pred_tree)
price_lr = np.exp(pred_lr)

plt.semilogy(data_train.date, data_train.price, 'b-', label='Training data')
plt.semilogy(data_test.date, data_test.price, 'c-', label='Test data')
plt.semilogy(ram_prices.date, price_tree, 'g-.', label='Tree prediction')
plt.semilogy(ram_prices.date, price_lr, 'r--', label='Linear prediction')
plt.legend()
plt.show()

You can see that the tree's predictions fit the training set almost perfectly, but beyond the range of the training data it is useless (champion at fitting, hopeless at forecasting). LinearRegression provides a good prediction across the whole range, albeit with some error.
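The flat extrapolation can be confirmed directly; a quick check reusing price_tree and ram_prices from above (not part of the original analysis):

'''Every date from 2000 on lies past all split thresholds learned before 2000,
so all of them land in the same leaf and receive one identical prediction'''
print(np.unique(price_tree[ram_prices.date >= 2000]))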

Strengths and Weaknesses:

  • Strengths: the resulting models can easily be visualized and understood
    by non-experts, and the algorithm is completely invariant to the scaling
    of the data.

  • Weaknesses: even with the use of pre-pruning, trees tend to
    overfit and provide poor generalization performance.
