Supervised Learning | Grid Search for Decision Trees


Related articles:

Machine Learning | Contents

Supervised Learning | ID3 Decision Tree: Principles and Python Implementation

Supervised Learning | ID3 & C4.5 Decision Tree Principles

Supervised Learning | CART Classification and Regression Tree Principles

Supervised Learning | Decision Trees with Sklearn

Supervised Learning | Grid Search for Decision Trees

1. Improving the Model with Grid Search

In this post, we will fit a decision tree model to some sample data. This initial model will overfit. We will then use grid search to find better parameters for the model and reduce the overfitting.

First, import the required libraries:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1.1 Loading the Data

First, define a function that reads the csv data and visualizes it:

def load_pts(csv_name):
    # Read the csv (no header): columns 0-1 are the features, column 2 is the label
    data = np.asarray(pd.read_csv(csv_name, header=None))
    X = data[:,0:2]
    y = data[:,2]

    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(),0], X[np.argwhere(y==0).flatten(),1],s = 50, color = 'blue', edgecolor = 'k')
    plt.scatter(X[np.argwhere(y==1).flatten(),0], X[np.argwhere(y==1).flatten(),1],s = 50, color = 'red', edgecolor = 'k')
    
    plt.xlim(-2.05,2.05)
    plt.ylim(-2.05,2.05)
    plt.grid(False)
    plt.tick_params(
        axis='x',
        which='both',
        bottom=False,
        top=False)

    return X,y

X, y = load_pts('Data/data.csv')
plt.show()

1.2 Splitting the Data into Training and Test Sets

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, make_scorer

# Fix a random seed (note: the split below is actually made reproducible by
# random_state=42; Python's random module does not affect sklearn)
import random
random.seed(42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
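
As a quick sanity check, test_size=0.2 should leave 80% of the points for training and 20% for testing:

print(X_train.shape, X_test.shape)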

1.3 Fitting a Decision Tree Model

from sklearn.tree import DecisionTreeClassifier

# Define the model (with default hyperparameters)
clf = DecisionTreeClassifier(random_state=42)

# Fit the model
clf.fit(X_train, y_train)

# Make predictions
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)

Now let's visualize the model and compute its f1_score. First, define the visualization function:

def plot_model(X, y, clf):
    
    # Scatter plot of the two classes
    plt.scatter(X[np.argwhere(y==0).flatten(),0],X[np.argwhere(y==0).flatten(),1],s = 50, color = 'blue', edgecolor = 'k')
    plt.scatter(X[np.argwhere(y==1).flatten(),0],X[np.argwhere(y==1).flatten(),1],s = 50, color = 'red', edgecolor = 'k')

    # Figure settings
    plt.xlim(-2.05,2.05)
    plt.ylim(-2.05,2.05)
    plt.grid(False)
    plt.tick_params(
        axis='x',
        which='both',
        bottom=False,
        top=False)

    # Use np.meshgrid(r,r) to generate the x- and y-coordinates of a grid covering the plane
    r = np.linspace(-2.1,2.1,300)
    s,t = np.meshgrid(r,r)
    
    # Reshape the coordinates into the same format as the decision tree's training data
    s = np.reshape(s,(np.size(s),1))
    t = np.reshape(t,(np.size(t),1))
    h = np.concatenate((s,t),1)

    # Predict the class of every point on the grid
    z = clf.predict(h)

    # Reshape the coordinates and the predicted classes back into matrices
    s = s.reshape((np.size(r),np.size(r)))
    t = t.reshape((np.size(r),np.size(r)))
    z = z.reshape((np.size(r),np.size(r)))

    # Use plt.contourf to fill the two decision regions
    plt.contourf(s,t,z,colors = ['blue','red'],alpha = 0.2,levels = range(-1,2))
    
    # Draw the decision boundary
    if len(np.unique(z)) > 1:
        plt.contour(s,t,z,colors = 'k', linewidths = 2)
    plt.show()
plot_model(X, y, clf)
print('The Training F1 Score is', f1_score(y_train, train_predictions))
print('The Testing F1 Score is', f1_score(y_test, test_predictions))
The Training F1 Score is 1.0
The Testing F1 Score is 0.7000000000000001

The training score is 1 while the test score is 0.7, so the current model clearly overfits. Next, we will use grid search to tune the parameters.
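
Before tuning, it helps to see how complex the unconstrained tree actually is. A minimal check, assuming scikit-learn 0.21 or later (where get_depth and get_n_leaves are available):

# Inspect the complexity of the unconstrained tree
print('Tree depth:', clf.get_depth())
print('Number of leaves:', clf.get_n_leaves())

By default the tree keeps splitting until every leaf is pure, which is exactly what the perfect training score suggests.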

1.4 Improving the Model with Grid Search

Now we will perform the following steps:

1. First, define some parameters to perform grid search over: max_depth, min_samples_leaf, and min_samples_split.

2. Use f1_score to make a scorer for the model.

3. Using the parameters and the scorer, perform grid search on the classifier.

4. Fit the data to the new classifier.

5. Plot the model and find the f1_score.

6. If the model is not good enough, change the ranges of the parameters and fit again.

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)

# Build the parameter grid
parameters = {'max_depth':[2,4,6,8,10],'min_samples_leaf':[2,4,6,8,10], 'min_samples_split':[2,4,6,8,10]}

# Define the scorer
scorer = make_scorer(f1_score)

# Create the grid search object
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Run the grid search
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best decision tree model
best_clf = grid_fit.best_estimator_

# Fit the best model (redundant here, since GridSearchCV refits the best
# estimator on the full training set by default, but harmless)
best_clf.fit(X_train, y_train)

# Make predictions on the training and test sets
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# Compute the training and test scores
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))

# Visualize the model
plot_model(X, y, best_clf)

# Inspect the best model's parameter settings
best_clf
The training F1 Score is 0.8148148148148148
The testing F1 Score is 0.8
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')

From the output we can see that the best parameters are:

max_depth=4

min_samples_leaf=2

min_samples_split=2

Compared with the first plot, the boundary is also simpler, which means the model is less likely to overfit.
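
The chosen combination and its mean cross-validated score can also be read directly from the fitted grid search object:

# Best parameter combination found by the grid search
print(grid_fit.best_params_)   # {'max_depth': 4, 'min_samples_leaf': 2, 'min_samples_split': 2}
print(grid_fit.best_score_)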

1.5 Visualizing the Cross-Validation Results

First, let's look at the results for the different parameter combinations (the split0-split2 columns below come from the 3-fold cross-validation GridSearchCV performed):

results = pd.DataFrame(grid_obj.cv_results_)
results.T
An abbreviated view of the transposed results table:

                         0                      1                      ...  124
mean_fit_time            0.000537               0.000610               ...  0.000585
std_fit_time             7.36e-05               0.000218               ...  0.000169
mean_score_time          0.001245               0.002092               ...  0.001085
std_score_time           0.000460               0.001318               ...  0.000156
param_max_depth          2                      2                      ...  10
param_min_samples_leaf   2                      2                      ...  10
param_min_samples_split  2                      4                      ...  10
params                   {'max_depth': 2, ...}  {'max_depth': 2, ...}  ...  {'max_depth': 10, ...}
split0_test_score        0.642857               0.642857               ...  0.642857
split1_test_score        0.764706               0.764706               ...  0.5
split2_test_score        0.709677               0.709677               ...  0.666667
mean_test_score          0.705698               0.705698               ...  0.602381
std_test_score           0.050131               0.050131               ...  0.073714
rank_test_score          14                     14                     ...  62

14 rows × 125 columns
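
Rather than scanning the transposed table by eye, it is often easier to sort the results by rank, for example:

# Show the best-ranked parameter combinations first
results.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score']].head()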

Next, let's look at how the minimum number of samples per leaf (min_samples_leaf) and the minimum number of samples required to split a node (min_samples_split) affect the decision tree's generalization performance at different maximum depths (max_depth).

First, define a function that plots a heatmap for a given maximum depth (this requires mglearn):

import mglearn

def hotmap(max_depth, results):
    # Keep only the rows for the given max_depth, then reshape the 25 scores into a 5x5 grid
    filtered = results[results['param_max_depth']==max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    mglearn.tools.heatmap(scores, xlabel='min_samples_split', xticklabels=parameters['min_samples_split'],
                      ylabel='min_samples_leaf', yticklabels=parameters['min_samples_leaf'], cmap="viridis")
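
If mglearn is not installed, a matplotlib-only substitute is easy to sketch (hotmap_plain is a hypothetical helper, not part of the original; mglearn.tools.heatmap draws essentially the same plot plus per-cell annotations):

def hotmap_plain(max_depth, results):
    # Same filtering and reshaping as above, drawn with plt.imshow instead
    filtered = results[results['param_max_depth'] == max_depth]
    scores = np.array(filtered['mean_test_score']).reshape(5, 5)
    plt.imshow(scores, cmap='viridis')
    plt.xlabel('min_samples_split')
    plt.ylabel('min_samples_leaf')
    plt.xticks(range(5), parameters['min_samples_split'])
    plt.yticks(range(5), parameters['min_samples_leaf'])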

Draw the heatmaps as subplots:

import matplotlib.pyplot as plt
plt.figure(figsize=(20, 20))
for i in [1,2,3,4,5]:
    plt.subplot(1,5,i, title='max_depth={}'.format(2*i))
    hotmap(2*i, results)

From the heatmaps we can see that the minimum samples per split (min_samples_split) has almost no effect on the model, while the score gradually drops as the maximum depth (max_depth) increases.
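
The same pattern can be checked numerically by averaging the cross-validated score over each value of each hyperparameter:

# Average mean_test_score for every value of each hyperparameter
for param in ['param_max_depth', 'param_min_samples_leaf', 'param_min_samples_split']:
    print(results.groupby(param)['mean_test_score'].mean())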

1.6 Summary

By using grid search, we improved the F1 score from 0.7 to 0.8 (at the cost of some training score, which is fine). Also, if you look at the plots, the second model's boundary is simpler, which means it is less likely to overfit.
