Decision Trees: A Hands-on Mini-Project

Required libraries

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import sklearn

print('pandas:',pd.__version__)
print('matplotlib:',matplotlib.__version__)
print('numpy:',np.__version__)
print('sklearn:',sklearn.__version__)
pandas: 0.23.4
matplotlib: 2.2.3
numpy: 1.16.4
sklearn: 0.22.2.post1

Importing the data

The original data is the Titanic survival-prediction dataset. The data imported here has already been preprocessed (missing values were filled in and the categorical features were expanded into dummy variables).
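The preprocessing script itself is not shown here; for reference, a minimal sketch of how such a train_fixed.csv could be produced from the raw Kaggle train.csv (the file paths, fill strategies and column choices below are assumptions, not the author's actual pipeline):

raw = pd.read_csv('data/train.csv')                           # raw Kaggle Titanic training data (assumed path)
raw = raw.dropna(subset=['Embarked'])                         # two rows lack Embarked -> 889 rows remain
raw['Age'] = raw['Age'].fillna(raw['Age'].median())           # fill missing ages with the median
raw['Cabin'] = np.where(raw['Cabin'].isnull(), 'No', 'Yes')   # reduce Cabin to known / unknown
dummies = pd.get_dummies(raw[['Cabin', 'Embarked', 'Sex']])   # one-hot encode the categorical columns
pclass = pd.get_dummies(raw['Pclass'], prefix='Pclass')
survived = pd.get_dummies(raw['Survived'], prefix='Survived')
fixed = pd.concat([raw[['Age', 'SibSp', 'Parch', 'Fare']], dummies, pclass, survived], axis=1)
fixed.to_csv('data/train_fixed.csv')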

data_train = pd.read_csv('data/train_fixed.csv')
data_train.info()
X = data_train.iloc[:,1:15]   # features
y = data_train.iloc[:,-2:]    # labels (one-hot: Survived_0, Survived_1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 17 columns):
Unnamed: 0    889 non-null int64
Age           889 non-null float64
SibSp         889 non-null int64
Parch         889 non-null int64
Fare          889 non-null float64
Cabin_No      889 non-null int64
Cabin_Yes     889 non-null int64
Embarked_C    889 non-null int64
Embarked_Q    889 non-null int64
Embarked_S    889 non-null int64
Sex_female    889 non-null int64
Sex_male      889 non-null int64
Pclass_1      889 non-null int64
Pclass_2      889 non-null int64
Pclass_3      889 non-null int64
Survived_0    889 non-null int64
Survived_1    889 non-null int64
dtypes: float64(2), int64(15)
memory usage: 118.1 KB

Inspecting the feature information of the training and test sets shows 889 training samples and 418 test samples, with no missing values; all features are numeric, which makes them easy to work with.
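The preprocessed test set (assumed here to be saved as data/test_fixed.csv with the same feature columns) can be inspected the same way:

data_test = pd.read_csv('data/test_fixed.csv')   # assumed file name for the preprocessed test set
data_test.info()                                  # expected: 418 entries, no missing values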

Task Ⅰ

Use a random forest for prediction, observe how different tree depths and numbers of trees affect the results, and summarize.

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")
num_trees = [25, 50, 100, 200, 300, 500, 700, 900, 1000]
max_depths = [1, 3, 5, 10, 15, 20]
seed = 124
kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=seed)
plt.figure(figsize=(16,14))
for max_depth in max_depths:
    train_scores = []
    test_scores = []
    for num_tree in num_trees:
        rf_model = RandomForestClassifier(n_estimators=num_tree, max_depth=max_depth)
        # 3-fold cross-validated accuracy is used as the "test" score
        test_scores.append(np.mean(model_selection.cross_val_score(rf_model, X, y, cv=kfold)))
        # refit on the full training set and record the training accuracy
        rf_model.fit(X,y)
        y_pdt = rf_model.predict(X)
        train_scores.append(np.mean(np.equal(np.argmax(np.array(y),1),np.argmax(np.array(y_pdt),1))))
    plt.subplot(211)
    plt.plot(num_trees,train_scores,'-^',label='max_depth_'+ str(max_depth))
    plt.title('train_scores')
    plt.legend()
    plt.xlabel('Number of trees')
    plt.ylabel('Accuracy')
    plt.subplot(212)
    plt.plot(num_trees,test_scores,'-^',label='max_depth_'+ str(max_depth))
    plt.title('test_scores')
    plt.legend()
    plt.xlabel('Number of trees')
    plt.ylabel('Accuracy')

[Figure: training-set accuracy (top) and cross-validation accuracy (bottom) versus number of trees, one curve per max_depth value]

Summary: comparing the training and test accuracies in the figure above, the random forest clearly overfits; pre-pruning the individual decision trees would improve this. On both the training and test sets, the accuracy stabilizes as the number of trees grows rather than increasing linearly, so this parameter should not be chosen too small; on the other hand, if runtime matters, it should not be too large either, since a very large forest takes much longer to train. As for max_depth, the training accuracy clearly increases with the maximum depth, whereas on the test set a depth of 10 gives roughly the best accuracy. This shows that larger tree depths lead to more severe overfitting.
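As a sketch of the pre-pruning mentioned above, the individual trees can be constrained via min_samples_leaf / min_samples_split in addition to max_depth (the values below are illustrative, not tuned):

pruned_rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                                   min_samples_leaf=5, min_samples_split=10,
                                   random_state=seed)
cv_acc = np.mean(model_selection.cross_val_score(pruned_rf, X, y, cv=kfold))
pruned_rf.fit(X, y)
print('pre-pruned RF - train accuracy: {:.3f}, cv accuracy: {:.3f}'.format(pruned_rf.score(X, y), cv_acc))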

Task Ⅱ

Use scikit-learn's GBDT or XGBoost for prediction and observe how the number of trees affects the results.

from sklearn.ensemble import GradientBoostingClassifier

# convert the one-hot labels (Survived_0, Survived_1) back to a single 0/1 label
y_train = []
for i in range(len(y)):
    if y.iloc[i,0]==1:      # Survived_0 == 1  ->  did not survive
        y_train.append(0)
    else:
        y_train.append(1)
y_train = np.array(y_train).reshape(-1,1)

gmc = GradientBoostingClassifier()
test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
gmc.fit(X,y_train)
y_pdt = gmc.predict(X)
train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
print('With default settings, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(train_scores*100,test_scores*100))
With default settings, GBDT training accuracy: 100.000%, test accuracy: 81.999%

With the default settings the training accuracy is very high (100%) while the test accuracy is much lower, i.e. the model overfits quite severely.
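Besides the number of trees explored below, GBDT overfitting is usually mitigated with shallower trees, row subsampling and larger leaves; a brief sketch with illustrative (untuned) values:

gmc_reg = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=2, subsample=0.8,
                                     min_samples_leaf=10, random_state=seed)
reg_cv = np.mean(model_selection.cross_val_score(gmc_reg, X, np.ravel(y_train), cv=kfold))
print('regularized GBDT - cv accuracy: {:.3f}%'.format(reg_cv * 100))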

for num in range(20,301,10):
    gmc = GradientBoostingClassifier(n_estimators=num)
    test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
    gmc.fit(X,y_train)
    y_pdt = gmc.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} iterations, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num,train_scores*100,test_scores*100))
With 20 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 30 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 40 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.661%
With 50 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.548%
With 60 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 70 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 80 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.774%
With 90 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.886%
With 100 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 110 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.224%
With 120 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 130 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 140 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.224%
With 150 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 160 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.337%
With 170 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.337%
With 180 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.226%
With 190 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.113%
With 200 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 210 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 220 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.113%
With 230 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 240 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.551%
With 250 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.776%
With 260 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 270 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.550%
With 280 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.550%
With 290 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 300 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.663%

Taking runtime into account, 110 iterations works reasonably well; note that the learning rate here is 0.1.
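Locking in that choice for later reference (a minimal sketch):

gmc_best = GradientBoostingClassifier(n_estimators=110, learning_rate=0.1)
best_cv = np.mean(model_selection.cross_val_score(gmc_best, X, np.ravel(y_train), cv=kfold))
print('GBDT, 110 trees, lr=0.1 - cv accuracy: {:.3f}%'.format(best_cv * 100))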

for num in range(20,301,10):
    gmc = GradientBoostingClassifier(n_estimators=num, learning_rate=0.01)
    test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
    gmc.fit(X,y_train)
    y_pdt = gmc.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} iterations, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num,train_scores*100,test_scores*100))
With 20 iterations, GBDT training accuracy: 100.000%, test accuracy: 65.809%
With 30 iterations, GBDT training accuracy: 100.000%, test accuracy: 79.528%
With 40 iterations, GBDT training accuracy: 100.000%, test accuracy: 79.866%
With 50 iterations, GBDT training accuracy: 100.000%, test accuracy: 80.203%
With 60 iterations, GBDT training accuracy: 100.000%, test accuracy: 80.989%
With 70 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.101%
With 80 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.664%
With 90 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.664%
With 100 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.101%
With 110 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 120 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 130 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 140 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 150 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 160 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 170 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 180 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.438%
With 190 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 200 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 210 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 220 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 230 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 240 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 250 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 260 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.775%
With 270 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 280 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.000%
With 290 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 300 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%

The results above show that lowering the learning rate to 0.01 does not perform particularly well.
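A smaller learning rate usually needs proportionally more trees to reach a comparable fit, so a fairer comparison scales n_estimators up as the learning rate goes down; a sketch with illustrative values:

for lr, n in [(0.1, 110), (0.05, 220), (0.01, 1100)]:
    gmc_lr = GradientBoostingClassifier(learning_rate=lr, n_estimators=n)
    score = np.mean(model_selection.cross_val_score(gmc_lr, X, np.ravel(y_train), cv=kfold))
    print('lr={:.2f}, trees={:d}: cv accuracy {:.3f}%'.format(lr, n, score * 100))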

import xgboost as xgb

xgb_model = xgb.XGBClassifier(learning_rate=0.1)
test_scores = np.mean(model_selection.cross_val_score(xgb_model, X, y_train, cv=kfold))
xgb_model.fit(X,y_train)
y_pdt = xgb_model.predict(X)
train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
print('With default settings, XGBoost training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(train_scores*100,test_scores*100))
With default settings, XGBoost training accuracy: 100.000%, test accuracy: 82.000%
for num in range(50,500,20):
    model = xgb.XGBClassifier(learning_rate=0.1,n_estimators=num)
    test_scores = np.mean(model_selection.cross_val_score(model, X, y_train, cv=kfold))
    model.fit(X,y_train)
    y_pdt = model.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} trees, XGBoost training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num, train_scores*100,test_scores*100))
With 50 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.562%
With 70 trees, XGBoost training accuracy: 100.000%, test accuracy: 83.012%
With 90 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.337%
With 110 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.112%
With 130 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.775%
With 150 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.775%
With 170 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.101%
With 190 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.100%
With 210 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.987%
With 230 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.650%
With 250 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.201%
With 270 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.426%
With 290 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.201%
With 310 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.200%
With 330 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.088%
With 350 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 370 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.750%
With 390 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 410 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.750%
With 430 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 450 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.862%
With 470 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.088%
With 490 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.200%

Increasing the number of trees shows that blindly adding more trees causes the test accuracy to drop.
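Instead of scanning a fixed grid, the number of boosting rounds can also be chosen by early stopping on a held-out split. A sketch, assuming the older xgboost scikit-learn wrapper where early_stopping_rounds is passed to fit (the exact API differs across xgboost versions):

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, np.ravel(y_train), test_size=0.2,
                                             random_state=seed, stratify=np.ravel(y_train))
es_model = xgb.XGBClassifier(learning_rate=0.1, n_estimators=500)
es_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric='error',
             early_stopping_rounds=30, verbose=False)
print('best number of boosting rounds found by early stopping:', es_model.best_iteration)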

Task Ⅲ

A worked example of the ID3 algorithm

[Table: the dataset for the worked example, 14 samples with the features Outlook, Temp, Humidity, Wind and a yes/no decision label]

Computing the entropy of the labels

Information entropy is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ $(k=1,2,\cdots,|y|)$, where $|y|$ denotes the total number of sample classes, i.e. the number of distinct labels. The information entropy of $D$ is then defined as
$$Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$$
The smaller $Ent(D)$ is, the higher the purity of $D$.
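For example, for the 14 samples in the table above (9 'yes' and 5 'no'):
$$Ent(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$
which is exactly the value computed in code below.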

def entrop(p1, p2):
    if p1 == 0 or p2 == 0:  # a branch containing only one class is maximally pure
        return 0
    else:
        return -p1*np.log2(p1)-p2*np.log2(p2)
# From the table above: 9 of the 14 samples are 'yes'
p_yes = 9/14
p_no = 1 - p_yes
entrop_decision = entrop(p_yes, p_no)
print(entrop_decision)
0.9402859586706311
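As a cross-check, a slightly more general helper (a sketch) computes the entropy of an arbitrary label array and reproduces the value above for 9 'yes' / 5 'no' samples:

def entropy(labels):
    """Entropy of a 1-D array of class labels, for any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

print(entropy(np.array(['yes'] * 9 + ['no'] * 5)))   # ~0.9403, matches entrop_decision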

Computing the information gain of each feature

Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \cdots, a^V\}$ (here $a$ refers to one feature in the data and $a^v$ to one of its concrete values). If $a$ is used to split the sample set $D$, it produces $V$ branch nodes; the $v$-th branch node contains all samples in $D$ whose value of attribute $a$ is $a^v$, denoted $D^v$. The entropy of each branch is computed with the formula above, and since the branches contain different numbers of samples, each branch node is weighted by $|D^v|/|D|$. This gives the information gain:
$$Gain(D,a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, Ent(D^v)$$
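The cells below apply this formula feature by feature; the same quantity could also be computed generically, as in this sketch (it assumes the table is loaded as a DataFrame, e.g. a hypothetical weather_df with a 'Decision' label column, and reuses the entropy helper defined above):

def information_gain(df, feature, target='Decision'):
    """Gain(D, a): label entropy minus the sample-weighted entropy of each branch."""
    ent_d = entropy(df[target])
    weighted = sum(len(branch) / len(df) * entropy(branch[target])
                   for _, branch in df.groupby(feature))
    return ent_d - weighted

# e.g. information_gain(weather_df, 'Outlook') should give ~0.247 for the table above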

# Feature: Outlook
p_outlook_sunny_yes = 2/5
p_outlook_sunny_no = 1 - p_outlook_sunny_yes
p_outlook_rain_yes = 3/5
p_outlook_rain_no = 1 - p_outlook_rain_yes
p_outlook_overcast_yes = 4/4
p_outlook_overcast_no = 1 - p_outlook_overcast_yes
p_outlook_sunny = 5/14
p_outlook_rain = 5/14
p_outlook_overcast = 4/14

gain_decision_outlook  = entrop_decision - (p_outlook_sunny * entrop(p_outlook_sunny_yes,p_outlook_sunny_no)
                                    +p_outlook_rain * entrop(p_outlook_rain_yes,p_outlook_rain_no)
                                    +p_outlook_overcast * entrop(p_outlook_overcast_yes,p_outlook_overcast_no))
print(gain_decision_outlook)
0.24674981977443933
# Feature: Temp
p_temp_hot_yes = 2/4
p_temp_hot_no = 1 - p_temp_hot_yes
p_temp_mild_yes = 4/6
p_temp_mild_no = 1 - p_temp_mild_yes
p_temp_cool_yes = 3/4
p_temp_cool_no = 1 - p_temp_cool_yes
p_temp_hot = 4/14
p_temp_mild = 6/14
p_temp_cool = 4/14


gain_decision_temp = entrop_decision - (p_temp_hot * entrop(p_temp_hot_yes, p_temp_hot_no)
                                        + p_temp_mild * entrop(p_temp_mild_yes, p_temp_mild_no)
                                        + p_temp_cool * entrop(p_temp_cool_yes, p_temp_cool_no))
print(gain_decision_temp)
0.02922256565895487
# Feature: Humidity
p_humidity_high_yes = 5/7
p_humidity_high_no = 1 - p_humidity_high_yes
p_humidity_normal_yes = 6/7
p_humidity_normal_no = 1 - p_humidity_normal_yes
p_humidity_high = 7/14
p_humidity_normal = 7/14


gain_decision_humidity = entrop_decision - (p_humidity_high * entrop(p_humidity_high_yes, p_humidity_high_no)
                                        + p_humidity_normal * entrop(p_humidity_normal_yes, p_humidity_normal_no))
print(gain_decision_humidity)
0.15744778017877914
# Feature: Wind
p_wind_weak_yes = 6/8
p_wind_weak_no = 1 - p_wind_weak_yes
p_wind_strong_yes = 3/6
p_wind_strong_no = 1 - p_wind_strong_yes
p_wind_weak = 8/14
p_wind_strong = 6/14


gain_decision_wind = entrop_decision - (p_wind_weak * entrop(p_wind_weak_yes, p_wind_weak_no)
                                        + p_wind_strong * entrop(p_wind_strong_yes, p_wind_strong_no))
print(gain_decision_wind)
0.04812703040826949
print('Gain(decision|outlook)={:.3f}\nGain(decision|temp)={:.3f}\nGain(decision|humidity)={:.3f}\nGain(decision|wind)={:.3f}\n'
      .format(gain_decision_outlook,gain_decision_temp,gain_decision_humidity,gain_decision_wind))
Gain(decision|outlook)=0.247
Gain(decision|temp)=0.029
Gain(decision|humidity)=0.157
Gain(decision|wind)=0.048

The results above show that the Outlook feature has the largest information gain, meaning a split on this feature yields the greatest "purity improvement". Therefore, Outlook is used as the first (root) node of the decision tree. The remaining branches of the tree are built in the same way, computing the information gain of each feature in turn on each branch's subset.
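The root can also be picked programmatically from the gains already computed (a small sketch):

gains = {'Outlook': gain_decision_outlook, 'Temp': gain_decision_temp,
         'Humidity': gain_decision_humidity, 'Wind': gain_decision_wind}
root_feature = max(gains, key=gains.get)
print('Root node of the tree:', root_feature)   # -> Outlook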
