Decision Trees: A Hands-on Mini-Project

Required libraries

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import sklearn

print('pandas:',pd.__version__)
print('matplotlib:',matplotlib.__version__)
print('numpy:',np.__version__)
print('sklearn:',sklearn.__version__)
pandas: 0.23.4
matplotlib: 2.2.3
numpy: 1.16.4
sklearn: 0.22.2.post1

Importing the data

The original data is the Titanic survival-prediction dataset. The data imported here has already been preprocessed (missing values were filled in and the categorical features were expanded into dummy variables).
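The preprocessing script itself is not shown here; for reference, a minimal sketch of how such a train_fixed.csv could be produced from the raw Kaggle train.csv (the file paths, fill strategies and column choices below are assumptions, not the author's actual pipeline):

raw = pd.read_csv('data/train.csv')                           # raw Kaggle Titanic training data (assumed path)
raw = raw.dropna(subset=['Embarked'])                         # two rows lack Embarked -> 889 rows remain
raw['Age'] = raw['Age'].fillna(raw['Age'].median())           # fill missing ages with the median
raw['Cabin'] = np.where(raw['Cabin'].isnull(), 'No', 'Yes')   # reduce Cabin to known / unknown
dummies = pd.get_dummies(raw[['Cabin', 'Embarked', 'Sex']])   # one-hot encode the categorical columns
pclass = pd.get_dummies(raw['Pclass'], prefix='Pclass')
survived = pd.get_dummies(raw['Survived'], prefix='Survived')
fixed = pd.concat([raw[['Age', 'SibSp', 'Parch', 'Fare']], dummies, pclass, survived], axis=1)
fixed.to_csv('data/train_fixed.csv')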

data_train = pd.read_csv('data/train_fixed.csv')
data_train.info()
X = data_train.iloc[:,1:15]   # features
y = data_train.iloc[:,-2:]    # labels (one-hot: Survived_0, Survived_1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 17 columns):
Unnamed: 0    889 non-null int64
Age           889 non-null float64
SibSp         889 non-null int64
Parch         889 non-null int64
Fare          889 non-null float64
Cabin_No      889 non-null int64
Cabin_Yes     889 non-null int64
Embarked_C    889 non-null int64
Embarked_Q    889 non-null int64
Embarked_S    889 non-null int64
Sex_female    889 non-null int64
Sex_male      889 non-null int64
Pclass_1      889 non-null int64
Pclass_2      889 non-null int64
Pclass_3      889 non-null int64
Survived_0    889 non-null int64
Survived_1    889 non-null int64
dtypes: float64(2), int64(15)
memory usage: 118.1 KB

Inspecting the feature information of the training and test sets shows 889 training samples and 418 test samples, with no missing values; all features are numeric, which makes them easy to work with.
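The preprocessed test set (assumed here to be saved as data/test_fixed.csv with the same feature columns) can be inspected the same way:

data_test = pd.read_csv('data/test_fixed.csv')   # assumed file name for the preprocessed test set
data_test.info()                                  # expected: 418 entries, no missing values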

Task Ⅰ

Use a random forest for prediction, observe how different tree depths and numbers of trees affect the results, and summarize.

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import warnings
warnings.filterwarnings("ignore")
num_trees = [25, 50, 100, 200, 300, 500, 700, 900, 1000]
max_depths = [1, 3, 5, 10, 15, 20]
seed = 124
kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=seed)
plt.figure(figsize=(16,14))
for max_depth in max_depths:
    train_scores = []
    test_scores = []
    for num_tree in num_trees:
        rf_model = RandomForestClassifier(n_estimators=num_tree, max_depth=max_depth)
        # 3-fold cross-validated accuracy is used as the "test" score
        test_scores.append(np.mean(model_selection.cross_val_score(rf_model, X, y, cv=kfold)))
        # refit on the full training set and record the training accuracy
        rf_model.fit(X,y)
        y_pdt = rf_model.predict(X)
        train_scores.append(np.mean(np.equal(np.argmax(np.array(y),1),np.argmax(np.array(y_pdt),1))))
    plt.subplot(211)
    plt.plot(num_trees,train_scores,'-^',label='max_depth_'+ str(max_depth))
    plt.title('train_scores')
    plt.legend()
    plt.xlabel('Number of trees')
    plt.ylabel('Accuracy')
    plt.subplot(212)
    plt.plot(num_trees,test_scores,'-^',label='max_depth_'+ str(max_depth))
    plt.title('test_scores')
    plt.legend()
    plt.xlabel('Number of trees')
    plt.ylabel('Accuracy')

[Figure: training-set accuracy (top) and cross-validation accuracy (bottom) versus number of trees, one curve per max_depth value]

Summary: comparing the training and test accuracies in the figure above, the random forest clearly overfits; pre-pruning the individual decision trees would improve this. On both the training and test sets, the accuracy stabilizes as the number of trees grows rather than increasing linearly, so this parameter should not be chosen too small; on the other hand, if runtime matters, it should not be too large either, since a very large forest takes much longer to train. As for max_depth, the training accuracy clearly increases with the maximum depth, whereas on the test set a depth of 10 gives roughly the best accuracy. This shows that larger tree depths lead to more severe overfitting.
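As a sketch of the pre-pruning mentioned above, the individual trees can be constrained via min_samples_leaf / min_samples_split in addition to max_depth (the values below are illustrative, not tuned):

pruned_rf = RandomForestClassifier(n_estimators=300, max_depth=10,
                                   min_samples_leaf=5, min_samples_split=10,
                                   random_state=seed)
cv_acc = np.mean(model_selection.cross_val_score(pruned_rf, X, y, cv=kfold))
pruned_rf.fit(X, y)
print('pre-pruned RF - train accuracy: {:.3f}, cv accuracy: {:.3f}'.format(pruned_rf.score(X, y), cv_acc))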

Task Ⅱ

Use scikit-learn's GBDT or XGBoost for prediction and observe how the number of trees affects the results.

from sklearn.ensemble import GradientBoostingClassifier

# convert the one-hot labels (Survived_0, Survived_1) back to a single 0/1 label
y_train = []
for i in range(len(y)):
    if y.iloc[i,0]==1:      # Survived_0 == 1  ->  did not survive
        y_train.append(0)
    else:
        y_train.append(1)
y_train = np.array(y_train).reshape(-1,1)

gmc = GradientBoostingClassifier()
test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
gmc.fit(X,y_train)
y_pdt = gmc.predict(X)
train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
print('With default settings, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(train_scores*100,test_scores*100))
With default settings, GBDT training accuracy: 100.000%, test accuracy: 81.999%

With the default settings the training accuracy is very high (100%) while the test accuracy is much lower, i.e. the model overfits quite severely.
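Besides the number of trees explored below, GBDT overfitting is usually mitigated with shallower trees, row subsampling and larger leaves; a brief sketch with illustrative (untuned) values:

gmc_reg = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=2, subsample=0.8,
                                     min_samples_leaf=10, random_state=seed)
reg_cv = np.mean(model_selection.cross_val_score(gmc_reg, X, np.ravel(y_train), cv=kfold))
print('regularized GBDT - cv accuracy: {:.3f}%'.format(reg_cv * 100))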

for num in range(20,301,10):
    gmc = GradientBoostingClassifier(n_estimators=num)
    test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
    gmc.fit(X,y_train)
    y_pdt = gmc.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} iterations, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num,train_scores*100,test_scores*100))
With 20 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 30 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 40 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.661%
With 50 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.548%
With 60 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 70 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 80 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.774%
With 90 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.886%
With 100 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 110 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.224%
With 120 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 130 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 140 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.224%
With 150 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.999%
With 160 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.337%
With 170 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.337%
With 180 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.226%
With 190 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.113%
With 200 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 210 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 220 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.113%
With 230 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 240 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.551%
With 250 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.776%
With 260 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.001%
With 270 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.550%
With 280 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.550%
With 290 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 300 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.663%

Taking runtime into account, 110 iterations works reasonably well; note that the learning rate here is 0.1.
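Locking in that choice for later reference (a minimal sketch):

gmc_best = GradientBoostingClassifier(n_estimators=110, learning_rate=0.1)
best_cv = np.mean(model_selection.cross_val_score(gmc_best, X, np.ravel(y_train), cv=kfold))
print('GBDT, 110 trees, lr=0.1 - cv accuracy: {:.3f}%'.format(best_cv * 100))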

for num in range(20,301,10):
    gmc = GradientBoostingClassifier(n_estimators=num, learning_rate=0.01)
    test_scores = np.mean(model_selection.cross_val_score(gmc, X, y_train, cv=kfold))
    gmc.fit(X,y_train)
    y_pdt = gmc.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} iterations, GBDT training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num,train_scores*100,test_scores*100))
With 20 iterations, GBDT training accuracy: 100.000%, test accuracy: 65.809%
With 30 iterations, GBDT training accuracy: 100.000%, test accuracy: 79.528%
With 40 iterations, GBDT training accuracy: 100.000%, test accuracy: 79.866%
With 50 iterations, GBDT training accuracy: 100.000%, test accuracy: 80.203%
With 60 iterations, GBDT training accuracy: 100.000%, test accuracy: 80.989%
With 70 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.101%
With 80 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.664%
With 90 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.664%
With 100 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.101%
With 110 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 120 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 130 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 140 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 150 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.325%
With 160 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 170 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.437%
With 180 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.438%
With 190 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 200 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 210 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.662%
With 220 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.888%
With 230 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 240 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 250 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 260 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.775%
With 270 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 280 iterations, GBDT training accuracy: 100.000%, test accuracy: 82.000%
With 290 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%
With 300 iterations, GBDT training accuracy: 100.000%, test accuracy: 81.887%

The results above show that lowering the learning rate to 0.01 does not perform particularly well.
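A smaller learning rate usually needs proportionally more trees to reach a comparable fit, so a fairer comparison scales n_estimators up as the learning rate goes down; a sketch with illustrative values:

for lr, n in [(0.1, 110), (0.05, 220), (0.01, 1100)]:
    gmc_lr = GradientBoostingClassifier(learning_rate=lr, n_estimators=n)
    score = np.mean(model_selection.cross_val_score(gmc_lr, X, np.ravel(y_train), cv=kfold))
    print('lr={:.2f}, trees={:d}: cv accuracy {:.3f}%'.format(lr, n, score * 100))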

import xgboost as xgb

xgb_model = xgb.XGBClassifier(learning_rate=0.1)
test_scores = np.mean(model_selection.cross_val_score(xgb_model, X, y_train, cv=kfold))
xgb_model.fit(X,y_train)
y_pdt = xgb_model.predict(X)
train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
print('With default settings, XGBoost training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(train_scores*100,test_scores*100))
With default settings, XGBoost training accuracy: 100.000%, test accuracy: 82.000%
for num in range(50,500,20):
    model = xgb.XGBClassifier(learning_rate=0.1,n_estimators=num)
    test_scores = np.mean(model_selection.cross_val_score(model, X, y_train, cv=kfold))
    model.fit(X,y_train)
    y_pdt = model.predict(X)
    train_scores = np.mean(np.ravel(y_pdt) == np.ravel(y_train))   # training-set accuracy
    print('With {:d} trees, XGBoost training accuracy: {:.3f}%, test accuracy: {:.3f}%'.format(num, train_scores*100,test_scores*100))
With 50 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.562%
With 70 trees, XGBoost training accuracy: 100.000%, test accuracy: 83.012%
With 90 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.337%
With 110 trees, XGBoost training accuracy: 100.000%, test accuracy: 82.112%
With 130 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.775%
With 150 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.775%
With 170 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.101%
With 190 trees, XGBoost training accuracy: 100.000%, test accuracy: 81.100%
With 210 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.987%
With 230 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.650%
With 250 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.201%
With 270 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.426%
With 290 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.201%
With 310 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.200%
With 330 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.088%
With 350 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 370 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.750%
With 390 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 410 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.750%
With 430 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.975%
With 450 trees, XGBoost training accuracy: 100.000%, test accuracy: 79.862%
With 470 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.088%
With 490 trees, XGBoost training accuracy: 100.000%, test accuracy: 80.200%

Increasing the number of trees shows that blindly adding more trees causes the test accuracy to drop.
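Instead of scanning a fixed grid, the number of boosting rounds can also be chosen by early stopping on a held-out split. A sketch, assuming the older xgboost scikit-learn wrapper where early_stopping_rounds is passed to fit (the exact API differs across xgboost versions):

from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, np.ravel(y_train), test_size=0.2,
                                             random_state=seed, stratify=np.ravel(y_train))
es_model = xgb.XGBClassifier(learning_rate=0.1, n_estimators=500)
es_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], eval_metric='error',
             early_stopping_rounds=30, verbose=False)
print('best number of boosting rounds found by early stopping:', es_model.best_iteration)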

Task Ⅲ

A worked example of the ID3 algorithm

[Table: the dataset for the worked example, 14 samples with the features Outlook, Temp, Humidity, Wind and a yes/no decision label]

Computing the entropy of the labels

Information entropy is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ $(k=1,2,\cdots,|y|)$, where $|y|$ denotes the total number of sample classes, i.e. the number of distinct labels. The information entropy of $D$ is then defined as
$$Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$$
The smaller $Ent(D)$ is, the higher the purity of $D$.
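For example, for the 14 samples in the table above (9 'yes' and 5 'no'):
$$Ent(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$
which is exactly the value computed in code below.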

def entrop(p1, p2):
    if p1 == 0 or p2 == 0:  # a branch containing only one class is maximally pure
        return 0
    else:
        return -p1*np.log2(p1)-p2*np.log2(p2)
# From the table above: 9 of the 14 samples are 'yes'
p_yes = 9/14
p_no = 1 - p_yes
entrop_decision = entrop(p_yes, p_no)
print(entrop_decision)
0.9402859586706311
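As a cross-check, a slightly more general helper (a sketch) computes the entropy of an arbitrary label array and reproduces the value above for 9 'yes' / 5 'no' samples:

def entropy(labels):
    """Entropy of a 1-D array of class labels, for any number of classes."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

print(entropy(np.array(['yes'] * 9 + ['no'] * 5)))   # ~0.9403, matches entrop_decision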

Computing the information gain of each feature

Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \cdots, a^V\}$ (here $a$ refers to one feature in the data and $a^v$ to one of its concrete values). If $a$ is used to split the sample set $D$, it produces $V$ branch nodes; the $v$-th branch node contains all samples in $D$ whose value of attribute $a$ is $a^v$, denoted $D^v$. The entropy of each branch is computed with the formula above, and since the branches contain different numbers of samples, each branch node is weighted by $|D^v|/|D|$. This gives the information gain:
$$Gain(D,a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, Ent(D^v)$$
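The cells below apply this formula feature by feature; the same quantity could also be computed generically, as in this sketch (it assumes the table is loaded as a DataFrame, e.g. a hypothetical weather_df with a 'Decision' label column, and reuses the entropy helper defined above):

def information_gain(df, feature, target='Decision'):
    """Gain(D, a): label entropy minus the sample-weighted entropy of each branch."""
    ent_d = entropy(df[target])
    weighted = sum(len(branch) / len(df) * entropy(branch[target])
                   for _, branch in df.groupby(feature))
    return ent_d - weighted

# e.g. information_gain(weather_df, 'Outlook') should give ~0.247 for the table above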

# Feature: Outlook
p_outlook_sunny_yes = 2/5
p_outlook_sunny_no = 1 - p_outlook_sunny_yes
p_outlook_rain_yes = 3/5
p_outlook_rain_no = 1 - p_outlook_rain_yes
p_outlook_overcast_yes = 4/4
p_outlook_overcast_no = 1 - p_outlook_overcast_yes
p_outlook_sunny = 5/14
p_outlook_rain = 5/14
p_outlook_overcast = 4/14

gain_decision_outlook  = entrop_decision - (p_outlook_sunny * entrop(p_outlook_sunny_yes,p_outlook_sunny_no)
                                    +p_outlook_rain * entrop(p_outlook_rain_yes,p_outlook_rain_no)
                                    +p_outlook_overcast * entrop(p_outlook_overcast_yes,p_outlook_overcast_no))
print(gain_decision_outlook)
0.24674981977443933
# Feature: Temp
p_temp_hot_yes = 2/4
p_temp_hot_no = 1 - p_temp_hot_yes
p_temp_mild_yes = 4/6
p_temp_mild_no = 1 - p_temp_mild_yes
p_temp_cool_yes = 3/4
p_temp_cool_no = 1 - p_temp_cool_yes
p_temp_hot = 4/14
p_temp_mild = 6/14
p_temp_cool = 4/14


gain_decision_temp = entrop_decision - (p_temp_hot * entrop(p_temp_hot_yes, p_temp_hot_no)
                                        + p_temp_mild * entrop(p_temp_mild_yes, p_temp_mild_no)
                                        + p_temp_cool * entrop(p_temp_cool_yes, p_temp_cool_no))
print(gain_decision_temp)
0.02922256565895487
# Feature: Humidity
p_humidity_high_yes = 5/7
p_humidity_high_no = 1 - p_humidity_high_yes
p_humidity_normal_yes = 6/7
p_humidity_normal_no = 1 - p_humidity_normal_yes
p_humidity_high = 7/14
p_humidity_normal = 7/14


gain_decision_humidity = entrop_decision - (p_humidity_high * entrop(p_humidity_high_yes, p_humidity_high_no)
                                        + p_humidity_normal * entrop(p_humidity_normal_yes, p_humidity_normal_no))
print(gain_decision_humidity)
0.15744778017877914
# Feature: Wind
p_wind_weak_yes = 6/8
p_wind_weak_no = 1 - p_wind_weak_yes
p_wind_strong_yes = 3/6
p_wind_strong_no = 1 - p_wind_strong_yes
p_wind_weak = 8/14
p_wind_strong = 6/14


gain_decision_wind = entrop_decision - (p_wind_weak * entrop(p_wind_weak_yes, p_wind_weak_no)
                                        + p_wind_strong * entrop(p_wind_strong_yes, p_wind_strong_no))
print(gain_decision_wind)
0.04812703040826949
print('Gain(decision|outlook)={:.3f}\nGain(decision|temp)={:.3f}\nGain(decision|humidity)={:.3f}\nGain(decision|wind)={:.3f}\n'
      .format(gain_decision_outlook,gain_decision_temp,gain_decision_humidity,gain_decision_wind))
Gain(decision|outlook)=0.247
Gain(decision|temp)=0.029
Gain(decision|humidity)=0.157
Gain(decision|wind)=0.048

The results above show that the Outlook feature has the largest information gain, meaning a split on this feature yields the greatest "purity improvement". Therefore, Outlook is used as the first (root) node of the decision tree. The remaining branches of the tree are built in the same way, computing the information gain of each feature in turn on each branch's subset.
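The root can also be picked programmatically from the gains already computed (a small sketch):

gains = {'Outlook': gain_decision_outlook, 'Temp': gain_decision_temp,
         'Humidity': gain_decision_humidity, 'Wind': gain_decision_wind}
root_feature = max(gains, key=gains.get)
print('Root node of the tree:', root_feature)   # -> Outlook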
