Machine Learning 3: Decision Trees

Chise1

Article link: https://blog.csdn.net/weixin_36179862/article/details/84993102

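Introduction

Core idea: similar inputs produce similar outputs.
Personal summary: a decision tree enumerates the cases feature by feature and arranges them into a tree. It can be used for classification and also for regression (fitting continuous outputs).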

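Basic Principle

Similar inputs lead to similar outputs. Each attribute is first encoded as an integer:
Age: young = 1, middle-aged = 2, senior = 3
Education: associate degree = 1, bachelor's = 2, master's = 3, doctorate = 4
Experience: lacking = 1, average = 2, rich = 3, senior = 4
Gender: male = 1, female = 2
Salary level: 1 = low, 2 = medium, 3 = high, 4 = very high
Age  Education  Experience  Gender  ->  Salary  Salary level
1    1          1           2           5000    1
1    2          2           1           8000    2
2    3          3           2           10000   3
3               4           1           30000   4
...
------------------------------------------------------------
1    2          2           1           ?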
  
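Regression: take the average of the outputs. Classification: take a vote. In both cases the outputs can be weighted by how similar the features are.
As the sub-tables are split further, the information entropy decreases, the information content increases, and the data becomes more ordered.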
         Age 1                   Age 2                   Age 3
       /   |   \               /   |   \               /   |   \
  Edu 1  Edu 2  Edu 3     Edu 1  Edu 2  Edu 3     Edu 1  Edu 2  Edu 3
  11123                   2...                    3...
  12211                   2...                    3...
  11221                   2...                    3...

  11... 12... 13... 14...
  11... 12... 13... 14...
  11... 12... 13... 14...

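Select each column of the original sample matrix in turn and build a tree of sub-tables whose rows share the same value for that feature; in the leaf-level sub-tables all feature values are identical. An input with unknown output is routed by the same rules to one leaf sub-table, and the predicted output is computed from the outputs of the samples in that sub-table, by averaging (regression) or by voting (classification).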
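The lookup described above can be sketched in a few lines. This is only an illustration, not the way scikit-learn builds trees: the sample rows, the salary levels, and the helper names (`build_leaf_tables`, `predict_regression`, `predict_classification`) are all made up for the example, and every leaf sub-table is simply keyed by the full feature tuple.

```python
from collections import Counter, defaultdict

# Hypothetical encoded samples: (age, education, experience, gender) -> salary level.
samples = [
    ((1, 1, 1, 2), 1),
    ((1, 2, 2, 1), 2),
    ((1, 2, 2, 1), 3),
    ((1, 2, 2, 1), 2),
    ((2, 3, 3, 2), 3),
]


def build_leaf_tables(rows):
    """Group outputs by the full feature tuple: one leaf sub-table per tuple."""
    leaves = defaultdict(list)
    for features, output in rows:
        leaves[features].append(output)
    return dict(leaves)


def predict_regression(leaves, features):
    """Average the outputs stored in the matching leaf sub-table."""
    outputs = leaves[features]  # raises KeyError for an unseen combination
    return sum(outputs) / len(outputs)


def predict_classification(leaves, features):
    """Return the most common output in the matching leaf sub-table."""
    return Counter(leaves[features]).most_common(1)[0][0]


leaves = build_leaf_tables(samples)
query = (1, 2, 2, 1)
print(predict_regression(leaves, query))      # mean of the outputs [2, 3, 2]
print(predict_classification(leaves, query))  # majority of the outputs [2, 3, 2]
```

A feature combination that never appeared in training has no leaf here, which is one reason real implementations stop splitting early instead of splitting on every column.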

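Engineering Optimization

Based on conditions given in advance, the splitting of sub-tables is stopped early. This simplifies the tree structure and shortens build and lookup time, improving model performance at little cost in prediction accuracy. In general, the feature that yields the largest reduction in information entropy is chosen first as the basis for splitting.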
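To make "largest reduction in information entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a label list and the entropy reduction (information gain) for each candidate column. The tiny table and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, column):
    """Entropy reduction obtained by splitting the table on one feature column."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[column], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical encoded samples; the columns are (age, education).
rows = [(1, 1), (1, 2), (2, 1), (2, 2)]
labels = [1, 2, 1, 2]  # in this toy table the output depends only on education

gains = [information_gain(rows, labels, c) for c in range(2)]
best_column = max(range(2), key=lambda c: gains[c])
print(gains, best_column)
```

In this toy table, splitting on age leaves both sub-tables as mixed as before, while splitting on education makes each sub-table pure, so education is the column chosen first.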

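Ensemble Algorithms

Ensemble algorithms, also called ensemble methods over weak learners, combine the conclusions of several different learners by averaging or voting to give a more reliable prediction. The chosen weak learners should be sufficiently diverse, in algorithm or in data, so that each has a somewhat different bias; only then does the combined conclusion generalize better.
A decision-tree-based ensemble builds several different decision tree models according to some rule, lets each predict the unknown sample, and then averages or votes to reach a combined conclusion.
Depending on the rule used to build the trees, decision-tree-based ensembles fall into the following kinds: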

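- Bagging (bootstrap aggregating): randomly select part of the original training samples, with or without replacement, and build one decision tree from them. Repeating this yields several decision trees, which weakens the influence of a few dominant samples on the prediction and improves model accuracy.

import sklearn.tree as st
# A single decision tree regressor, i.e. the base learner that bagging repeats on resampled data
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)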

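- Random forest: if, on top of bagging, each tree is built from randomly chosen features (columns) as well as randomly chosen samples (rows), the result is called a random forest.

import sklearn.ensemble as se
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)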

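- Boosting: first assign equal weights to all training samples and build the first decision tree. Use that tree to predict the training samples, raise the weights of the samples it predicted wrongly, and then build the next tree. Continuing in this way yields several decision trees, each trained with different sample weights.

model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# Feature importances reported by the decision-tree-based boosting regressor
fi_ab = model.feature_importances_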
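The three kinds of ensemble can be tried side by side on synthetic data. The sketch below is an illustration under assumptions: the data comes from `make_regression`, the tree counts are kept small so it runs quickly, and `BaggingRegressor` stands in for the bagging described above. Note that scikit-learn's random forest draws a random subset of features at each split (controlled by `max_features`) rather than once per tree.

```python
import sklearn.datasets as sd
import sklearn.ensemble as se
import sklearn.metrics as sm
import sklearn.model_selection as ms
import sklearn.tree as st

# Synthetic regression data (an assumption made for this sketch).
x, y = sd.make_regression(n_samples=300, n_features=8, n_informative=4,
                          noise=10.0, random_state=7)
train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y, test_size=0.2, random_state=7)

models = {
    # Bagging: each tree is trained on a bootstrap sample of the rows.
    'bagging': se.BaggingRegressor(
        st.DecisionTreeRegressor(max_depth=4),
        n_estimators=50, random_state=7),
    # Random forest: bootstrap rows plus a random feature subset at each split.
    'random forest': se.RandomForestRegressor(
        max_depth=4, n_estimators=50, max_features=0.5, random_state=7),
    # Boosting: later trees focus on samples the earlier trees predicted badly.
    'adaboost': se.AdaBoostRegressor(
        st.DecisionTreeRegressor(max_depth=4),
        n_estimators=50, random_state=7),
}

scores = {}
for name, model in models.items():
    model.fit(train_x, train_y)
    scores[name] = sm.r2_score(test_y, model.predict(test_x))
    print(name, round(scores[name], 3))
```

The base tree is passed positionally because the keyword for it was renamed between scikit-learn releases.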

Sample Code

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Datasets bundled with scikit-learn
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import sklearn.metrics as sm
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
boston = sd.load_boston()
# Shuffle the samples
x, y = su.shuffle(boston.data, boston.target,
                  random_state=7)

# Use the first 80% for training and the remaining 20% for testing
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# Decision tree regressor
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
# Decision-tree-based boosting (AdaBoost) regressor
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    # n_estimators: maximum number of boosted trees
    # random_state: random seed, fixed so that results are reproducible
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))

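Feature Importance

When a decision tree model decides which feature to split on first, it picks the split that gives the largest reduction in impurity (entropy in the description above; for regression trees, scikit-learn by default measures the reduction in squared error instead). As a by-product of training, the model therefore yields how strongly each feature influences the output, that is, the feature importance: feature_importances_. This output depends on the model algorithm.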
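A quick way to see what `feature_importances_` reports is to train on data where the answer is known in advance. In this sketch the data is synthetic and the target is built (by assumption) almost entirely from column 0, so that column should rank first; the importances are normalized and sum to 1.

```python
import numpy as np
import sklearn.tree as st

rng = np.random.RandomState(7)
x = rng.uniform(size=(200, 3))
# Hypothetical target: driven by column 0 plus a little noise;
# columns 1 and 2 carry no signal.
y = 10.0 * x[:, 0] + rng.normal(scale=0.1, size=200)

model = st.DecisionTreeRegressor(max_depth=4, random_state=7)
model.fit(x, y)

fi = model.feature_importances_
print(fi)                   # one importance value per column
print(fi.sum())             # importances are normalized
print(fi.argsort()[::-1])   # column indices, most important first
```

The same `argsort()[::-1]` ordering is what the plotting code uses to sort the bars.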

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
boston = sd.load_boston()
feature_names = boston.feature_names
x, y = su.shuffle(boston.data, boston.target,
                  random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = x[:train_size], x[train_size:], y[:train_size], y[train_size:]
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# Feature importances reported by the decision tree regressor
fi_dt = model.feature_importances_
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# Feature importances reported by the decision-tree-based boosting regressor
fi_ab = model.feature_importances_
mp.figure('Feature Importance', facecolor='lightgray')
mp.subplot(211)
mp.title('Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dt[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, feature_names[sorted_indices],
          rotation=30)
mp.subplot(212)
mp.title('AdaBoost Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_ab[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, feature_names[sorted_indices],
          rotation=30)
mp.tight_layout()
mp.show()

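The feature importances computed by a model depend not only on the chosen algorithm but also on the granularity at which the data was collected.
The example below trains on bike-sharing usage counted per day and per hour, and plots the feature importances of each as a bar chart.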

import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp
with open(r'C:\Users\Cs\Desktop\机器学习\ML\data\bike_day.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_dy = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
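# su.shuffle: shuffle arrays or sparse matrices in a consistent way.
# It is a convenience alias for resample(*arrays, replace=False) that randomly permutes the collections.
# random_state: random seed; the same seed gives the same shuffle on every run, which makes results reproducible (without it the order differs each run)
x, y = su.shuffle(x, y, random_state=7)
# 90% of the data for training, 10% for testing
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# max_depth=10: maximum depth of each decision tree
# n_estimators=1000: number of trees in the forest
# min_samples_split=2: minimum number of samples a node must contain before it can be split
# RandomForestRegressor: random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)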
model.fit(train_x, train_y)
# Feature importances based on the daily dataset
fi_dy = model.feature_importances_
pred_test_y = model.predict(test_x)
# Print the R2 score to judge prediction quality
print(sm.r2_score(test_y, pred_test_y))
with open(r'C:\Users\Cs\Desktop\机器学习\ML\data\bike_hour.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_hr = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# Random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# Feature importances based on the hourly dataset
fi_hr = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))


mp.figure('Bike', facecolor='lightgray')
mp.subplot(211)
mp.title('Day', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dy.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dy[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, fn_dy[sorted_indices],
          rotation=30)
mp.subplot(212)
mp.title('Hour', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_hr.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_hr[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, fn_hr[sorted_indices],
          rotation=30)
mp.tight_layout()
mp.show()