【学习笔记】陈强-机器学习-Python-Ch13 提升法（1）

赛博机器喵

于 2024-08-28 20:38:26 发布

阅读量504

点赞数 12

文章标签：学习笔记机器学习 python

本文链接：https://blog.csdn.net/2201_76026029/article/details/141559055

版权

系列文章目录

监督学习：参数方法

前言

本学习笔记仅为以防自己忘记了，顺便分享给一起学习的网友们参考。如有不同意见/建议，可以友好讨论。

本学习笔记所有的代码和数据都可以从陈强老师的个人主页上下载

参考书目：陈强.机器学习及Python应用. 北京：高等教育出版社, 2021.

数学原理等详见陈强老师的 PPT

参考了：网友阡之尘埃的Python机器学习10——梯度提升

一、集成学习：提升法（Boosting）

集成学习（Ensemble Learning）方法通过结合多个模型的预测结果来减少单个模型可能出现的过拟合和偏差，提高预测性能和稳定性。
主要的集成学习方法包括：Bagging（Bootstrap Aggregating），Boosting，Stacking（Stacked Generalization），Voting。

集成学习	随机森林	提升法（Boosting）
组合策略	尽量使决策树之间不相关	考虑基学习器序列，使得每个基学习器正好弥补之前所有基学习器的缺点
位置	每棵决策树的作用完全对称，可随便更换决策树的位置	每棵决策树的作用并不相同，这些依次而种的决策树之间的相对位置不能随意变动。
袋外误差	使用自助样本，故可计算袋外误差	基于原始样本，故一般无法计算袋外误差，，但可使用不同的观测值权重

Boosting是一种 “序贯集成法”（sequential ensemble approach）：通过顺序训练多个基学习器，每个新学习器都关注之前学习器未能正确分类的数据。最终模型是所有学习器预测的加权和。
Boosting 方法的核心思想是通过逐步改进模型来减少预测错误。

常见的Boosting方法：

AdaBoost (Adaptive Boosting，自适应提升)：仅适用于分类问题。通过加权样本来调整每个基学习器的关注点。
1）初始化时为每个样本分配相同的权重。
2）训练第一个基学习器，并计算其在训练数据上的错误率。
3）基于错误率调整每个样本的权重，使得之前被错分类的样本权重增加。
4）训练下一个基学习器，重点关注调整后的样本权重。
5）最终模型是所有基学习器的加权和，其中每个基学习器的权重与其准确率有关。
Gradient Boosting（简记GBM，梯度提升）：通过最小化损失函数的梯度来训练新的基学习器。
1）初始化模型为一个简单的模型（例如均值预测）
2）计算当前模型对训练数据的残差（真实值 - 预测值）。
3）训练新的基学习器来预测这些残差。
4）更新模型，将新学习器的预测结果加到当前模型中。
5）重复上述步骤，直到达到指定的基学习器数量或误差收敛。
XGBoost (Extreme Gradient Boosting，极端坡度提升)：一种高效的 Gradient Boosting 实现，支持各种优化和正则化技术。
1）XGBoost 是对 Gradient Boosting 的一种优化实现。它通过引入正则化、剪枝和并行计算来提高模型的效率和准确性。
2）采用了二阶梯度信息进行优化，使得训练过程更快、更准确。

二、回归提升树

使用波士顿房价数据boston （参考【学习笔记】陈强-机器学习-Python-Ch4 线性回归 &【学习笔记】陈强-机器学习-Python-Ch11 决策树（Decision Tree） & 【学习笔记及课后题练习】陈强-机器学习-Python-Ch12 随机森林（Random Forest））
我不知道该怎么跟你们解释，我的结果很多都和陈强老师书上的不一样……麻了……

1. 载入数据+数据处理

import numpy as np
import pandas as pd

# 从原始来源加载数据
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# 处理数据
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# 创建DataFrame
columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", 
    "PTRATIO", "B", "LSTAT"
]
df = pd.DataFrame(data, columns=columns)
df['MEDV'] = target

# 确定特征
X = df.drop(columns=['MEDV'])
y = df['MEDV']


# 将数据分割为训练集（70%）和测试集（30%）
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

2. 回归提升树的估计

from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)

结果输出： 0.9214598819200858

`笔记：GradientBoostingRegressor()`

GradientBoostingRegressor 是 scikit-learn 库中的一个类，用于回归任务。它实现了梯度提升回归（Gradient Boosting Regression），用于通过组合多个弱学习器（通常是决策树）的预测来提高回归模型的性能。

#基本语法和参数
from sklearn.ensemble import GradientBoostingRegressor

# 初始化模型
model = GradientBoostingRegressor(
    n_estimators=100, #训练的提升阶段的数量（即树的数量）默认值: 100
    learning_rate=0.1, #学习率，用于缩小每棵树的贡献。默认值: 0.1
    max_depth=3, #每棵树的最大深度。默认值: 3
    min_samples_split=2, #内部节点再分裂所需的最小样本数。默认值: 2
    min_samples_leaf=1, #叶节点上所需的最小样本数。默认值: 1
    subsample=1.0, #用于拟合每棵基学习器的样本比例。默认值: 1.0
    criterion='squared_error', #衡量节点分裂质量的标准。默认值: 'squared_error'。可选：'friedman_mse'
    loss='squared_error') #损失函数，用于优化目标。默认值: 'squared_error'（适用于普通的回归任务）。可选：'absolute_error'，'huber'（对异常值有一定的鲁棒性）, 'quantile'（适用于分位回归）

基本属性

train_score_ 训练集上的得分，通常是对数损失的序列。用于诊断模型在训练过程中的表现。
estimators_所有基学习器（决策树）的列表。包含训练过程中的每棵树。类型：list
n_estimators_ 提升阶段的数量，即训练的树的数量。类型：int
feature_importances_为特征的重要性
learning_rate学习率，控制每棵树对最终预测的影响。
max_depth每棵决策树的最大深度。
min_samples_split 内部节点再分裂所需的最小样本数。
min_samples_leaf 叶节点上所需的最小样本数。
subsample 每棵基学习器的样本比例。类型: float
loss 损失函数的名称。类型: str

对比：线性回归

from sklearn.linear_model import LinearRegression
LinearRegression().fit(X_train, y_train).score(X_test, y_test)

结果输出： 0.7836295385076302 远低于回归提升树

3. 选出最优超参数组合

回归提升树 (boosted regression tree）的超参数较多：

决策树的数量 $M$ ：一般可使用交叉验证选择
决策树的分裂次数 $d$ （交互深度 $in t er a c t i o n d e pt h$ ）:一般建议选择 $d = 48$
学习率（learning rate，也称为收缩参数shrinkage parameter） $0 < η < 1$ ：通常选择 $η = 0.1 或 0.01$

from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV #随机搜索超参数空间

param_distributions = { #param_distributions = {...}:定义一个字典，指定要搜索的超参数范围
    'n_estimators': range(1, 300), # n_estimators(树的数量)的取值范围是从 1 到 299。
    'max_depth': range(1, 10), # max_depth(树的最大深度)的取值范围是从 1 到 9。
    'subsample': np.linspace(0.1,1,10), # subsample(样本的子采样比例)的取值范围是从0.1到1，共10个均匀间隔的值。
    'learning_rate': np.linspace(0.1, 1, 10)} #learning_rate(学习率)的取值范围是从0.1到1，共10个均匀间隔的值。
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = RandomizedSearchCV( #创建一个RandomizedSearchCV对象，用于在指定的超参数分布上进行随机搜索
    estimator=GradientBoostingRegressor(random_state=123), #使用GradientBoostingRegressor作为基模型
    param_distributions=param_distributions, #传入之前定义的超参数分布字典
    cv=kfold, 
    n_iter=100, #在超参数空间随机搜索100次
    random_state=0)
model.fit(X_train, y_train)
#最优超参数s
model.best_params_

结果输出：
{‘subsample’: 0.30000000000000004, ‘n_estimators’: 83, ‘max_depth’: 8, ‘learning_rate’: 0.1}
和陈强老师书上的不一样……

最优模型（最优超参数组合定义的）

model = model.best_estimator_ #.best_estimator_属性：使用最优超参数定义最优模型
model.score(X_test, y_test)

结果输出： 0.8520437464755751
和陈强老师书上的不一样……但陈强老师书上的R2也小于默认值的拟合优度

4. 变量重要性

import matplotlib.pyplot as plt

sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Gradient Boosting')
plt.tight_layout()

在这里插入图片描述

5. 偏依赖图

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, ['LSTAT', 'RM'])

在这里插入图片描述

6. 考察 M 对MSE的影响

1）超参数组合（我的）

# for循环：找出最优M
from sklearn.metrics import mean_squared_error

scores = []
for n_estimators in range(1, 301):
    model_my = GradientBoostingRegressor(
        n_estimators=n_estimators, 
        subsample=0.3, 
        max_depth=8, 
        learning_rate=0.1, 
        random_state=123)
    model_my.fit(X_train, y_train)
    pred = model_my.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores.append(mse)

index = np.argmin(scores)
range(1, 301)[index]

结果输出： 27

plt.plot(range(1, 301), scores)
plt.axvline(range(1, 301)[index], linestyle='--', color='k', linewidth=1)
plt.axhline(min(scores), linewidth=1, linestyle='--', color='r')

# 标记交点
best_m = range(1, 301)[index]
min_mse = min(scores)

plt.scatter(best_m, min_mse, color='b', zorder=5)
plt.text(best_m, min_mse, 
         f'({best_m},  {min_mse:.6f})', 
         horizontalalignment='right', 
         verticalalignment='bottom', fontsize=11, color='b')

plt.xlabel('Number of Trees')
plt.ylabel('MSE')
plt.title('MSE on Test Set')

在这里插入图片描述

2）超参数组合（书上的）

# for循环：找出最优M
from sklearn.metrics import mean_squared_error
#
scores = []
for n_estimators in range(1, 301):
    model = GradientBoostingRegressor(
        n_estimators=n_estimators, 
        subsample=0.5, 
        max_depth=5, 
        learning_rate=0.1, 
        random_state=123)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores.append(mse)

index = np.argmin(scores)
range(1, 301)[index]

结果输出： 137 还是和陈强老师的结果不一样……笑哭

plt.plot(range(1, 301), scores)
plt.axvline(range(1, 301)[index], linestyle='--', color='k', linewidth=1)
plt.axhline(min(scores), linewidth=1, linestyle='--', color='r')

# 标记交点
best_m = range(1, 301)[index]
min_mse = min(scores)

plt.scatter(best_m, min_mse, color='b', zorder=5)
plt.text(best_m, min_mse, 
         f'({best_m},  {min_mse:.6f})', 
         horizontalalignment='right', 
         verticalalignment='bottom', fontsize=11, color='b')

plt.xlabel('Number of Trees')
plt.ylabel('MSE')
plt.title('MSE on Test Set')

在这里插入图片描述
虽然我的图看起来差距有点大，但是这个主要是想说明：增加提升树的决策树数目，并不容易导致过拟合

另，我怀疑我的数据没对，但是我没有证据。我交叉验证出来的subsample是0.3，但是subsample 的值通常在 0.5 到 1.0 之间（直接从 0.1 开始可能不常见，因为较低的值（如 0.1 或 0.2）可能会导致训练集样本太少，影响模型的学习能力）。

三、二分类提升树

使用Hastie et al. (2009)的垃圾邮件数据spam.csv （参考【学习笔记】陈强-机器学习-Python-Ch8 朴素贝叶斯 )

1. 载入数据+数据处理

import pandas as pd
import numpy as np

#读取CSV文件的路径
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\spam.csv'
spam = pd.read_csv(csv_path)

#定义X与y
X = spam.iloc[:, :-1]
y = spam.iloc[:, -1]

#全样本随机分20%测试集和80%训练集
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

2. 自适应提设法AdaBoost

from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(random_state=123)
model.fit(X_train, y_train)
model.score(X_test, y_test)

结果输出： 0.9489685124864278

`笔记：AdaBoostClassifier ()`

AdaBoostClassifier 是 scikit-learn 库中实现的 AdaBoost（Adaptive Boosting）算法的一个类。AdaBoost 是一种集成学习方法，通过加权组合多个弱分类器来提高分类性能。
它从训练集开始，首先训练一个弱分类器，然后根据分类错误的样本重新加权，接着训练下一个弱分类器，并重复这一过程。最终的分类器是所有弱分类器的加权组合。

#基本语法和参数
from sklearn.ensemble import AdaBoostClassifier

# 初始化 AdaBoostClassifier
model = AdaBoostClassifier(
    base_estimator=None, #基础分类器，即弱学习器。默认值: None(即 DecisionTreeClassifier(max_depth=1))
    n_estimators=50, #要训练的弱分类器的数量。默认值: 50
    learning_rate=1.0, #学习率，用于缩小每个弱分类器的贡献。默认值: 1.0
    algorithm='SAMME.R', #训练算法的类型。默认值: 'SAMME.R'(基于概率估计)
    random_state=None)

基本属性

classes_ 模型在训练期间观察到的类别标签。类型: numpy.ndarray
estimators_训练后的基础分类器的列表。类型：list
n_estimators_训练过程中使用的弱分类器的数量。类型：int
feature_importances_为特征的重要性
coef_在训练过程中计算的分类器的系数。仅在基础分类器支持系数（例如线性模型）时可用。类型: numpy.ndarray
decision_function决策函数的值，即每个样本的预测分数。
predict_proba每个样本的类别概率。对于多类别任务，这个方法返回每个类别的概率。
predict_log_proba每个样本类别的对数概率。对数概率是类别概率的对数值。

3. 使用逻辑损失函数进行二分类问题的梯度提升法

from sklearn.ensemble import GradientBoostingClassifier
GBC=GradientBoostingClassifier(random_state=123)
GBC.fit(X_train, y_train)
GBC.score(X_test, y_test)

结果输出： 0.9543973941368078 高于AdaBoost
这之后的输出数据，就没有和陈强老师一样的了

4. 最优超参数组合

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold

param_distributions = {'n_estimators': range(1, 300), 
                       'max_depth': range(1, 10),
                       'subsample': np.linspace(0.1, 1, 10), 
                       'learning_rate': np.linspace(0.1, 1, 10)}
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=123),
    param_distributions=param_distributions, 
    cv=kfold, n_iter=10, random_state=66)
model.fit(X_train, y_train)
#最优超参数组合
model.best_params_

结果输出： {‘subsample’: 0.7000000000000001, ‘n_estimators’: 279, ‘max_depth’: 3, ‘learning_rate’: 0.30000000000000004}
和陈强老师书上的又不一样……

#最优模型
model = model.best_estimator_
model.score(X_test, y_test)

结果输出： 0.9359391965255157
……比默认参数还低……

5.变量重要性

import matplotlib.pyplot as plt

sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Gradient Boosting')
plt.tight_layout()

在这里插入图片描述
和书上比，虽然变量重要性排序没有吧，但是变量重要性的数值有变化

6. 偏依赖图

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, ['A.52', 'A.53'])

在这里插入图片描述

7. 预测

pred = model.predict(X_test)
table = pd.crosstab(
    y_test, 
    pred, 
    rownames=['Actual'], 
    colnames=['Predicted'])
table

在这里插入图片描述

table = np.array(table)
Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)
Sensitivity  = table[1 , 1] / (table[1, 0] + table[1, 1])
Specificity = table[0, 0] / (table[0, 0] + table[0, 1])
Recall = table[1, 1] / (table[0, 1] + table[1, 1])

print(f'准确率: {Accuracy}')
print(f'敏感度: {Sensitivity}')
print(f'特异度: {Specificity}')
print(f'召回率: {Recall}')

from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(y_test, pred)

结果输出： 准确率: 0.9359391965255157
敏感度: 0.9201101928374655
特异度: 0.946236559139785
召回率: 0.9175824175824175
0.8659299339013035

#画ROC曲线
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(
    model, 
    X_test,
    y_test, 
    plot_chance_level=True)

plt.title('ROC Curve for Random Forest')

在这里插入图片描述

四、多分类提升树

使用玻璃种类数据Glass.csv （参考【学习笔记】陈强-机器学习-Python-Ch6 多项逻辑回归 )

1. 载入数据+数据处理

import pandas as pd
import numpy as np

#读取CSV文件的路径
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\Glass.csv'
Glass = pd.read_csv(csv_path)

#定义X与y
X = Glass.iloc[:, :-1]
y = Glass.iloc[:, -1]

#全样本随机分20%测试集和80%训练集
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

2. 多分类问题的梯度提升法

from sklearn.ensemble import GradientBoostingClassifier
GBC=GradientBoostingClassifier(random_state=123)
GBC.fit(X_train, y_train)
GBC.score(X_test, y_test)

结果输出： 0.7846153846153846

`笔记：GradientBoostingClassifier()`

GradientBoostingClassifier 是 scikit-learn 库中实现的梯度提升分类器。它是基于梯度提升（Gradient Boosting）算法的集成学习模型，通常用于分类任务。GradientBoostingClassifier 构建了一个集成的模型，通过迭代地训练弱分类器（通常是决策树），并在每一步上最小化损失函数来提高分类性能。

#基本语法和参数
from sklearn.ensemble import GradientBoostingClassifier

# 初始化 GradientBoostingClassifier
model = GradientBoostingClassifier(
    n_estimators=100, #要训练的提升树的数量。默认值: 100
    learning_rate=0.1, #学习率（也称为步长），用于缩小每个树的贡献。默认值: 0.1
    max_depth=3, #每棵树的最大深度。默认值: 3
    min_samples_split=2, #内部节点再划分所需的最小样本数。默认值: 2
    min_samples_leaf=1, #每棵树叶子节点的最小样本数。默认值: 1
    max_features=None, #每棵树在寻找最佳分裂时考虑的最大特征数。默认值: None
    subsample=1.0, #每棵树训练时使用的样本比例。默认值: 1.0
    random_state=None)

基本属性

oob_score_ 仅当 subsample < 1.0 时存在，返回基于袋外样本的模型得分。
estimators_训练过程中生成的所有提升树的列表。类型：list
n_estimators_实际使用的基学习器数量。类型：int
feature_importances_特征的重要性
classes_训练数据中出现的类别标签。类型: numpy.ndarray

3. 最优超参数组合

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold

param_distributions = {'n_estimators': range(1, 300), 
                       'max_depth': range(1, 10),
                       'subsample': np.linspace(0.1, 1, 10), 
                       'learning_rate': np.linspace(0.1, 1, 10)}
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=123),
    param_distributions=param_distributions, 
    cv=kfold, random_state=66)
model.fit(X_train, y_train)
#最优超参数组合
model.best_params_

结果输出： {‘subsample’: 0.7000000000000001, ‘n_estimators’: 279, ‘max_depth’: 3, ‘learning_rate’: 0.30000000000000004}

#最优模型
model = model.best_estimator_
model.score(X_test, y_test)

结果输出： 0.7384615384615385 竟然和书上的最优模型一样！！！明明最优参数都不一样！！！

5.变量重要性

import matplotlib.pyplot as plt

sorted_index = model.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Gradient Boosting')
plt.tight_layout()

在这里插入图片描述

6. 偏依赖图

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, ['Mg'], target=1) #第一类玻璃对于Mg的依赖

在这里插入图片描述

7. 预测

pred = model.predict(X_test)
table = pd.crosstab(
    y_test, 
    pred, 
    rownames=['Actual'], 
    colnames=['Predicted'])
table

在这里插入图片描述

赛博机器喵

关注

12
点赞
踩
11

收藏

觉得还不错? 一键收藏
0
评论
【学习笔记】陈强-机器学习-Python-Ch13 提升法（1）

本学习笔记仅为以防自己忘记了，顺便分享给一起学习的网友们参考。如有不同意见/建议，可以友好讨论。本学习笔记所有的代码和数据都可以从陈强老师的个人主页上下载陈强.机器学习及Python应用. 北京：高等教育出版社, 2021.数学原理等详见陈强老师的PPT参考了：网友阡之尘埃的Python机器学习10——梯度提升model = model.best_estimator_ #.best_estimator_属性：使用最优超参数定义最优模型。
复制链接

扫一扫