Boosting is a family of ensemble learning algorithms whose core idea is to combine multiple weak learners into a single strong learner. This article covers four such algorithms: AdaBoost, Gradient Boosting Decision Tree (GBDT), XGBoost, and LightGBM. It focuses on example code and model parameters so that you can get up to speed with each algorithm quickly.
I. AdaBoost
AdaBoost (Adaptive Boosting) is an ensemble learning technique that builds a strong classifier by combining multiple weak classifiers.
Its core idea is to train a sequence of weak learners on successively reweighted versions of the training data, then combine their predictions through a weighted sum (a weighted majority vote for classification) to obtain the final prediction.
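To make the reweighting concrete, here is a simplified NumPy sketch of the multiclass SAMME update (the same variant selected via algorithm='SAMME' in the code below). This is an illustration only; scikit-learn's implementation differs in details:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_samples, n_classes = len(y), 3
w = np.full(n_samples, 1 / n_samples)  # start from uniform sample weights
stumps, alphas = [], []

for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    miss = pred != y
    err = np.average(miss, weights=w)         # weighted error of this round's stump
    if err <= 0 or err >= 1 - 1 / n_classes:  # degenerate or no better than chance
        break
    alpha = np.log((1 - err) / err) + np.log(n_classes - 1)  # SAMME learner weight
    w *= np.exp(alpha * miss)                 # up-weight misclassified samples
    w /= w.sum()                              # renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: alpha-weighted vote over the stumps
votes = np.zeros((n_samples, n_classes))
for stump, alpha in zip(stumps, alphas):
    votes[np.arange(n_samples), stump.predict(X)] += alpha
print("train accuracy:", (votes.argmax(axis=1) == y).mean())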
1. AdaBoostClassifier (classification)
Official docs: AdaBoostClassifier — scikit-learn 1.5.1 documentation
class sklearn.ensemble.AdaBoostClassifier
Example code:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the AdaBoost classifier with a decision stump as the base estimator
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50,
                         learning_rate=1.0, algorithm='SAMME', random_state=66)
# Train the model
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Print the test accuracy
print(f"Test Accuracy: {clf.score(X_test, y_test)}\n")
# Print the classification report
print("classification report:")
print(classification_report(y_test, y_pred))
Output: the test accuracy followed by the classification report for the test set.
2. AdaBoostRegressor (regression)
Official docs: AdaBoostRegressor — scikit-learn 1.5.1 documentation
class sklearn.ensemble.AdaBoostRegressor
Example code:
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
# Prepare the data
rng = np.random.RandomState(66)
num = 100
X = np.linspace(0, 6, num)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
# Initialize the model
regr = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4), n_estimators=300, random_state=rng
)
# Train the model
regr.fit(X, y)
# Make predictions
y_pred = regr.predict(X)
# Print the R^2 score (for regressors, score() returns R^2, not accuracy)
print(f"R^2 Score: {regr.score(X, y):.4f}")
# output:
# R^2 Score: 0.9723
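Note that this example fits and scores on the same data, so the R^2 above reflects training fit rather than generalization. A quick, hedged way to estimate out-of-sample performance is cross-validation:

from sklearn.model_selection import KFold, cross_val_score

# Shuffle before splitting: X is sorted by x, so unshuffled folds would
# force the model to extrapolate to x-ranges it has never seen
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(regr, X, y, cv=cv, scoring="r2")
print(f"CV R^2: {scores.mean():.4f} +/- {scores.std():.4f}")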
Visualization:
import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette("colorblind")
plt.figure()
plt.scatter(X, y, color=colors[0], label="samples")
# Plot the model's predictions, not the noisy targets
plt.plot(X, y_pred, color=colors[1], label="Regressor", linewidth=3)
plt.xlabel("data")
plt.ylabel("target")
plt.title("AdaBoostRegressor")
plt.legend()
plt.show()
II. GBDT
GBDT (Gradient Boosting Decision Tree) is an ensemble learning algorithm that builds a collection of weak tree learners, adding one decision tree at a time to minimize a differentiable loss function. Each tree is fit to the residuals of the current ensemble (more precisely, to the negative gradient of the loss), and the outputs of all trees are summed to produce the final prediction. It works for both classification and regression.
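To make "each tree is fit to the residuals" concrete, here is a minimal sketch of gradient boosting with squared-error loss, where the negative gradient is exactly the residual. This is an illustration, not scikit-learn's implementation:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
learning_rate, n_trees = 0.1, 50

pred = np.full(len(y), y.mean())  # initial model: the constant mean
trees = []
for _ in range(n_trees):
    residual = y - pred                      # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # shrunken additive update
    trees.append(tree)

print("train MSE:", np.mean((y - pred) ** 2))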
1. GradientBoostingClassifier (classification)
Official docs: GradientBoostingClassifier — scikit-learn 1.5.1 documentation
class sklearn.ensemble.GradientBoostingClassifier
Example code:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model parameters
params = {
    "n_estimators": 100,   # number of boosting stages (weak learners)
    "max_depth": 3,        # maximum depth of each tree
    "learning_rate": 0.1,  # learning rate
    "random_state": 66     # random seed for reproducibility
}
# Initialize the GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(**params)
# Train the model
gb_clf.fit(X_train, y_train)
# Make predictions
y_pred = gb_clf.predict(X_test)
# Evaluate the test accuracy
print(f"Test Accuracy: {gb_clf.score(X_test, y_test)}\n")
# Print the classification report
print("classification report:")
print(classification_report(y_test, y_pred))
# Print the feature importances
feature_importances = gb_clf.feature_importances_
print("Feature importances:")
for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"\t{feature}: {importance:.4f}")
Output: the test accuracy, the classification report, and the four feature importances.
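One caveat worth knowing: the impurity-based feature_importances_ above are computed from the training data and can be biased toward high-cardinality features. A model-agnostic alternative is scikit-learn's permutation_importance, sketched here reusing gb_clf and the split from above:

from sklearn.inspection import permutation_importance

# Score drop when each feature is randomly shuffled on the test set
result = permutation_importance(gb_clf, X_test, y_test, n_repeats=10, random_state=66)
for name, mean_drop in zip(iris.feature_names, result.importances_mean):
    print(f"\t{name}: {mean_drop:.4f}")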
2. GradientBoostingRegressor (regression)
Official docs: GradientBoostingRegressor — scikit-learn 1.5.1 documentation
class sklearn.ensemble.GradientBoostingRegressor
Example code:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model parameters
params = {
    'n_estimators': 100,     # number of boosting stages (weak learners)
    'learning_rate': 0.1,    # learning rate
    'max_depth': 3,          # maximum depth of each tree
    'random_state': 42,      # random seed for reproducibility
    'loss': 'squared_error'  # use squared error as the loss function
}
# Initialize the GradientBoostingRegressor
gbr = GradientBoostingRegressor(**params)
# Train the model
gbr.fit(X_train, y_train)
# Predict on the test set
y_pred = gbr.predict(X_test)
# Compute the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.2f}")
# Compute the coefficient of determination (R^2) on the test set
r2_score = gbr.score(X_test, y_test)
print(f"Test R^2 Score: {r2_score:.2f}")
# output:
# Test Mean Squared Error: 1237.63
# Test R^2 Score: 0.93
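Because boosting is additive, GradientBoostingRegressor also provides staged_predict, which yields the prediction after each boosting stage. A short sketch (reusing gbr, X_test, and y_test from above) to see how test error evolves as trees are added:

import numpy as np

# Test MSE after each boosting stage
staged_mse = [mean_squared_error(y_test, stage_pred)
              for stage_pred in gbr.staged_predict(X_test)]
best_stage = int(np.argmin(staged_mse))
print(f"best stage: {best_stage + 1}, test MSE there: {staged_mse[best_stage]:.2f}")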
Visualization:
import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette("colorblind")
# Place the test samples at the tail of the original 1000-point index range
check_len = len(y_test)
data = [x + (1000 - check_len) for x in range(check_len)]
plt.figure()
plt.plot(data, y_test, color=colors[1], label="Actual", linewidth=2)
plt.plot(data, y_pred, color=colors[2], label="Predictions", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("GradientBoostingRegressor")
plt.legend()
plt.show()
III. XGBoost
XGBoost (eXtreme Gradient Boosting) is an efficient machine learning algorithm that extends gradient boosting. Introduced by Tianqi Chen and collaborators in 2016, it aims to provide a fast, flexible, and scalable tree boosting framework.
Official docs: Python Package Introduction — xgboost 2.1.1 documentation
Note: this article uses the standard (CPU) build of XGBoost; see the official documentation for the GPU-accelerated version.
1. XGBClassifier (classification)
class xgboost.XGBClassifier(*, objective='binary:logistic', **kwargs)
Example code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import xgboost as xgb
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Set the parameters
param = {
    "tree_method": 'hist',         # tree construction algorithm
    "objective": 'multi:softmax',  # objective function
    "eval_metric": 'merror',       # evaluation metric (multiclass error rate)
    "num_class": 3,                # number of classes
    "n_estimators": 10,            # number of boosting rounds (weak learners)
    "max_depth": 3,                # maximum depth of each tree
    "learning_rate": 0.1,          # learning rate
    "subsample": 0.8,              # row subsampling ratio per tree
    "random_state": 42             # random seed for reproducibility
}
# Early-stopping callback
early_stop = xgb.callback.EarlyStopping(
    rounds=2,             # stop after this many rounds without improvement
    metric_name='merror', # metric monitored for early stopping
    save_best=True        # keep the best model
)
# Create the XGBClassifier instance
xgb_clf = xgb.XGBClassifier(**param, callbacks=[early_stop])
# Train the model
xgb_clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])
# Predict on the test set
y_pred = xgb_clf.predict(X_test)
Output: the per-round validation metrics logged while training with early stopping.
Performance evaluation (example code):
# Print the classification report
print("classification report:")
print(classification_report(y_test, y_pred))
# Print the feature importances
feature_importances = xgb_clf.feature_importances_
print("Feature importances:")
for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"\t{feature}: {importance:.4f}")
Output: the classification report and the four feature importances.
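If you need to persist the trained classifier, the sklearn wrapper exposes XGBoost's native save_model/load_model; a short sketch (the filename is arbitrary):

# Save to XGBoost's JSON format and reload into a fresh estimator
xgb_clf.save_model("xgb_iris.json")  # filename chosen for illustration
loaded_clf = xgb.XGBClassifier()
loaded_clf.load_model("xgb_iris.json")
assert (loaded_clf.predict(X_test) == y_pred).all()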
2. XGBRegressor (regression)
class xgboost.XGBRegressor(*, objective='reg:squarederror', **kwargs)
Example code:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
# Generate a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Set the parameters
param = {
    "tree_method": 'hist',            # tree construction algorithm
    "objective": 'reg:squarederror',  # objective function
    "eval_metric": 'rmse',            # evaluation metric
    "n_estimators": 100,              # number of boosting rounds (weak learners)
    "max_depth": 3,                   # maximum depth of each tree
    "learning_rate": 0.1,             # learning rate
    "base_score": 0.5,                # global bias (initial prediction)
    "random_state": 42                # random seed for reproducibility
}
# Create the XGBRegressor instance
xgb_reg = xgb.XGBRegressor(**param)
# Train the model
xgb_reg.fit(X_train, y_train)
# Predict on the test set
y_pred = xgb_reg.predict(X_test)
# Compute the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.2f}")
# Compute the coefficient of determination (R^2) on the test set
r2_score = xgb_reg.score(X_test, y_test)
print(f"Test R^2 Score: {r2_score:.2f}")
# output:
# Test Mean Squared Error: 1357.97
# Test R^2 Score: 0.93
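The "Python Package Introduction" page linked above centers on XGBoost's native API (DMatrix plus xgb.train) rather than the sklearn wrapper. A hedged sketch of an equivalent native-API training run, reusing the split above:

# Native API: wrap the data in DMatrix and call xgb.train directly
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
native_params = {
    "objective": "reg:squarederror",
    "max_depth": 3,
    "eta": 0.1,              # native-API alias for learning_rate
    "tree_method": "hist",
    "seed": 42,
}
bst = xgb.train(native_params, dtrain, num_boost_round=100,
                evals=[(dtest, "test")], verbose_eval=False)
y_pred_native = bst.predict(dtest)
print(f"native-API test MSE: {mean_squared_error(y_test, y_pred_native):.2f}")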
Visualization:
import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette("colorblind")
# Place the test samples at the tail of the original 1000-point index range
check_len = len(y_test)
data = [x + (1000 - check_len) for x in range(check_len)]
plt.figure()
plt.plot(data, y_test, color=colors[1], label="Actual", linewidth=2)
plt.plot(data, y_pred, color=colors[2], label="Predictions", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("XGBRegressor")
plt.legend()
plt.show()
IV. LightGBM
LightGBM is an open-source gradient boosting framework developed by Microsoft. Like XGBoost, it is an optimized, highly efficient implementation of GBDT built on tree-based learning algorithms. It is designed to handle large-scale data and is known for high speed, low memory consumption, and strong accuracy.
Official docs: Python API — LightGBM 4.5.0.99 documentation
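One practical advantage of LightGBM is native handling of categorical features, which often removes the need for one-hot encoding. A hedged sketch on a synthetic pandas DataFrame (the column names and data here are made up for illustration):

import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "num_feat": rng.normal(size=200),
    "cat_feat": pd.Categorical(rng.choice(["a", "b", "c"], size=200)),
})
# Target depends on both the numeric and the categorical column
y = ((df["num_feat"] > 0).to_numpy() ^ (df["cat_feat"] == "a").to_numpy()).astype(int)

clf = lgb.LGBMClassifier(n_estimators=20, random_state=42, verbose=-1)
clf.fit(df, y)  # columns with pandas 'category' dtype are split as categorical
print("train accuracy:", (clf.predict(df) == y).mean())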
1. LGBMClassifier (classification)
class lightgbm.LGBMClassifier
Example code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import lightgbm as lgb
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Set the parameters
param = {
    "boosting_type": 'gbdt',    # boosting type
    "num_leaves": 31,           # maximum number of leaves per tree
    "max_depth": -1,            # -1 means no depth limit
    "learning_rate": 0.1,       # learning rate
    "n_estimators": 10,         # number of boosting rounds (weak learners)
    "objective": 'multiclass',  # multiclass classification
    "random_state": 42,         # random seed for reproducibility
}
# Create the LGBMClassifier instance
lgbm_clf = lgb.LGBMClassifier(**param)
# Train the model
lgbm_clf.fit(X_train, y_train)
# Predict on the test set
y_pred = lgbm_clf.predict(X_test)
Training prints LightGBM's log output (dataset statistics and per-iteration messages).
Note: the warning "[LightGBM] [Warning] No further splits with positive gain, best gain: -inf" means no remaining split yields positive gain. On a small dataset like iris this usually indicates the model has already fit the data, or that all candidate splits have been exhausted; it does not necessarily signal poor model performance.
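If you want early stopping analogous to the XGBoost example above, LightGBM supplies it through callbacks; a hedged sketch that trains a separate copy with a validation set:

# Stop when the validation multi_logloss fails to improve for
# 5 consecutive rounds; log_evaluation controls log frequency
es_clf = lgb.LGBMClassifier(**param)
es_clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[lgb.early_stopping(stopping_rounds=5), lgb.log_evaluation(period=1)],
)
y_pred_es = es_clf.predict(X_test)  # predictions use the best iteration found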
Performance evaluation (example code):
# Print the classification report
print("classification report:")
print(classification_report(y_test, y_pred))
# Print the feature importances (LightGBM's default importance_type is
# 'split': the number of times each feature is used in the trees)
feature_importances = lgbm_clf.feature_importances_
print("Feature importances:")
for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"\t{feature}: {importance:.4f}")
Output: the classification report and the four feature importances.
2. LGBMRegressor (regression)
class lightgbm.LGBMRegressor
Example code:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
# Generate a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Set the parameters
param = {
    "boosting_type": 'gbdt',    # boosting type
    "num_leaves": 31,           # maximum number of leaves per tree
    "max_depth": -1,            # -1 means no depth limit
    "learning_rate": 0.05,      # learning rate
    "n_estimators": 100,        # number of boosting rounds (weak learners)
    "objective": 'regression',  # regression
    "random_state": 42,         # random seed for reproducibility
}
# Create the LGBMRegressor instance
lgb_reg = lgb.LGBMRegressor(**param)
# Train the model
lgb_reg.fit(X_train, y_train)
# Predict on the test set
y_pred = lgb_reg.predict(X_test)
# Compute the mean squared error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.2f}")
# Compute the coefficient of determination (R^2) on the test set
r2_score = lgb_reg.score(X_test, y_test)
print(f"Test R^2 Score: {r2_score:.2f}")
# output:
# Test Mean Squared Error: 1219.03
# Test R^2 Score: 0.93
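As with XGBoost, LightGBM also has a native API (lgb.Dataset plus lgb.train) alongside the sklearn wrapper; a hedged sketch equivalent to the estimator above:

# Native API: wrap the data in lgb.Dataset and call lgb.train directly
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_test, label=y_test, reference=train_set)
native_params = {
    "objective": "regression",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "seed": 42,
    "verbosity": -1,  # silence per-iteration logs
}
booster = lgb.train(native_params, train_set, num_boost_round=100,
                    valid_sets=[valid_set])
y_pred_native = booster.predict(X_test)
print(f"native-API test MSE: {mean_squared_error(y_test, y_pred_native):.2f}")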
Visualization:
import matplotlib.pyplot as plt
import seaborn as sns
colors = sns.color_palette("colorblind")
# Place the test samples at the tail of the original 1000-point index range
check_len = len(y_test)
data = [x + (1000 - check_len) for x in range(check_len)]
plt.figure()
plt.plot(data, y_test, color=colors[1], label="Actual", linewidth=2)
plt.plot(data, y_pred, color=colors[2], label="Predictions", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("LGBMRegressor")
plt.legend()
plt.show()
Summary
This article surveyed four machine learning algorithms (AdaBoost, GBDT, XGBoost, and LightGBM) through simple hands-on examples of their scikit-learn-style APIs. The small datasets used here cannot demonstrate the algorithms' real-world efficiency, such as the performance of XGBoost and LightGBM on very large datasets. The usage shown is also only the most basic; there is plenty of room to do better with techniques such as early stopping, hyperparameter tuning, and cross-validation, one of which is sketched below.
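As a concrete starting point for the tuning and cross-validation just mentioned, here is a hedged GridSearchCV sketch for GradientBoostingClassifier; the grid values are arbitrary examples, not recommendations:

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=66),
    param_grid,
    cv=5,               # 5-fold cross-validation for each combination
    scoring="accuracy",
    n_jobs=-1,          # use all CPU cores
)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.4f}")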