Ensemble Learning in Python

Article link: https://blog.csdn.net/yu_moyu/article/details/132566354

Contents

1. Decision tree basics
2. Bagging
- Random forest
3. Boosting

1. Decision tree basics

1.1 Overview

Tree
Decision tree
Entropy
Information gain
Some definitions
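The entropy and information gain listed above can be computed in a few lines. This is a minimal sketch, not taken from the original post: the function names and the toy weather data are illustrative assumptions. Entropy is H = -Σ p_i log2(p_i), and information gain is the entropy of the labels minus the weighted entropy of the groups produced by a split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: whether to play outside, given the weather
play = ["yes", "yes", "no", "no", "yes", "no"]
weather = ["sunny", "sunny", "rainy", "rainy", "cloudy", "cloudy"]

print(f"entropy of labels: {entropy(play):.3f}")
print(f"information gain of weather: {information_gain(play, weather):.3f}")
```

With three "yes" and three "no" labels the entropy is 1 bit. Splitting on weather leaves only the "cloudy" group mixed, so the gain is 2/3 of a bit.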

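1.2 The ID3 and C4.5 algorithms

ID3: uses information gain as the splitting criterion.
Decision tree algorithms: the ID3 algorithm https://zhuanlan.zhihu.com/p/133846252
C4.5: uses the information gain ratio as the splitting criterion.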
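To see why C4.5 switches from information gain to gain ratio, here is a small sketch (helper names and toy data are assumptions made for illustration). Gain ratio divides the information gain by the split information, i.e. the entropy of the feature's own value distribution, which penalizes features with many distinct values such as an ID column.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(labels, feature_values):
    """Information gain divided by the split information of the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)  # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
useful = ["a", "a", "b", "b"]       # two values, separates the classes perfectly
row_id = ["r1", "r2", "r3", "r4"]   # unique per row, like an ID column

print(f"gain ratio of useful feature: {gain_ratio(labels, useful):.2f}")
print(f"gain ratio of row id:         {gain_ratio(labels, row_id):.2f}")
```

Both features have the same information gain (1 bit), so ID3 could not tell them apart. The gain ratio halves the score of the ID-like column because its split information is 2 bits.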

1.3 The CART algorithm

Splitting criterion: the Gini index (Gini impurity).
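As a quick illustration of the Gini criterion, the sketch below computes Gini impurity, 1 - Σ p_i², for a node and for a binary split (CART grows binary trees). The function names and the toy labels are assumptions for this example only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class frequencies."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(left, right):
    """Weighted Gini impurity of a binary split; CART prefers the split with the lowest value."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

labels = ["a", "a", "b", "b"]
print(gini(labels))                            # 0.5
print(gini_of_split(["a", "a"], ["b", "b"]))   # 0.0
print(gini_of_split(["a", "b"], ["a", "b"]))   # 0.5
```

A pure split drives the weighted impurity to 0, while a split that leaves both children as mixed as the parent brings no improvement.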

1.4 Tree pruning

Pruning prevents overfitting.
How? Add a penalty term on tree complexity to the objective.
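One concrete form of this penalty is cost-complexity pruning, which minimizes R(T) + α·|T|, where R(T) is the training error of tree T and |T| is its number of leaves. The sketch below uses scikit-learn's `ccp_alpha` parameter; choosing this particular penalty and the iris dataset is an assumption for illustration, since the post does not say which penalty it has in mind.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Alpha values at which the fully grown tree loses another subtree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

leaf_counts = []
for alpha in path.ccp_alphas:
    alpha = max(alpha, 0.0)  # guard against tiny negative values caused by rounding
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test accuracy={tree.score(X_test, y_test):.2f}")
```

A larger α makes each extra leaf more expensive, so the tree shrinks as α grows. In practice α is usually picked by cross-validation rather than on the test set.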

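2. Bagging

Bagging is a parallel ensemble method: the base learners are trained independently of one another. An ensemble can (1) train multiple base learners with the same algorithm on different training sets, or (2) train multiple base learners with different training algorithms. Strictly speaking, bagging (bootstrap aggregating) refers to case (1), where each training set is a bootstrap sample drawn with replacement from the original data.
Once all the learners are trained, the ensemble predicts a new instance by simply aggregating their outputs. The aggregation function is typically the statistical mode for classification (the most frequent prediction, as in a hard-voting classifier) and the average for regression.
Random forest is the canonical application of bagging.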
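The train-then-aggregate procedure described above can be written out by hand. This is a rough sketch under assumptions of my own (iris data, 25 decision trees, integer class labels), not the post's code: each tree is fit on a bootstrap sample, and predictions are combined by hard voting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_learners = 25  # arbitrary choice for illustration
learners = []
for i in range(n_learners):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    learners.append(DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx]))

# Hard voting: every learner predicts, and the most frequent class wins
all_preds = np.stack([m.predict(X_test) for m in learners])  # shape (n_learners, n_test)
y_pred = np.array([np.bincount(votes).argmax() for votes in all_preds.T])

bagged_accuracy = (y_pred == y_test).mean()
print("Bagged accuracy:", bagged_accuracy)
```

scikit-learn packages the same idea as `BaggingClassifier`. A random forest additionally restricts each split to a random subset of the features.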

Random forest

Random forest classification with scikit-learn

# Import the required libraries and modules
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

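# Create a synthetic classification dataset.
# The call below raises a ValueError with the default n_clusters_per_class=2, because
# n_classes * n_clusters_per_class must not exceed 2 ** n_informative (default n_informative=2):
# X, y = make_classification(n_samples=100, n_features=20, n_classes=4, random_state=42)
X, y = make_classification(n_samples=100, n_features=20, n_classes=4, n_clusters_per_class=1, random_state=42)
# Inspect the first 5 rows
print("First 5 rows of the feature matrix (X):")
print(X[:5])
print("First 5 target labels (y):")
print(y[:5])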

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the random forest classifier and fit it to the training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Predict with the fitted model
y_pred = rf_classifier.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
report = classification_report(y_test, y_pred)
print('Classification Report:\n', report)

## Feature importance visualization
import matplotlib.pyplot as plt
import numpy as np
# Get the feature importances
feature_importance = rf_classifier.feature_importances_
# Build the feature names
feature_names = [f"Feature {i}" for i in range(len(feature_importance))]
# Draw a horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_importance)), feature_importance, tick_label=feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature Name')
plt.title('Random Forest Feature Importance')
plt.show()

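## Decision tree visualization
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image
# Pick one tree from the forest
sub_tree = rf_classifier.estimators_[0]
# Export the tree structure in DOT format
dot_data = export_graphviz(sub_tree, out_file=None, filled=True, rounded=True, precision=2)
# Build the graph object and display it (rendering requires Graphviz; Image shows inline in a Jupyter notebook)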
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

## Performance visualization
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
# Plot the confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Random forest reference, with example applications:
Random Forests, Leo Breiman and Adele Cutler

3. Boosting

3.1 AdaBoost

AdaBoost stands for Adaptive Boosting.
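The "adaptive" part refers to reweighting the training samples after every round so that the next weak learner concentrates on the examples misclassified so far. The sketch below is a bare-bones version of the classic binary AdaBoost update with labels in {-1, +1}; the dataset, the number of rounds, and all variable names are assumptions for illustration, and scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, n_features=10, random_state=42)
y = np.where(y01 == 1, 1, -1)  # relabel the classes as -1 / +1

n_rounds = 20
weights = np.full(len(y), 1 / len(y))  # start with uniform sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)                      # vote strength of this stump
    weights *= np.exp(-alpha * y * pred)                       # raise weights of misclassified samples
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_acc = (np.sign(score) == y).mean()
print(f"training accuracy after {n_rounds} rounds: {train_acc:.2f}")
```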
AdaBoost with scikit-learn

# Import the required libraries and modules
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a synthetic classification dataset.
# With the default n_clusters_per_class=2 the call below raises a ValueError, so it stays commented out:
# X, y = make_classification(n_samples=100, n_features=20, n_classes=4, random_state=42)
X, y = make_classification(n_samples=100, n_features=20, n_classes=4, n_clusters_per_class=1, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

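# Create a weak learner (here, a depth-1 decision tree, i.e. a decision stump)
base_classifier = DecisionTreeClassifier(max_depth=1)

# Create the AdaBoost classifier and specify the weak learner
adaboost_classifier = AdaBoostClassifier(base_classifier, n_estimators=50)

# Fit on the training data
adaboost_classifier.fit(X_train, y_train)

# Predict class labels for the test set
y_pred = adaboost_classifier.predict(X_test)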

AdaBoost references:
1. AdaBoost model and case study (Python)
2. [Project walkthrough] Implementing an AdaBoost classification model in Python (AdaBoostClassifier)

3.2 GBDT

GBDT stands for Gradient Boosting Decision Trees.
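To give a feel for what "gradient boosting" means, here is a stripped-down regression sketch with squared-error loss, where the negative gradient is simply the residual y - F(x). Each new tree is fit to the current residuals and added with a small learning rate. The synthetic sine data and every name below are assumptions for illustration; `GradientBoostingClassifier` handles classification losses and many more details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression problem: noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_rounds = 100

prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []
for _ in range(n_rounds):
    residual = y - prediction  # negative gradient of 0.5 * (y - F)^2 with respect to F
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # take a small step along the fitted gradient
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - prediction) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```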
GBDT classification with scikit-learn

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load an example dataset (the iris dataset here)
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the GBDT classifier
gbdt_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
gbdt_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = gbdt_classifier.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

GBDT references:
GBDT: principles, derivation, Python implementation, visualization, and applications

3.3 XGBoost

XGBoost (eXtreme Gradient Boosting) is a gradient-boosted tree method.
XGBoost classification with the xgboost library

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score   # accuracy
# Load the sample dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # split the dataset

# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'verbosity': 1,  # replaces the deprecated 'silent' parameter
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}

plst = params.copy()

dtrain = xgb.DMatrix(X_train, label=y_train)  # build the DMatrix data structure
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds)  # train the XGBoost model

# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy: %.2f%%" % (accuracy * 100.0))

# Plot feature importance
plot_importance(model)
plt.show()