Machine learning: ensemble learning and gradient boosting decision trees

Article link: https://blog.csdn.net/star_and_sun/article/details/139636955

Ensemble learning

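Different algorithms can all solve the same problem, but their accuracy may differ. Ensemble learning combines several algorithms in some way, with the aim of making the combined model more accurate than the individual ones.
So how are the algorithms combined?
Bootstrap aggregating (bagging)
As the name suggests, it is bootstrap + aggregating.
Bootstrap refers to bootstrap sampling: to keep the sampling random, we draw with replacement (duplicates are allowed), and each draw produces a sample of the same size as the original dataset. Repeating this B times gives B datasets. A model f_b(x) is then trained independently on each dataset, and the predictions are averaged: f(x) = (1/B) * (f_1(x) + ... + f_B(x)).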
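Bagging noticeably improves the stability of low-bias, high-variance models, because averaging many independently trained models reduces variance while leaving the bias roughly unchanged.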
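The bootstrap and aggregation steps described above can be sketched in a few lines. This is only a minimal illustration, not the author's implementation: the helper name `bagging_predict`, the noisy sine toy data, and the choice of `DecisionTreeRegressor` as the low-bias, high-variance base model are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagging_predict(X_train, y_train, X_test, n_rounds=50, seed=0):
    """Hypothetical helper: average n_rounds trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = X_train.shape[0]
    preds = np.zeros((n_rounds, X_test.shape[0]))
    for b in range(n_rounds):
        # Bootstrap: draw n_samples indices with replacement
        idx = rng.integers(0, n_samples, n_samples)
        tree = DecisionTreeRegressor(random_state=b)
        tree.fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    # Aggregate: average the predictions of the B models
    return preds.mean(axis=0)


# Toy data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 6, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.3, size=200)
X_test = np.linspace(0, 6, 100).reshape(-1, 1)
y_true = np.sin(X_test[:, 0])

# A single fully grown tree versus the bagged ensemble
single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_test)
bagged = bagging_predict(X_train, y_train, X_test)

mse_single = np.mean((single - y_true) ** 2)
mse_bagged = np.mean((bagged - y_true) ** 2)
print(f"single tree MSE: {mse_single:.4f}, bagged MSE: {mse_bagged:.4f}")
```

On data like this, the bagged prediction is usually closer to the noise-free curve than a single fully grown tree, which matches the variance-reduction claim above.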

Random forest

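Random forest is an improved version of bagging: besides training each decision tree on a bootstrap sample, it considers only a random subset of the features at each node split, which makes the individual trees less correlated with one another.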

from tqdm import tqdm
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.model_selection import train_test_split

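# Create a random dataset
X, y = make_classification(
    n_samples=1000, # dataset size
    n_features=16, # number of features, i.e. the data dimension
    n_informative=5, # number of informative features
    n_redundant=2, # number of redundant features, random linear combinations of the informative ones
    n_classes=2, # number of classes
    flip_y=0.1, # fraction of samples whose class is assigned at random; larger values make classification harder
    random_state=0 # random seed
)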

print(X.shape)
#%%
class RandomForest():

    def __init__(self, n_trees=10, max_features='sqrt'):
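        # max_features is a DTC parameter: the number of features randomly sampled when splitting a node
        # 'sqrt' uses the square root of the total number of features, None uses all features, 'log2' uses the base-2 log of the total number of features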
        self.n_trees = n_trees
        self.oob_score = 0
        self.trees = [DTC(max_features=max_features)
            for _ in range(n_trees)]

    # Train the model on X and y
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.n_classes = np.unique(y).shape[0]   
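        # Ensemble prediction: accumulate the class probabilities predicted by each tree, then take the class with the largest total as the final class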
        ensemble = np.zeros((n_samples, self.n_classes))
            
        for tree in self.trees:
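            # Bootstrap sampling; duplicates are allowed
            idx = np.random.randint(0, n_samples, n_samples)
            # Samples that were never drawn (out-of-bag, OOB)
            unsampled_mask = np.bincount(idx, minlength=n_samples) == 0
            unsampled_idx = np.arange(n_samples)[unsampled_mask]
            # Train the current decision tree
            tree.fit(X[idx], y[idx])
            # Accumulate the tree's predictions on its OOB samples
            ensemble[unsampled_idx] += tree.predict_proba(X[unsampled_idx])
        # Compute the OOB score; since this is a classification task, we measure it by accuracy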
        self.oob_score = np.mean(y == np.argmax(ensemble, axis=1))
    
    # Predict the class
    def predict(self, X):
        proba = self.predict_proba(X)
        return np.