Machine Learning with Python Series (14): Random Forests

会飞的Anthony

Categories: Information Systems, Machine Learning, Artificial Intelligence. Tags: machine learning, python, random forests

Article link: https://blog.csdn.net/ljd939952281/article/details/141426712

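Introduction

In the previous installment we looked at Bagging and saw that building many tree models is an effective way to reduce variance. However, the trees produced by Bagging can still be correlated with one another, which limits how much variance the ensemble actually removes. Random Forests address this problem: they extend Bagging by considering only a random subset of features at every split of every tree, which further decorrelates the trees.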
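To see why correlation limits the benefit of averaging, here is a standard result, stated under the simplifying assumption that the B trees all have the same variance σ² and the same pairwise correlation ρ (the symbols T_b, σ² and ρ are introduced here for illustration and do not appear elsewhere in this post):

```latex
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term shrinks toward zero but the first does not, so once there are enough trees the only way to reduce variance further is to lower ρ. That is what the random feature subsets are for.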

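1. Out of Bag (OOB) Evaluation

In Bagging, each tree sees only a subset of the training data. The samples a given tree never saw are called its Out of Bag (OOB) data. Because the tree was not trained on these samples, they can serve as a validation set for it. Concretely, after training each tree we measure its accuracy on its own OOB data, then average the OOB accuracies of all trees to obtain an OOB score for the whole model.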
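As a quick illustration, scikit-learn can report an OOB score directly. The sketch below assumes the iris dataset and 100 trees, both chosen only for the example. Note that scikit-learn aggregates the OOB predictions per sample before scoring, so its number may differ slightly from the per-tree average described above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True scores each sample using only the trees whose
# bootstrap sample did not contain that sample.
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X, y)

print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

Since the OOB score comes for free during training, it can stand in for a separate validation split when data is scarce.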

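2. Random Feature Subsets

A random forest is built with Bagging, but at each split of each tree only a random subset of the features, of a fixed size, is considered. This further decorrelates the trees. For classification trees, the subset size is commonly the square root of the total number of features.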
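The selection step can be sketched in a few lines. The helper name `sample_split_features` is hypothetical and written only for this illustration. It shows which features a single split would be allowed to look at, not how the split itself is chosen.

```python
import math

import numpy as np


def sample_split_features(n_features, rng):
    """Return the indices of the features one split may consider."""
    # Square-root rule for classification; always keep at least one feature
    k = max(1, math.isqrt(n_features))
    return rng.choice(n_features, size=k, replace=False)


rng = np.random.default_rng(0)
n_features = 4  # e.g. the iris dataset has four features

# Every split draws a fresh subset, so different splits see different features
for split in range(3):
    candidates = sample_split_features(n_features, rng)
    print(f"split {split}: candidate feature indices {sorted(candidates.tolist())}")
```

In scikit-learn the same idea is controlled by the `max_features` parameter of the tree and forest estimators.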

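3. Feature Importance

Each decision tree in a random forest can compute how much every feature contributes to reducing impurity. Averaging these contributions over all trees yields the final feature importance ranking, which helps us understand which features matter most to the model.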
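A small sketch of this averaging, assuming scikit-learn's `RandomForestClassifier` on the iris dataset (the dataset and the 100 trees are choices made for this example): the forest's `feature_importances_` should closely match the normalized mean of the per-tree importances.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances reported by the forest
importances = forest.feature_importances_

# The same quantity computed by hand: average the per-tree importances, then normalize
manual = np.mean([tree.feature_importances_ for tree in forest.estimators_], axis=0)
manual = manual / manual.sum()

# Print the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{iris.feature_names[idx]:<20} {importances[idx]:.3f}")
```

Impurity-based importances are computed on the training data, so they are best read as a rough ranking and not as exact effect sizes.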

4. Implementation from Scratch

Code example

# Import the required libraries
import random

import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

# Implement the RandomForest class

class RandomForest:
    def __init__(self, B, bootstrap_ratio, with_no_replacement=True):
        self.B = B                                      # number of trees
        self.bootstrap_ratio = bootstrap_ratio          # fraction of the training set drawn per tree
        self.with_no_replacement = with_no_replacement  # True: never draw the same row twice for one tree
        # max_features='sqrt': each split considers a random subset of sqrt(n) features
        self.tree_params = {'max_depth': 2, 'max_features': 'sqrt'}
        self.models = [DecisionTreeClassifier(**self.tree_params) for _ in range(B)]

    def fit(self, X, y):
        m, n = X.shape
        sample_size = int(self.bootstrap_ratio * m)
        xsamples = np.zeros((self.B, sample_size, n))
        ysamples = np.zeros((self.B, sample_size))
        xsamples_oob = []
        ysamples_oob = []

        # Draw one sample of the training data per tree and record its OOB rows
        for i in range(self.B):
            idxes = []  # in-bag row indices for tree i
            for j in range(sample_size):
                idx = random.randrange(m)
                if self.with_no_replacement:
                    while idx in idxes:
                        idx = random.randrange(m)
                idxes.append(idx)
                xsamples[i, j, :] = X[idx]
                ysamples[i, j] = y[idx]
            mask = np.zeros(m, dtype=bool)
            mask[idxes] = True             # True marks in-bag rows
            xsamples_oob.append(X[~mask])  # every other row is out of bag
            ysamples_oob.append(y[~mask])

        # Fit each tree on its own sample and score it on its own OOB data
        oob_score = 0
        print("======Out of bag score for each tree======")
        for i, model in enumerate(self.models):
            model.fit(xsamples[i], ysamples[i])
            yhat = model.predict(xsamples_oob[i])
            score = accuracy_score(ysamples_oob[i], yhat)
            oob_score += score
            print(f"Tree {i}", score)
        self.avg_oob_score = oob_score / len(self.models)
        print("======Average out of bag score======")
        print(self.avg_oob_score)

    def predict(self, X):
        # Collect every tree's prediction, then take a majority vote per sample
        predictions = np.zeros((self.B, X.shape[0]))
        for i, model in enumerate(self.models):
            predictions[i, :] = model.predict(X)
        # ravel() gives a 1-D result whether or not this SciPy version keeps the reduced axis
        return stats.mode(predictions, axis=0)[0].ravel()

model = RandomForest(B=5, bootstrap_ratio=0.8)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))

5. Sklearn Implementation

# Use RandomForestClassifier from Sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [10, 50, 100], 
              "criterion": ["gini", "entropy"],
              "max_depth": np.arange(1, 10)}
model = RandomForestClassifier()

grid = GridSearchCV(model, param_grid, refit=True)
grid.fit(X_train, y_train)

print(grid.best_params_)

yhat = grid.predict(X_test)

print(classification_report(y_test, yhat))

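When to Use Random Forests

Advantages:

Reduces overfitting by voting across many trees
Trees can be trained in parallel, which speeds up computation
Works well on high-dimensional data
Provides feature importance estimates
Can handle missing data (depending on the implementation)
Suitable for imbalanced datasets
Handles both classification and regression problems

Disadvantages:

Tends to perform less well on regression than on classification
The model is fairly complex and hard to interpret
Insensitive to rare features or rare outcomes
In some cases, more samples do not improve accuracy

For structured data, if you want high accuracy and are less concerned with interpretability, a random forest is a good choice.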

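Conclusion

Random forests are an ensemble learning method that combines many decision trees, voting or averaging over them to improve accuracy and robustness. They reduce the overfitting a single decision tree is prone to, cope with high-dimensional data and imbalanced datasets, and provide useful feature importance estimates. Although a random forest is harder to interpret in depth than a single tree, its strong predictive performance has made it popular in practice. In short, random forests are a flexible and powerful tool, especially for tasks that call for high accuracy and do not demand much interpretability.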
