Random Forest: Introduction, Use Cases, and Worked Examples

Latest recommended article published on 2025-04-21 18:46:39

战族狼魂

Article link: https://blog.csdn.net/nndsb/article/details/140371219

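Introduction

Random forest is an ensemble learning method used for classification, regression, and other tasks. It builds many decision trees (often hundreds or more), trains each one on a random sample of the data while considering random subsets of the features, and combines their predictions to improve accuracy and robustness.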

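How It Works

Random sampling (Bootstrap Aggregating, Bagging):
- Multiple subsets are drawn from the training set at random with replacement (bootstrap samples), and each subset is used to train one decision tree. Every tree is therefore trained on a different sample of the data.
Random feature selection:
- When a tree splits a node, it does not consider all features. It looks for the best split among a randomly chosen subset of them. This makes the trees less correlated with each other, which increases the diversity of the ensemble and reduces overfitting.
Voting / averaging:
- For classification, every tree votes and the class with the most votes becomes the final prediction. (scikit-learn's implementation averages the trees' predicted class probabilities and picks the most probable class.)
- For regression, the forest averages the predictions of all trees.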
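The three steps above can be sketched by hand with plain decision trees. This is only a minimal illustration, not how scikit-learn implements its forests internally: the tree count `n_trees = 25`, the seed, and the variable names are assumptions chosen for the example, and the vote is a hard majority vote as described in the list.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value, not a recommendation
n_classes = len(np.unique(y_train))

trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, i.e. draw row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: max_features="sqrt" makes every split consider only a
    # random subset of the features
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote over the trees, one column per test sample
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=n_classes).argmax(), 0, all_preds
)

accuracy = np.mean(votes == y_test)
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```

In practice `RandomForestClassifier` does all of this in one call; the loop is only meant to make the roles of the bootstrap sample, the feature subset, and the vote visible.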

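Use Cases

Random forests generalize well and are robust, which makes them a good fit for the following scenarios:

Classification: for example spam detection, disease diagnosis, and customer segmentation.
Regression: for example house price prediction and stock price prediction.
Feature importance evaluation: a random forest can score the importance of each feature, which helps in understanding which features in the data matter most for the prediction.
Handling missing values: some random forest implementations can cope with missing values, for example by filling them with the most frequent value (categorical features) or the mean or median (numeric features). Whether a given library accepts missing values directly depends on the implementation and its version, so imputing them beforehand is the safe default.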
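For the missing-value scenario, one common pattern is to impute first and then fit the forest. The sketch below assumes scikit-learn and uses `SimpleImputer` with the median strategy; the 10% missing rate is injected artificially into the Iris data purely for illustration, and this is one possible approach rather than the only way to treat missing data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Artificially blank out about 10% of the values (illustration only)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.3, random_state=42
)

# The imputer learns column medians on the training data only and
# applies the same medians to the test data inside the pipeline
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(f"Accuracy with imputed values: {score:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking test-set statistics into training.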

Examples

1. Classification

We use the classic Iris dataset for the classification example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
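Besides hard labels, the classifier can also report how strongly the trees agree. The sketch below, assuming scikit-learn's `RandomForestClassifier`, compares `predict_proba` with a manual average of the per-tree probabilities taken from `estimators_`, and derives the predicted label from the highest averaged probability. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Class probabilities as reported by the forest
proba = clf.predict_proba(X_test)

# The same quantity computed by hand: mean of each tree's probabilities
manual_proba = np.mean(
    [tree.predict_proba(X_test) for tree in clf.estimators_], axis=0
)

# The predicted label is the class with the highest averaged probability
labels = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3].round(2))
print(labels[:3])
```

Rows of `proba` that are far from 0 or 1 mark test samples on which the trees disagree, which can be a useful signal of uncertain predictions.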

2. Regression

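We use the California housing dataset for the regression example. (The Boston Housing dataset is no longer shipped with scikit-learn: load_boston was removed in version 1.2.) Note that fetch_california_housing downloads the data on first use.

Python code:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the dataset (target: median house value in units of $100,000)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the random forest regressor
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

# Predict on the test set
y_pred = reg.predict(X_test)

# Evaluate
print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred)}')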
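A raw mean squared error is hard to interpret on its own, so it can help to report RMSE (same units as the target), R², and a trivial baseline alongside it. The sketch below is an illustration under assumptions: it swaps in scikit-learn's bundled diabetes dataset so that it runs without a download, and uses `DummyRegressor` as a "predict the training mean" baseline.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Bundled dataset, used here only so the example runs offline
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)            # error in the units of the target
r2 = r2_score(y_test, y_pred)  # share of variance explained

# Baseline that always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))

print(f"RMSE: {rmse:.2f} (baseline: {baseline_rmse:.2f})")
print(f"R^2:  {r2:.3f}")
```

If the forest's RMSE is not clearly below the baseline, the features carry little signal or the model needs tuning.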

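Feature Importance

Being able to estimate feature importance is one of the main strengths of random forests. Inspecting the importances shows which features contribute most to the predictions.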

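Python code:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Reload the Iris data: X and y were overwritten by the regression example
iris = load_iris()
X, y = iris.data, iris.target

# Train the random forest
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get the feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names

# Plot the importances, sorted from most to least important
indices = np.argsort(feature_importances)[::-1]
plt.figure()
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), feature_importances[indices], align="center")
plt.xticks(range(X.shape[1]), [features[i] for i in indices])
plt.xlim([-1, X.shape[1]])
plt.show()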
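When a plot is not convenient, the same ranking can be printed as text. The sketch below assumes scikit-learn's impurity-based `feature_importances_`, which sum to 1; the per-tree standard deviation is added as a rough indication of how stable each score is. Impurity-based scores can favor features with many distinct values, so treat the ranking as a guide rather than a definitive answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(iris.data, iris.target)

importances = clf.feature_importances_
# Spread of each feature's importance across the individual trees
spread = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)

# Sort from most to least important
order = np.argsort(importances)[::-1]
ranking = [
    (iris.feature_names[i], float(importances[i]), float(spread[i]))
    for i in order
]

for name, score, std in ranking:
    print(f"{name:<20} {score:.3f} (+/- {std:.3f})")
```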