[Machine Learning] 01. Random Forest Regression in Python: Correlation Analysis and Feature Importance

MFT小白

Last modified: 2023-12-26 13:16:22

First published: 2023-12-26 13:08:40

Article link: https://blog.csdn.net/zhangbing0116/article/details/135183359

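This article shows how to use the Iris dataset in place of real data, so that customer data stays protected, to train a random forest regression model, make predictions, and evaluate performance, including computing the mean squared error and R² score and analyzing feature importance.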


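Background: I had a regression task, but to protect the client's data I use the Iris dataset as a stand-in here and implement part of the workflow with the random forest algorithm. Note that the Iris target is a class label (0, 1, 2), so it is treated as a numeric value purely for demonstration.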

The complete code is at the end.

1. Load the dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Load the example dataset and hold out 20% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

print(iris.DESCR)

This prints a description of the dataset, which includes the following excerpt:

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

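2. Print the correlation matrix of the features

# Feature names used to label the heatmap axes
feature_names = iris.feature_names

# Compute the correlation matrix between features (each column is one variable)
correlation_matrix = np.corrcoef(X_train, rowvar=False)
# Visualize the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', xticklabels=feature_names, yticklabels=feature_names)
plt.title('Correlation Matrix of Iris Features')
plt.show()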
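
The heatmap above only covers correlations among the features in the training split, whereas the "Class Correlation" column in the dataset description relates each feature to the target. As a rough sketch of how that column could be approximated, the snippet below computes the Pearson correlation between each feature and the numeric class label on the full dataset. The dictionary name `target_corr` is my own, and the values may differ slightly from the ones printed in the description.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Pearson correlation between each feature column and the numeric class label.
# np.corrcoef returns a 2x2 matrix for two 1-D inputs; [0, 1] is the cross term.
target_corr = {
    name: float(np.corrcoef(iris.data[:, j], iris.target)[0, 1])
    for j, name in enumerate(iris.feature_names)
}

for name, corr in target_corr.items():
    print(f"{name}: {corr:+.4f}")
```

Features with a correlation close to 1 or -1 are the ones one would expect a tree-based model to rely on most, which can be compared with the feature importance results later in the article.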

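3. Train the model and save it as a joblib file

# Create the random forest model (100 trees, fixed seed for reproducibility)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Save the model to disk
joblib.dump(rf_model, 'random_forest_model.joblib')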
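
To check that saving and reloading does not change the model, one option is a small round-trip test like the sketch below. It assumes a throwaway model with only 10 trees and a hypothetical file name `rf_roundtrip.joblib` inside a temporary directory, so it does not touch the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

iris = load_iris()
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(iris.data, iris.target)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_roundtrip.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    model.predict(iris.data), restored.predict(iris.data)
)
print(f"Predictions identical after reload: {same_predictions}")
```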

4. Load the model, predict, and report the mean squared error and R² metrics

# Load the saved model
loaded_model = joblib.load('random_forest_model.joblib')

# Predict on the test set with the loaded model
y_pred = loaded_model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 0.0013833333333333336
R-squared: 0.9980206677265501
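
For reference, both metrics can be computed by hand: the mean squared error is the average of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses a few made-up values (not taken from the Iris run above) and compares the manual results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values purely for illustration (not from the article's data).
y_true = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y_hat = np.array([0.1, 0.9, 1.8, 1.2, 0.0])

# MSE: mean of squared residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - SS_res / SS_tot, where SS_tot measures spread around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R² close to 1 means the residuals are small relative to the variance of the target, which matches the very low MSE reported above.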

5. Feature importance analysis

# Print the importance of each feature
feature_importances = loaded_model.feature_importances_
print('Feature Importances:')
for i, importance in enumerate(feature_importances):
    print(f'Feature {i+1}: {importance}')

# Visualize the feature importances as a bar chart
plt.figure(figsize=(10, 6))
sorted_idx = np.argsort(feature_importances)[::-1]  # indices in descending order of importance
plt.bar(list(range(len(feature_importances))), feature_importances[sorted_idx], align='center')
plt.xticks(list(range(len(feature_importances))),  np.array(feature_names)[sorted_idx], rotation=0)
plt.xlabel('Feature')
plt.ylabel('Importance Score')
plt.title('Feature Importance Scores')
plt.show()

Feature Importances:
Feature 1: 0.007247638926907056
Feature 2: 0.01241623468021743
Feature 3: 0.4956256973314748
Feature 4: 0.48471042906140077
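
The output above labels features only by position. A small variation, sketched below, pairs each score with its feature name and sorts by importance. It retrains the same model so the snippet runs on its own; the names `ranked` and `total` are my own. scikit-learn's impurity-based importances are normalized, so they should sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Pair each feature name with its importance and sort from highest to lowest.
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.4f}")

# Impurity-based importances are normalized to sum to 1.
total = float(np.sum(model.feature_importances_))
print(f"Sum of importances: {total:.4f}")
```

In the run shown above, the two petal features account for nearly all of the importance, in line with their high class correlation in the dataset description.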

6. Complete code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Load the example dataset and hold out 20% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# print(iris.DESCR)

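# Create the random forest model (100 trees, fixed seed for reproducibility)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Save the model to disk
joblib.dump(rf_model, 'random_forest_model.joblib')

# Load the saved model
loaded_model = joblib.load('random_forest_model.joblib')

# Predict on the test set with the loaded model
y_pred = loaded_model.predict(X_test)

# Evaluate model performance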
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')


feature_names = iris.feature_names

# Compute the correlation matrix between features (each column is one variable)
correlation_matrix = np.corrcoef(X_train, rowvar=False)
# Visualize the correlation matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', xticklabels=feature_names, yticklabels=feature_names)
plt.title('Correlation Matrix of Iris Features')
plt.show()


# Print the importance of each feature
feature_importances = loaded_model.feature_importances_
print('Feature Importances:')
for i, importance in enumerate(feature_importances):
    print(f'Feature {i+1}: {importance}')

# Visualize the feature importances as a bar chart
plt.figure(figsize=(10, 6))
sorted_idx = np.argsort(feature_importances)[::-1]  # indices in descending order of importance
plt.bar(list(range(len(feature_importances))), feature_importances[sorted_idx], align='center')
plt.xticks(list(range(len(feature_importances))),  np.array(feature_names)[sorted_idx], rotation=0)
plt.xlabel('Feature')
plt.ylabel('Importance Score')
plt.title('Feature Importance Scores')
plt.show()

As a next step, you could add hyperparameter tuning, for example grid search combined with cross-validation.
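
A minimal sketch of what such tuning might look like, using scikit-learn's `GridSearchCV` with 5-fold cross-validation. The parameter grid below is an arbitrary, deliberately small choice of mine rather than a recommendation, and the scoring metric (negative MSE) is likewise an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
}

# scikit-learn maximizes the score, so MSE is exposed as its negative.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated MSE: {-search.best_score_}")

# By default the best configuration is refit on the whole training set.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(f"Test R-squared: {test_r2}")
```

Because the search only sees the training split, the held-out test set still gives an independent estimate of performance for the selected configuration.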