Titanic Passenger Survival Prediction with Random Forest

用户昵称已存在啊a

Published 2024-08-22 09:00:00

Tags: random forest, algorithm, machine learning

Article link: https://blog.csdn.net/weixin_51395935/article/details/141403192

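Preface

This post uses a random forest to process, analyze, and predict on the Titanic dataset.

A random forest is a classifier made up of many decision trees, that is, an ensemble of decision trees. The final predicted class is the mode (the most common value) of the individual trees' predictions.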
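The voting rule can be sketched in a few lines of plain Python. This is only an illustration of "take the mode of the trees' predictions": the function name `majority_vote` and the five votes are made up for the example and do not come from the project.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    # most_common(1) returns a one-element list: [(label, count)]
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one passenger
# (1 = survived, 0 = did not survive); not real model output.
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
print(prediction)
```

As far as I know, scikit-learn's `RandomForestClassifier` averages each tree's predicted class probabilities rather than counting hard votes, so treat the sketch as the textbook rule and not as the library's exact internals.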

This project uses grid search together with cross-validation to look for the best parameters and the best score.

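I. Code Analysis

1. Data preprocessing

Loading the data, cleaning it, transforming the features, and splitting it into training and test sets all follow the same steps as the post below:

Titanic Passenger Survival Prediction with a Decision Tree (CSDN blog)

2. Grid search and cross-validation

Build a dictionary of candidate values for each parameter, then let the search iterate over every combination to find the best one.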

param_dict = {"n_estimators":[120, 200, 300,500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
rfc = GridSearchCV(rfc, param_grid=param_dict, cv=3)
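To get a feel for how much work this grid implies, the combinations can be enumerated with the standard library. The sketch below reuses `param_dict` and `cv=3` from the snippet above; the variable names `candidates`, `n_candidates`, and `n_fits` are introduced only for illustration.

```python
from itertools import product

param_dict = {
    "n_estimators": [120, 200, 300, 500, 800, 1200],
    "max_depth": [5, 8, 15, 25, 30],
}
cv = 3

# One candidate = one value picked for each parameter.
names = list(param_dict)
candidates = [dict(zip(names, values)) for values in product(*param_dict.values())]

n_candidates = len(candidates)  # 6 values * 5 values
n_fits = n_candidates * cv      # each candidate is fitted once per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate one more time on the whole training set, so the total number of forests trained is one more than `n_fits`.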

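3. Model training

After fitting and predicting, we read off the best parameters and the best result. Note that `best_score_` is the mean cross-validated accuracy of the best parameter combination on the training data, while `score` is the accuracy on the held-out test set.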

rfc = RandomForestClassifier()
# Grid search with cross-validation
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
rfc = GridSearchCV(rfc, param_grid=param_dict, cv=3)
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)
score = rfc.score(x_test, y_test)  # accuracy on the test set
print("y_pred:\n", y_pred)
print("True vs. predicted values:\n", y_test == y_pred)
print("score:\n", score)
# Best parameters, best cross-validated score, and the refitted best model
print("Best parameters:\n", rfc.best_params_)
print("Best score:\n", rfc.best_score_)
print("Best estimator:\n", rfc.best_estimator_)
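The comparison `y_test == y_pred` and the value returned by `score` are closely related: for a classifier, the score is the fraction of comparisons that come out `True`. A minimal sketch with made-up labels (the lists `y_true` and `y_hat` are hypothetical, not results from this model):

```python
# Hypothetical labels for six test passengers (not real results).
y_true = [0, 1, 1, 0, 1, 0]
y_hat = [0, 1, 0, 0, 1, 1]

# Element-wise comparison, like `y_test == y_pred` in the snippet above.
matches = [t == p for t, p in zip(y_true, y_hat)]

# Accuracy = number of correct predictions / number of predictions.
accuracy = sum(matches) / len(matches)
print(matches)
print(accuracy)
```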

II. Complete Code

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the data and pick the features and the target
T_data = pd.read_csv("titanic-data.csv")
# .copy() so the fill below works on an independent frame, not a slice of T_data
x = T_data[["Pclass", "Age", "Sex"]].copy()
y_target = T_data["Survived"]
# Fill missing ages with the mean age (assign back instead of inplace on a slice)
x["Age"] = x["Age"].fillna(x["Age"].mean())
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y_target, test_size=0.2, random_state=22)
# Convert rows to dictionaries so DictVectorizer can one-hot encode "Sex"
x_train = x_train.to_dict(orient="records")
x_test = x_test.to_dict(orient="records")
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
rfc = RandomForestClassifier()
# Grid search with cross-validation
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
rfc = GridSearchCV(rfc, param_grid=param_dict, cv=3)
rfc.fit(x_train, y_train)
y_pred = rfc.predict(x_test)
score = rfc.score(x_test, y_test)  # accuracy on the test set
print("y_pred:\n", y_pred)
print("True vs. predicted values:\n", y_test == y_pred)
print("score:\n", score)
# Best parameters, best cross-validated score, and the refitted best model
print("Best parameters:\n", rfc.best_params_)
print("Best score:\n", rfc.best_score_)
print("Best estimator:\n", rfc.best_estimator_)
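The `to_dict` plus `DictVectorizer` step is what turns the text column `Sex` into numeric columns. The sketch below imitates that idea in plain Python so the effect is visible. The helper `one_hot_records` and the two sample rows are assumptions made for illustration. This is not the library's implementation.

```python
def one_hot_records(records):
    """Encode dict records: numbers pass through, strings become 'name=value' columns."""
    # Collect the column names: one per numeric key, one per (key, string value) pair.
    columns = set()
    for row in records:
        for key, value in row.items():
            columns.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(columns)

    matrix = []
    for row in records:
        encoded = dict.fromkeys(columns, 0.0)
        for key, value in row.items():
            if isinstance(value, str):
                encoded[f"{key}={value}"] = 1.0  # mark the matching category
            else:
                encoded[key] = float(value)      # keep numeric values as they are
        matrix.append([encoded[c] for c in columns])
    return columns, matrix


# Illustrative rows only, shaped like x_train.to_dict(orient="records").
rows = [
    {"Pclass": 3, "Age": 22.0, "Sex": "male"},
    {"Pclass": 1, "Age": 38.0, "Sex": "female"},
]
columns, matrix = one_hot_records(rows)
print(columns)
print(matrix)
```

In the real code the columns are learned once by `fit_transform` on the training set and reused by `transform` on the test set, which is why the test set is only transformed and never fitted.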

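III. Grid Search and Cross-Validation in Detail

1. Grid search

Grid search is an exhaustive search: it walks through every combination of the candidate parameter values, evaluates each combination, and keeps the one that performs best.

Define the dictionary of candidate parameter values and pass it to the search: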

GridSearchCV(rfc, param_grid=param_dict, cv=3)
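The exhaustive loop itself is short when written by hand. In the sketch below, `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train a model and measure it" so the example runs instantly. It says nothing about which parameters are actually best for the Titanic data.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # strict '>' keeps the first of any tied candidates
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up scoring function that peaks at n_estimators=300, max_depth=8."""
    return -abs(params["n_estimators"] - 300) / 1000 - abs(params["max_depth"] - 8) / 100


grid = {"n_estimators": [120, 200, 300, 500], "max_depth": [5, 8, 15]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Because only the listed values are ever tried, the result is the best point on the grid. A better setting that lies between or outside the listed values would not be found.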

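2. Cross-validation

Cross-validation is a way of evaluating a model: the data is divided into several subsets, and different subsets are used for training and testing in turn to estimate how well the model generalizes.

The data is split into k subsets; k-1 of them are used for training and the remaining one for testing. This is repeated k times, each time with a different subset as the test set. The average of the k test results is the final performance estimate.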
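The k-fold procedure described above can be sketched without any library. The helpers `k_fold_indices` and `cross_val_mean` are hypothetical names written for this illustration, and the folds here are plain contiguous blocks with no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_mean(n_samples, k, evaluate):
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Seven samples split into three folds: every sample is tested exactly once.
folds = list(k_fold_indices(7, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

For a classifier with an integer `cv`, scikit-learn uses stratified folds that keep the class proportions similar in each fold, so the actual splits made by `GridSearchCV(..., cv=3)` will differ from these simple blocks.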

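The `GridSearchCV` class in sklearn combines grid search and cross-validation: every parameter combination in the grid is scored by cross-validation.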

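Summary

Used together, the two methods find the best parameter combination among the candidates listed in the grid, which is not necessarily a global optimum. Because the search is exhaustive, the computation is heavy and training can take a long time: the grid here has 6 × 5 = 30 combinations, and each one is fitted 3 times.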
