Decision Trees and Random Forests: A Titanic Survival Prediction Example

蒋含竹

Last modified: 2022-09-14 18:09:27

First published: 2019-03-05 11:29:38

Article link: https://blog.csdn.net/alionsss/article/details/88173945

1. Imports

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Drawing the tree requires the python-graphviz package
import graphviz

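2. Raw data

# Fetching the data over the network may take a while.
# If this URL is unavailable, the dataset can be downloaded from Kaggle: https://www.kaggle.com/competitions/titanic/data
# (the Kaggle CSV uses different column names, e.g. Survived, Pclass, Sex, Age)
titanic = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
titanic

Figure: sample of the raw data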
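If the data comes from Kaggle instead, the columns need to be adapted before the rest of the article runs unchanged. The following is a minimal sketch under stated assumptions: the three rows are made up for illustration, and the `1st`/`2nd`/`3rd` labels are chosen to match the `pclass=1st` style feature names used later in this article.

```python
import pandas as pd

# Hypothetical rows in the Kaggle train.csv column layout (values are invented).
kaggle = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

# Rename to the lowercase column names used in the rest of this article.
titanic_like = kaggle.rename(columns={
    "Survived": "survived",
    "Pclass": "pclass",
    "Sex": "sex",
    "Age": "age",
})

# Convert the integer class to a string label so that DictVectorizer
# one-hot encodes it instead of treating it as a number.
titanic_like["pclass"] = titanic_like["pclass"].map({1: "1st", 2: "2nd", 3: "3rd"})
print(titanic_like)
```

With a real Kaggle file, `kaggle` would instead come from `pd.read_csv` on the downloaded `train.csv`.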

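3. Data preprocessing

# Extract features and labels (copy so that the fill below modifies an independent DataFrame)
X = titanic[["pclass", "age", "sex"]].copy()
y = titanic["survived"]
print(X)

# Fill missing ages with the mean age
X["age"] = X["age"].fillna(X["age"].mean())

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

# One-hot encode the categorical features
# (named vec rather than dict to avoid shadowing the built-in dict)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient="records"))
X_test = vec.transform(X_test.to_dict(orient="records"))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
print(vec.get_feature_names_out())
print(X_train)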
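To make the encoding step concrete, here is a small self-contained sketch of what `DictVectorizer` does to records of this shape. The two passenger records are invented for illustration; only the column names follow the article.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up passenger records with the same keys as the article's features.
records = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 40.0, "sex": "male"},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

# String-valued features become one "name=value" column each;
# numeric features such as age are passed through unchanged.
feature_names = list(vec.feature_names_)
print(feature_names)
print(encoded)
```

The column order is alphabetical by feature name, which is why `age` comes first in the feature list passed to the tree plot later on.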

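4. Using a decision tree

4.1 Building the decision tree model

# criterion -> "gini" or "entropy"
# max_depth -> maximum depth of the tree
# min_samples_split -> a node can be split only if it holds at least this many samples
# min_samples_leaf -> minimum number of samples required in each leaf node
dtf = DecisionTreeClassifier(criterion="entropy", max_depth=5, min_samples_split=3, min_samples_leaf=1)
dtf.fit(X_train, y_train)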
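The `criterion` parameter selects the impurity measure used to evaluate candidate splits. As a rough illustration of the two measures (plain Python written for this note, not scikit-learn's internal implementation), both can be computed directly from a list of labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of the class distribution in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

mixed = [0, 0, 1, 1]  # evenly mixed node: maximum impurity for two classes
pure = [1, 1, 1, 1]   # pure node: no impurity

print(entropy(mixed), gini(mixed))
print(entropy(pure), gini(pure))
```

Both measures are largest for an evenly mixed node and zero for a pure one; a split is preferred when it reduces the weighted impurity of the child nodes the most.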

4.2 Prediction and evaluation

print("Actual:", y_test)
print("Predicted:", dtf.predict(X_test))

# Accuracy on the test set
print(dtf.score(X_test, y_test))

# Draw the decision tree
# Drawing requires the python-graphviz package
dot_data = export_graphviz(
    dtf,
    feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male'],
    # class_names must follow ascending class order: 0 = died, 1 = survived
    class_names=["died", "survived"],
    filled=True,
    rounded=True
)
graph = graphviz.Source(dot_data)
graph
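When graphviz is not installed, a fitted tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below uses a tiny invented dataset (two columns, `age` and `sex=female`), so the resulting tree is illustrative only and says nothing about the Titanic data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): columns are [age, sex=female]; label 1 = survived.
X_toy = [[10, 1], [30, 1], [50, 1], [12, 0], [35, 0], [60, 0]]
y_toy = [1, 1, 1, 0, 0, 0]

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text renders the split rules as an indented text report.
report = export_text(toy_tree, feature_names=["age", "sex=female"])
print(report)
```

The same call should work on the article's `dtf` model, given the six feature names listed in the plotting code.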

4.3 Plotting the learning curve

import matplotlib.pyplot as plt

# Test-set accuracy for max_depth = 1..10
scores = []
for i in range(10):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=i+1, min_samples_split=3, min_samples_leaf=1, random_state=10)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    scores.append(score)

plt.plot(range(1, 11), scores, color='b', label='test accuracy')
plt.xlabel('max_depth')
plt.legend()
plt.show()
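Choosing `max_depth` from test-set scores, as in the loop above, lets the test set influence the model. One common alternative is to compare depths by cross-validation on the training data only. The following is a sketch of that idea, assuming synthetic data from `make_classification` as a stand-in for the encoded Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption: not the Titanic set).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

depths = range(1, 11)
cv_means = []
for depth in depths:
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=10)
    # Mean accuracy over 5 folds of the training data.
    cv_means.append(cross_val_score(clf, X_demo, y_demo, cv=5).mean())

best_depth = depths[cv_means.index(max(cv_means))]
print(best_depth, max(cv_means))
```

The test set can then be used once, at the end, to report the accuracy of the chosen depth.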

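5. Using a random forest

5.1 Building the random forest model

# A random forest accepts the same tree parameters as a decision tree, and they can be tuned the same way
# n_estimators -> number of trees in the forest
srfc = RandomForestClassifier(n_estimators=200, max_depth=5)
srfc.fit(X_train, y_train)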
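Beyond the tree parameters, a fitted forest exposes some useful diagnostics. As a hedged illustration on synthetic data (not the Titanic set), the sketch below enables `oob_score` to get an accuracy estimate from the samples each tree did not see, and prints the per-feature importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: used only to keep the example self-contained).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200, max_depth=5, oob_score=True, random_state=0
)
forest.fit(X_demo, y_demo)

# Out-of-bag accuracy: each sample is scored only by trees that did not train on it.
print(forest.oob_score_)
# Impurity-based importances, one value per feature, summing to 1.
print(forest.feature_importances_)
```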

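5.2 Prediction and evaluation

# Accuracy on the test set
print(srfc.score(X_test, y_test))

# Predictions for the first 10 test samples, followed by the actual labels
print(srfc.predict(X_test)[0:10])
print(y_test[0:10])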

5.3 Grid search with cross-validation

rfc = RandomForestClassifier()
param_grid = {
    "n_estimators": [120, 200, 360, 450, 500],
    "max_depth": [5, 9, 18, 27, 36]
}
gscv = GridSearchCV(rfc, param_grid=param_grid, cv=5)
gscv.fit(X_train, y_train)

# Test-set accuracy of the best model found by the search
print(gscv.score(X_test, y_test))

# Best parameters
print(gscv.best_params_)

Figure: scores
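Note that `gscv.score(X_test, y_test)` reports test-set accuracy, while the cross-validated score that drove the selection is stored in `best_score_`, and the full table of results in `cv_results_`. A minimal sketch of reading them, assuming synthetic data and a deliberately small grid so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a reduced grid (assumptions made for speed).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, 5]},
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# Mean cross-validated accuracy of the best parameter combination.
print(search.best_score_)

# Mean cross-validated accuracy for every combination that was tried.
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, round(mean, 3))
```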
