ML Study Notes-2021-08-26-Classification Algorithms-Random Forest

燥栋

Category: ML. Tags: machine learning

Article link: https://blog.csdn.net/qq_45363979/article/details/119930423

3.6 Ensemble Learning Methods: Random Forest

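Objectives
Describe how each decision tree in a random forest is built
Understand why random sampling with replacement (bootstrap) is needed
Describe the hyperparameters of a random forest
Application
Titanic survival prediction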

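3.6.1 What Is an Ensemble Learning Method?

An ensemble learning method solves a single prediction problem by building several models and combining their predictions. Each model learns and predicts on its own, and the combined prediction is usually better than that of any single model.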

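3.6.2 What Is a Random Forest?

Random: each tree is trained on a randomly drawn training set and a randomly drawn set of features (see 3.6.3).
Forest: a collection of decision trees. The class predicted by the forest is the one that the majority of trees vote for.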
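
The voting idea can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it; the function name majority_vote and the five votes are made up for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    tree_predictions: one predicted label per tree, for a single sample.
    On a tie, the label that was seen first wins (Counter ordering).
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one passenger (1 = survived, 0 = died)
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # prints 1: three of the five trees voted for class 1
```

Note that scikit-learn's RandomForestClassifier averages the class probabilities predicted by the trees instead of counting hard votes, so its result can differ from this sketch when the trees are unsure.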

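3.6.3 How a Random Forest Works

Training set: N samples, each described by M features and a target value.
Randomness enters at two levels:

Feature randomness: each tree draws m of the M features at random, with M >> m. This works like dimensionality reduction, and because the trees see different features, their mistakes tend to differ, so the correct result stands out in the vote.
Training-set randomness: each tree draws N samples at random, with replacement, from the N training samples. Growing many different trees requires data sets with different distributions, and sampling with replacement provides them.
Bootstrap (random sampling with replacement): [1,2,3,4,5] -> [2,2,3,1,5]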
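
The two kinds of randomness can be sketched with the standard library alone. This is an illustration under assumptions: the helper names bootstrap_sample and random_feature_subset, the seed, and the sizes (10 features, 3 drawn per tree) are invented for the example.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items at random WITH replacement."""
    return [rng.choice(samples) for _ in samples]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices out of n_features (no repeats)."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [1, 2, 3, 4, 5]

# Each tree gets its own bootstrap sample and its own feature subset
for tree_id in range(3):
    rows = bootstrap_sample(samples, rng)
    cols = random_feature_subset(n_features=10, m=3, rng=rng)
    print(tree_id, rows, cols)
```

A bootstrap sample has the same size as the original set, but some samples appear more than once and others are left out, which is what makes the trees differ. In scikit-learn the feature subset is redrawn at every split (the max_features parameter) rather than once per tree as in this sketch.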

3.6.4 API

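The estimator used in this note is sklearn.ensemble.RandomForestClassifier. The hyperparameters that matter here are:
n_estimators: the number of trees in the forest
max_depth: the maximum depth of each tree (default None, meaning no limit)
bootstrap: whether each tree is built from a sample drawn with replacement (default True)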
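
As a rough sketch of how these hyperparameters are passed, the example below fits a small forest on synthetic data. The data set from make_classification and all parameter values are assumptions chosen for illustration, not settings taken from this note.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    criterion="gini",     # measure of split quality
    max_depth=5,          # maximum depth of each tree (None = no limit)
    max_features="sqrt",  # number of features considered at each split
    bootstrap=True,       # sample the training set with replacement per tree
    random_state=0,       # make the run repeatable
)
forest.fit(X, y)

print(len(forest.estimators_))  # one fitted tree per n_estimators
print(forest.predict(X[:5]))    # predicted classes for the first five samples
```

More trees generally make the prediction more stable at the cost of training time, which is why n_estimators and max_depth are the usual candidates for a grid search.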

3.6.5 Random Forest Case Study

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
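"""
A random forest is a set of decision trees; the final prediction is the class
that the majority of trees vote for.
Randomness comes in two ways:
1. Each tree is trained on a different training set.
2. The features are assigned at random: each tree draws some features
   from the full feature set.
"""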


def load_data():
    # The path is relative to this script; adjust it to where titanic.csv lives.
    titanic = pd.read_csv("../../resources/titanic/titanic.csv")

    # Option 1: drop the rows whose age is missing (accuracy was slightly higher).
    # dropna replaces the row-by-row DataFrame.append loop, which was removed in pandas 2.0.
    real_data = titanic[["pclass", "age", "sex", "survived"]].dropna(subset=["age"])
    x = real_data[["pclass", "age", "sex"]].to_dict(orient="records")
    y = real_data["survived"]

    # Option 2: keep all rows and fill the missing ages with the mean age
    # x = titanic[["pclass", "age", "sex"]].copy()  # use only these features
    # y = titanic["survived"]  # target values
    # x["age"] = x["age"].fillna(x["age"].mean())
    # x = x.to_dict(orient="records")

    x_train, x_test, y_train, y_test = train_test_split(x, y.astype("int"), random_state=22)
    return x_train, x_test, y_train, y_test


def titanic_random_test():
    x_train, x_test, y_train, y_test = load_data()

    # Turn the dict records into a feature matrix (string values are one-hot encoded)
    transfer = DictVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)

    # bootstrap defaults to True, so each tree is trained on a sample drawn with replacement
    estimator = RandomForestClassifier()

    # Grid search with 3-fold cross-validation over the number of trees and the tree depth
    param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200],
                  "max_depth": [5, 8, 15, 25, 30]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
    estimator.fit(x_train, y_train)  # training features and target values

    # Predict on the test set with the fitted estimator
    y_predict = estimator.predict(x_test)
    print("Predicted:", y_predict, "\nActual:", y_test, "\nMatch:", y_test == y_predict)
    # Score on the held-out test set; scoring on the training set overstates accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy: ", score)
    # ------------------
    print("Best parameters:\n", estimator.best_params_)
    print("Best cross-validation score:\n", estimator.best_score_)
    print("Best estimator:\n", estimator.best_estimator_)
    print("Cross-validation results:\n", estimator.cv_results_)

    return None


if __name__ == '__main__':
    titanic_random_test()

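3.6.6 Summary

Random forests are among the most accurate of the commonly used classification algorithms.
They run efficiently on large data sets and handle input samples with high-dimensional features without a separate dimensionality-reduction step.
They can estimate how important each feature is for the classification.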
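
The last point can be sketched with the feature_importances_ attribute of a fitted RandomForestClassifier. The synthetic data set below is an assumption made for illustration; it is built so that only the first two columns carry information about the class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; shuffle=False keeps the 2 informative features in columns 0 and 1
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One impurity-based score per feature; the scores sum to 1
for index, importance in enumerate(forest.feature_importances_):
    print(f"feature {index}: {importance:.3f}")
```

These scores are computed on the training data and tend to favor features with many distinct values, so they are best read as a rough ranking.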
