3.6 Ensemble Learning Methods: Random Forest

3.6.1 What Is an Ensemble Learning Method

Ensemble learning solves a single prediction problem by building and combining several models. It works by training multiple classifiers/models, each of which learns and makes predictions independently. These predictions are then combined into one aggregate prediction, which outperforms the prediction of any individual classifier.

3.6.2 What Is a Random Forest

In machine learning, a random forest is a classifier made up of multiple decision trees, whose output class is the mode of the classes output by the individual trees.
For example, if you train 5 trees and 4 of them output True while 1 outputs False, the final vote result is True.
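The majority vote described above can be sketched in a few lines of Python (the five tree predictions here are hypothetical):

```python
# Majority vote over the outputs of individual trees.
from collections import Counter

tree_votes = [True, True, True, True, False]  # 4 trees say True, 1 says False
winner, count = Counter(tree_votes).most_common(1)[0]
print(winner, count)  # True 4
```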

3.6.3 How a Random Forest Is Built

  • "Random" refers to two things
    o Random features
    o Random training set

The learning algorithm builds each tree as follows:

  • Let N denote the number of training examples (samples) and M the number of features.
    1. [Random training set] Randomly draw one sample at a time, repeated N times (the same sample may appear more than once).
    2. [Random features] Randomly select m of the M features, with m << M → dimensionality reduction.
  • Use bootstrap sampling (random sampling with replacement):
    Suppose there are five samples in total: [1, 2, 3, 4, 5].
    To build a new tree's training set, first draw one sample, say 2: [2].
    Put 2 back and draw again; you might draw 2 again: [2, 2].
    Put 2 back again; you might draw 3 next, and so on. The resulting new dataset might be [2, 2, 3, 1, 5].
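The bootstrap draw just described can be sketched with the standard library (the sample values mirror the [1, 2, 3, 4, 5] example; which values are drawn is random):

```python
import random

random.seed(0)  # fixed seed so the draw is reproducible
samples = [1, 2, 3, 4, 5]                            # N = 5 training samples
bootstrap = random.choices(samples, k=len(samples))  # N draws WITH replacement
print(bootstrap)  # same length as the original set; duplicates may appear
```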

Why use bootstrap sampling?

  • Why randomly sample the training set?
    If the training set were not randomly sampled, every tree would be trained on the same data, so the trained trees would all produce exactly the same classification results.
  • Why sample with replacement?
    If the sampling were done without replacement, the trees' training samples would all be different and mutually disjoint, making every tree "biased" and thoroughly "one-sided" (admittedly an imprecise way to put it); in other words, the trained trees would differ greatly from one another. Yet the random forest's final classification depends on the majority vote of many trees (weak classifiers).

3.6.4 API

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)

  • Random forest classifier
  • n_estimators: integer, optional (default = 10). The number of trees in the forest,
    e.g. 120, 200, 300, 500, 800, 1200
  • criterion: string, optional (default = "gini"). The function to measure the quality of a split
  • max_depth: integer or None, optional (default = None). The maximum depth of the tree, e.g. 5, 8, 15, 25, 30
  • max_features="auto": the maximum number of features each decision tree considers
    o If "auto", then max_features=sqrt(n_features).
    o If "sqrt", then max_features=sqrt(n_features) (same as "auto").
    o If "log2", then max_features=log2(n_features).
    o If None, then max_features=n_features.
  • bootstrap: boolean, optional (default = True). Whether to use sampling with replacement when building trees
  • min_samples_split: the minimum number of samples required to split a node
  • min_samples_leaf: the minimum number of samples required at a leaf node

Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
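As a small illustration, the max_features options listed above work out as follows on a hypothetical dataset with 100 features:

```python
import math

n_features = 100
print(int(math.sqrt(n_features)))  # "auto" / "sqrt": sqrt(100) = 10 features per split
print(int(math.log2(n_features)))  # "log2": floor(log2(100)) = 6 features per split
print(n_features)                  # None: consider all 100 features
```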

3.6.5 Random Forest Prediction Example

  • Instantiate the random forest
 # use a random forest for prediction
 rf = RandomForestClassifier() 
  • Define the lists of hyperparameter candidates
 param = {"n_estimators": [120,200,300,500,800,1200], "max_depth": [5, 8, 15, 25, 30]} 
  • Use GridSearchCV for a grid search
 # hyperparameter tuning
 gc = GridSearchCV(rf, param_grid=param, cv=2) 
 
 gc.fit(x_train, y_train) 
 
 print("Random forest prediction accuracy:", gc.score(x_test, y_test)) 

Complete code:

import pandas as pd
titanic = pd.read_csv("train.csv")

# Select features and target
x = titanic[["Pclass", "Age", "Sex"]].copy()  # copy to avoid a SettingWithCopyWarning
y = titanic["Survived"]

# Fill missing ages with the mean age
x["Age"].fillna(x["Age"].mean(), inplace=True)

# Convert to a list of dicts for DictVectorizer
x = x.to_dict(orient="records")

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

from sklearn.feature_extraction import DictVectorizer
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # only transform the test set; never fit on it

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

estimator = RandomForestClassifier()
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)

y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)

score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)
print("Best parameters:\n", estimator.best_params_)
print("Best score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)

Output:

y_predict:
 [1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0
 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0
 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 1
 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1
 1]
Direct comparison of true and predicted values:
 816    False
789     True
869    False
235    False
473     True
       ...  
174     True
723     True
350     True
399     True
194     True
Name: Survived, Length: 223, dtype: bool
Accuracy:
 0.7668161434977578
Best parameters:
 {'max_depth': 5, 'n_estimators': 500}
Best score:
 0.8159011028966185
Best estimator:
 RandomForestClassifier(max_depth=5, n_estimators=500)
Cross-validation results:
 {'mean_fit_time': array([0.14382346, 0.24311701, 0.35549331, 0.59480373, 1.03430533,
       1.58348234, 0.15034223, 0.29175345, 0.43171024, 0.69643927,
       1.08060575, 1.66969768, 0.15859906, 0.26394582, 0.3909564 ,
       0.66351326, 1.05977241, 1.57854199, 0.15739274, 0.26667706,
       0.39561955, 0.66135383, 1.06135798, 1.68801379, 0.16833258,
       0.28169401, 0.42419593, 0.68171875, 1.11704437, 1.65046819]), 'std_fit_time': array([0.00460574, 0.00457774, 0.00560244, 0.0237492 , 0.0693016 ,
       0.08063175, 0.00088036, 0.00196045, 0.0299682 , 0.02767114,
       0.0185797 , 0.10037114, 0.00095291, 0.00313476, 0.0080378 ,
       0.00314837, 0.0085277 , 0.02675928, 0.00138337, 0.00443381,
       0.00621335, 0.00857361, 0.00366948, 0.01269344, 0.00834818,
       0.01879558, 0.00185441, 0.00884583, 0.02930271, 0.04502446]), 'mean_score_time': array([0.01224462, 0.01972437, 0.03063742, 0.04909714, 0.08723656,
       0.12680173, 0.01283073, 0.02806497, 0.03263768, 0.05937211,
       0.08187675, 0.11716437, 0.01212867, 0.01990636, 0.04246132,
       0.04946303, 0.07848565, 0.11714268, 0.01236963, 0.01902437,
       0.02998249, 0.05101713, 0.07680162, 0.13931696, 0.01260932,
       0.02083095, 0.03526004, 0.04966497, 0.08392723, 0.13167572]), 'std_score_time': array([3.22815019e-04, 3.64044690e-05, 2.88254653e-03, 2.09637893e-03,
       1.01517426e-02, 1.36090891e-02, 4.62676711e-04, 9.25492250e-03,
       1.91093686e-03, 8.79019308e-03, 4.33523386e-03, 2.97017526e-03,
       9.89005296e-04, 3.20695727e-04, 1.92351699e-02, 4.18609844e-04,
       1.02999229e-03, 1.78868124e-03, 5.41567474e-04, 1.05280353e-03,
       8.46999943e-04, 1.43264628e-03, 2.46888252e-03, 2.15878994e-03,
       9.90785315e-04, 8.04257087e-04, 4.44009255e-03, 6.22252212e-04,
       5.09247981e-03, 9.19640998e-03]), 'param_max_depth': masked_array(data=[5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 15, 15, 15, 15, 15,
                   15, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_n_estimators': masked_array(data=[120, 200, 300, 500, 800, 1200, 120, 200, 300, 500, 800,
                   1200, 120, 200, 300, 500, 800, 1200, 120, 200, 300,
                   500, 800, 1200, 120, 200, 300, 500, 800, 1200],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'max_depth': 5, 'n_estimators': 120}, {'max_depth': 5, 'n_estimators': 200}, {'max_depth': 5, 'n_estimators': 300}, {'max_depth': 5, 'n_estimators': 500}, {'max_depth': 5, 'n_estimators': 800}, {'max_depth': 5, 'n_estimators': 1200}, {'max_depth': 8, 'n_estimators': 120}, {'max_depth': 8, 'n_estimators': 200}, {'max_depth': 8, 'n_estimators': 300}, {'max_depth': 8, 'n_estimators': 500}, {'max_depth': 8, 'n_estimators': 800}, {'max_depth': 8, 'n_estimators': 1200}, {'max_depth': 15, 'n_estimators': 120}, {'max_depth': 15, 'n_estimators': 200}, {'max_depth': 15, 'n_estimators': 300}, {'max_depth': 15, 'n_estimators': 500}, {'max_depth': 15, 'n_estimators': 800}, {'max_depth': 15, 'n_estimators': 1200}, {'max_depth': 25, 'n_estimators': 120}, {'max_depth': 25, 'n_estimators': 200}, {'max_depth': 25, 'n_estimators': 300}, {'max_depth': 25, 'n_estimators': 500}, {'max_depth': 25, 'n_estimators': 800}, {'max_depth': 25, 'n_estimators': 1200}, {'max_depth': 30, 'n_estimators': 120}, {'max_depth': 30, 'n_estimators': 200}, {'max_depth': 30, 'n_estimators': 300}, {'max_depth': 30, 'n_estimators': 500}, {'max_depth': 30, 'n_estimators': 800}, {'max_depth': 30, 'n_estimators': 1200}], 'split0_test_score': array([0.77578475, 0.79820628, 0.78475336, 0.79820628, 0.79820628,
       0.79820628, 0.75784753, 0.75784753, 0.76681614, 0.76233184,
       0.75784753, 0.75784753, 0.78026906, 0.78026906, 0.78475336,
       0.78475336, 0.78475336, 0.78026906, 0.78026906, 0.78475336,
       0.78026906, 0.78475336, 0.78475336, 0.78475336, 0.78475336,
       0.78475336, 0.78475336, 0.78475336, 0.78475336, 0.78026906]), 'split1_test_score': array([0.81165919, 0.80717489, 0.80717489, 0.81165919, 0.81165919,
       0.81165919, 0.78923767, 0.80717489, 0.80269058, 0.78923767,
       0.79372197, 0.79372197, 0.78026906, 0.78026906, 0.78026906,
       0.78475336, 0.78026906, 0.78026906, 0.78475336, 0.78026906,
       0.78026906, 0.78026906, 0.78026906, 0.78026906, 0.78475336,
       0.78026906, 0.78026906, 0.78475336, 0.78026906, 0.78026906]), 'split2_test_score': array([0.83783784, 0.83333333, 0.83783784, 0.83783784, 0.83783784,
       0.83783784, 0.85135135, 0.85135135, 0.85135135, 0.85135135,
       0.84684685, 0.84684685, 0.84234234, 0.82432432, 0.84234234,
       0.82882883, 0.82432432, 0.84234234, 0.82432432, 0.84684685,
       0.82882883, 0.84234234, 0.84234234, 0.84684685, 0.84684685,
       0.81081081, 0.84684685, 0.82882883, 0.81081081, 0.82432432]), 'mean_test_score': array([0.80842726, 0.81290483, 0.80992203, 0.8159011 , 0.8159011 ,
       0.8159011 , 0.79947885, 0.80545792, 0.80695269, 0.80097362,
       0.79947212, 0.79947212, 0.80096015, 0.79495415, 0.80245492,
       0.79944519, 0.79644892, 0.80096015, 0.79644892, 0.80395642,
       0.79645565, 0.80245492, 0.80245492, 0.80395642, 0.80545119,
       0.79194441, 0.80395642, 0.79944519, 0.79194441, 0.79495415]), 'std_test_score': array([0.02543594, 0.01490194, 0.02175853, 0.0164552 , 0.0164552 ,
       0.0164552 , 0.03885359, 0.03819208, 0.0346427 , 0.0372775 ,
       0.03656061, 0.03656061, 0.02926163, 0.02076785, 0.02826402,
       0.02077737, 0.01979572, 0.02926163, 0.01979572, 0.03038331,
       0.0228913 , 0.02826402, 0.02826402, 0.03038331, 0.02927115,
       0.01346559, 0.03038331, 0.02077737, 0.01346559, 0.02076785]), 'rank_test_score': array([ 6,  4,  5,  1,  1,  1, 19,  8,  7, 16, 20, 20, 17, 27, 13, 22, 25,
       17, 25, 10, 24, 13, 13, 10,  9, 29, 10, 22, 29, 27], dtype=int32)}

3.6.6 Summary

  • Achieves excellent accuracy among current algorithms
  • Runs efficiently on large datasets and handles input samples with high-dimensional features, without requiring dimensionality reduction
  • Can assess the importance of each feature for the classification problem
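As a minimal sketch of the last point, a fitted forest exposes feature_importances_ (the tiny dataset below is made up purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: feature 0 perfectly predicts the label, feature 1 is mostly noise
X = [[0, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1, 0, 1]

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)  # one score per feature; the scores sum to 1
```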