3.6 Ensemble Learning Methods: Random Forest

3.6.1 What Is an Ensemble Learning Method

Ensemble learning solves a single prediction problem by building and combining several models. It works by training multiple classifiers/models, each of which learns and makes predictions independently. These predictions are then combined into one aggregate prediction, which outperforms the prediction of any individual classifier.

3.6.2 What Is a Random Forest

In machine learning, a random forest is a classifier made up of multiple decision trees, whose output class is the mode of the classes output by the individual trees.
For example, if you train 5 trees and 4 of them output True while 1 outputs False, the final vote result is True.
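The majority vote described above can be sketched in a few lines of Python (the five tree predictions here are hypothetical):

```python
# Majority vote over the outputs of individual trees.
from collections import Counter

tree_votes = [True, True, True, True, False]  # 4 trees say True, 1 says False
winner, count = Counter(tree_votes).most_common(1)[0]
print(winner, count)  # True 4
```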

3.6.3 How a Random Forest Is Built

  • "Random" refers to two things
    o Random features
    o Random training set

The learning algorithm builds each tree as follows:

  • Let N denote the number of training examples (samples) and M the number of features.
    1. [Random training set] Randomly draw one sample at a time, repeated N times (the same sample may appear more than once).
    2. [Random features] Randomly select m of the M features, with m << M → dimensionality reduction.
  • Use bootstrap sampling (random sampling with replacement):
    Suppose there are five samples in total: [1, 2, 3, 4, 5].
    To build a new tree's training set, first draw one sample, say 2: [2].
    Put 2 back and draw again; you might draw 2 again: [2, 2].
    Put 2 back again; you might draw 3 next, and so on. The resulting new dataset might be [2, 2, 3, 1, 5].
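The bootstrap draw just described can be sketched with the standard library (the sample values mirror the [1, 2, 3, 4, 5] example; which values are drawn is random):

```python
import random

random.seed(0)  # fixed seed so the draw is reproducible
samples = [1, 2, 3, 4, 5]                            # N = 5 training samples
bootstrap = random.choices(samples, k=len(samples))  # N draws WITH replacement
print(bootstrap)  # same length as the original set; duplicates may appear
```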

Why use bootstrap sampling?

  • Why randomly sample the training set?
    If the training set were not randomly sampled, every tree would be trained on the same data, so the trained trees would all produce exactly the same classification results.
  • Why sample with replacement?
    If the sampling were done without replacement, the trees' training samples would all be different and mutually disjoint, making every tree "biased" and thoroughly "one-sided" (admittedly an imprecise way to put it); in other words, the trained trees would differ greatly from one another. Yet the random forest's final classification depends on the majority vote of many trees (weak classifiers).

3.6.4 API

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, bootstrap=True, random_state=None, min_samples_split=2)

  • Random forest classifier
  • n_estimators: integer, optional (default = 10). The number of trees in the forest,
    e.g. 120, 200, 300, 500, 800, 1200
  • criterion: string, optional (default = "gini"). The function to measure the quality of a split
  • max_depth: integer or None, optional (default = None). The maximum depth of the tree, e.g. 5, 8, 15, 25, 30
  • max_features="auto": the maximum number of features each decision tree considers
    o If "auto", then max_features=sqrt(n_features).
    o If "sqrt", then max_features=sqrt(n_features) (same as "auto").
    o If "log2", then max_features=log2(n_features).
    o If None, then max_features=n_features.
  • bootstrap: boolean, optional (default = True). Whether to use sampling with replacement when building trees
  • min_samples_split: the minimum number of samples required to split a node
  • min_samples_leaf: the minimum number of samples required at a leaf node

Hyperparameters: n_estimators, max_depth, min_samples_split, min_samples_leaf
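As a small illustration, the max_features options listed above work out as follows on a hypothetical dataset with 100 features:

```python
import math

n_features = 100
print(int(math.sqrt(n_features)))  # "auto" / "sqrt": sqrt(100) = 10 features per split
print(int(math.log2(n_features)))  # "log2": floor(log2(100)) = 6 features per split
print(n_features)                  # None: consider all 100 features
```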

3.6.5 Random Forest Prediction Example

  • Instantiate the random forest
 # use a random forest for prediction
 rf = RandomForestClassifier() 
  • Define the lists of hyperparameter candidates
 param = {"n_estimators": [120,200,300,500,800,1200], "max_depth": [5, 8, 15, 25, 30]} 
  • Use GridSearchCV for a grid search
 # hyperparameter tuning
 gc = GridSearchCV(rf, param_grid=param, cv=2) 
 
 gc.fit(x_train, y_train) 
 
 print("Random forest prediction accuracy:", gc.score(x_test, y_test)) 

Complete code:

import pandas as pd
titanic = pd.read_csv("train.csv")

# Select features and target
x = titanic[["Pclass", "Age", "Sex"]].copy()  # copy to avoid a SettingWithCopyWarning
y = titanic["Survived"]

# Fill missing ages with the mean age
x["Age"].fillna(x["Age"].mean(), inplace=True)

# Convert to a list of dicts for DictVectorizer
x = x.to_dict(orient="records")

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

from sklearn.feature_extraction import DictVectorizer
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)  # only transform the test set; never fit on it

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

estimator = RandomForestClassifier()
param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)

y_predict = estimator.predict(x_test)
print("y_predict:\n", y_predict)
print("Direct comparison of true and predicted values:\n", y_test == y_predict)

score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)
print("Best parameters:\n", estimator.best_params_)
print("Best score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)

Output:

y_predict:
 [1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0
 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 0
 0 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 1
 0 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 0 0
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1
 1]
Direct comparison of true and predicted values:
 816    False
789     True
869    False
235    False
473     True
       ...  
174     True
723     True
350     True
399     True
194     True
Name: Survived, Length: 223, dtype: bool
Accuracy:
 0.7668161434977578
Best parameters:
 {'max_depth': 5, 'n_estimators': 500}
Best score:
 0.8159011028966185
Best estimator:
 RandomForestClassifier(max_depth=5, n_estimators=500)
Cross-validation results:
 {'mean_fit_time': array([0.14382346, 0.24311701, 0.35549331, 0.59480373, 1.03430533,
       1.58348234, 0.15034223, 0.29175345, 0.43171024, 0.69643927,
       1.08060575, 1.66969768, 0.15859906, 0.26394582, 0.3909564 ,
       0.66351326, 1.05977241, 1.57854199, 0.15739274, 0.26667706,
       0.39561955, 0.66135383, 1.06135798, 1.68801379, 0.16833258,
       0.28169401, 0.42419593, 0.68171875, 1.11704437, 1.65046819]), 'std_fit_time': array([0.00460574, 0.00457774, 0.00560244, 0.0237492 , 0.0693016 ,
       0.08063175, 0.00088036, 0.00196045, 0.0299682 , 0.02767114,
       0.0185797 , 0.10037114, 0.00095291, 0.00313476, 0.0080378 ,
       0.00314837, 0.0085277 , 0.02675928, 0.00138337, 0.00443381,
       0.00621335, 0.00857361, 0.00366948, 0.01269344, 0.00834818,
       0.01879558, 0.00185441, 0.00884583, 0.02930271, 0.04502446]), 'mean_score_time': array([0.01224462, 0.01972437, 0.03063742, 0.04909714, 0.08723656,
       0.12680173, 0.01283073, 0.02806497, 0.03263768, 0.05937211,
       0.08187675, 0.11716437, 0.01212867, 0.01990636, 0.04246132,
       0.04946303, 0.07848565, 0.11714268, 0.01236963, 0.01902437,
       0.02998249, 0.05101713, 0.07680162, 0.13931696, 0.01260932,
       0.02083095, 0.03526004, 0.04966497, 0.08392723, 0.13167572]), 'std_score_time': array([3.22815019e-04, 3.64044690e-05, 2.88254653e-03, 2.09637893e-03,
       1.01517426e-02, 1.36090891e-02, 4.62676711e-04, 9.25492250e-03,
       1.91093686e-03, 8.79019308e-03, 4.33523386e-03, 2.97017526e-03,
       9.89005296e-04, 3.20695727e-04, 1.92351699e-02, 4.18609844e-04,
       1.02999229e-03, 1.78868124e-03, 5.41567474e-04, 1.05280353e-03,
       8.46999943e-04, 1.43264628e-03, 2.46888252e-03, 2.15878994e-03,
       9.90785315e-04, 8.04257087e-04, 4.44009255e-03, 6.22252212e-04,
       5.09247981e-03, 9.19640998e-03]), 'param_max_depth': masked_array(data=[5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 15, 15, 15, 15, 15,
                   15, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_n_estimators': masked_array(data=[120, 200, 300, 500, 800, 1200, 120, 200, 300, 500, 800,
                   1200, 120, 200, 300, 500, 800, 1200, 120, 200, 300,
                   500, 800, 1200, 120, 200, 300, 500, 800, 1200],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'max_depth': 5, 'n_estimators': 120}, {'max_depth': 5, 'n_estimators': 200}, {'max_depth': 5, 'n_estimators': 300}, {'max_depth': 5, 'n_estimators': 500}, {'max_depth': 5, 'n_estimators': 800}, {'max_depth': 5, 'n_estimators': 1200}, {'max_depth': 8, 'n_estimators': 120}, {'max_depth': 8, 'n_estimators': 200}, {'max_depth': 8, 'n_estimators': 300}, {'max_depth': 8, 'n_estimators': 500}, {'max_depth': 8, 'n_estimators': 800}, {'max_depth': 8, 'n_estimators': 1200}, {'max_depth': 15, 'n_estimators': 120}, {'max_depth': 15, 'n_estimators': 200}, {'max_depth': 15, 'n_estimators': 300}, {'max_depth': 15, 'n_estimators': 500}, {'max_depth': 15, 'n_estimators': 800}, {'max_depth': 15, 'n_estimators': 1200}, {'max_depth': 25, 'n_estimators': 120}, {'max_depth': 25, 'n_estimators': 200}, {'max_depth': 25, 'n_estimators': 300}, {'max_depth': 25, 'n_estimators': 500}, {'max_depth': 25, 'n_estimators': 800}, {'max_depth': 25, 'n_estimators': 1200}, {'max_depth': 30, 'n_estimators': 120}, {'max_depth': 30, 'n_estimators': 200}, {'max_depth': 30, 'n_estimators': 300}, {'max_depth': 30, 'n_estimators': 500}, {'max_depth': 30, 'n_estimators': 800}, {'max_depth': 30, 'n_estimators': 1200}], 'split0_test_score': array([0.77578475, 0.79820628, 0.78475336, 0.79820628, 0.79820628,
       0.79820628, 0.75784753, 0.75784753, 0.76681614, 0.76233184,
       0.75784753, 0.75784753, 0.78026906, 0.78026906, 0.78475336,
       0.78475336, 0.78475336, 0.78026906, 0.78026906, 0.78475336,
       0.78026906, 0.78475336, 0.78475336, 0.78475336, 0.78475336,
       0.78475336, 0.78475336, 0.78475336, 0.78475336, 0.78026906]), 'split1_test_score': array([0.81165919, 0.80717489, 0.80717489, 0.81165919, 0.81165919,
       0.81165919, 0.78923767, 0.80717489, 0.80269058, 0.78923767,
       0.79372197, 0.79372197, 0.78026906, 0.78026906, 0.78026906,
       0.78475336, 0.78026906, 0.78026906, 0.78475336, 0.78026906,
       0.78026906, 0.78026906, 0.78026906, 0.78026906, 0.78475336,
       0.78026906, 0.78026906, 0.78475336, 0.78026906, 0.78026906]), 'split2_test_score': array([0.83783784, 0.83333333, 0.83783784, 0.83783784, 0.83783784,
       0.83783784, 0.85135135, 0.85135135, 0.85135135, 0.85135135,
       0.84684685, 0.84684685, 0.84234234, 0.82432432, 0.84234234,
       0.82882883, 0.82432432, 0.84234234, 0.82432432, 0.84684685,
       0.82882883, 0.84234234, 0.84234234, 0.84684685, 0.84684685,
       0.81081081, 0.84684685, 0.82882883, 0.81081081, 0.82432432]), 'mean_test_score': array([0.80842726, 0.81290483, 0.80992203, 0.8159011 , 0.8159011 ,
       0.8159011 , 0.79947885, 0.80545792, 0.80695269, 0.80097362,
       0.79947212, 0.79947212, 0.80096015, 0.79495415, 0.80245492,
       0.79944519, 0.79644892, 0.80096015, 0.79644892, 0.80395642,
       0.79645565, 0.80245492, 0.80245492, 0.80395642, 0.80545119,
       0.79194441, 0.80395642, 0.79944519, 0.79194441, 0.79495415]), 'std_test_score': array([0.02543594, 0.01490194, 0.02175853, 0.0164552 , 0.0164552 ,
       0.0164552 , 0.03885359, 0.03819208, 0.0346427 , 0.0372775 ,
       0.03656061, 0.03656061, 0.02926163, 0.02076785, 0.02826402,
       0.02077737, 0.01979572, 0.02926163, 0.01979572, 0.03038331,
       0.0228913 , 0.02826402, 0.02826402, 0.03038331, 0.02927115,
       0.01346559, 0.03038331, 0.02077737, 0.01346559, 0.02076785]), 'rank_test_score': array([ 6,  4,  5,  1,  1,  1, 19,  8,  7, 16, 20, 20, 17, 27, 13, 22, 25,
       17, 25, 10, 24, 13, 13, 10,  9, 29, 10, 22, 29, 27], dtype=int32)}

3.6.6 Summary

  • Achieves excellent accuracy among current algorithms
  • Runs efficiently on large datasets and handles input samples with high-dimensional features, without requiring dimensionality reduction
  • Can assess the importance of each feature for the classification problem
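As a minimal sketch of the last point, a fitted forest exposes feature_importances_ (the tiny dataset below is made up purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: feature 0 perfectly predicts the label, feature 1 is mostly noise
X = [[0, 0], [1, 1], [0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1, 0, 1]

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)  # one score per feature; the scores sum to 1
```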