随机森林
RandomForestClassifier
随机森林是非常具有代表性的bagging集成算法,它的所有基评估器都是决策树。分类树组成的森林就叫做随机森林分类器,回归树集成的森林就叫做随机森林回归器。这里使用的是随机森林分类器
重点参数
n_estimators 森林中的数量,一般越大越好,但要与所需性能平衡
from sklearn. tree import DecisionTreeClassifier
from sklearn. ensemble import RandomForestClassifier
from sklearn. datasets import load_wine
% matplotlib inline
wine = load_wine( )
from sklearn. model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split( wine. data, wine. target, test_size= 0.3 )
clf = DecisionTreeClassifier( random_state= 0 )
rfc = RandomForestClassifier( random_state= 0 )
clf = clf. fit( xtrain, ytrain)
rfc = rfc. fit( xtrain, ytrain)
score_c = clf. score( xtest, ytest)
score_r = rfc. score( xtest, ytest)
print ( f'决策树的评分{score_c}' , f'随机森林的评分{score_c}' )
决策树的评分0.9259259259259259 随机森林的评分0.9259259259259259
from sklearn. model_selection import cross_val_score
import matplotlib. pyplot as plt
rfc = RandomForestClassifier( n_estimators= 25 )
rfc_s = cross_val_score( rfc, wine. data, wine. target, cv= 10 )
clf = DecisionTreeClassifier( )
clf_s = cross_val_score( clf, wine. data, wine. target, cv= 10 )
plt. plot( range ( 1 , 11 ) , rfc_s, label= "RandomForest" )
plt. plot( range ( 1 , 11 ) , clf_s, label= "DecisionTree" )
plt. legend( )
plt. show( )
对比n_estimators对模型效果的影响
superpa = [ ]
for i in range ( 200 ) :
rfc = RandomForestClassifier( n_estimators= i+ 1 , n_jobs= - 1 )
rfc_s = cross_val_score( rfc, wine. data, wine. target, cv= 10 ) . mean( )
superpa. append( rfc_s)
plt. figure( )
plt. plot( range ( 1 , 201 ) , superpa)
plt. show( )
泛化误差 衡量模型在未知数据上的准确率
主观调参大法(按影响大小排序)
n_estimators max_depth min_samples_leaf min_samples_split max_features
用sklearn自带的乳腺癌数据做一个调参案例(参考菜菜的教程)
from sklearn. datasets import load_breast_cancer
from sklearn. ensemble import RandomForestClassifier
from sklearn. model_selection import GridSearchCV
from sklearn. model_selection import cross_val_score
import matplotlib. pyplot as plt
import pandas as pd
import numpy as np
导入数据集和建模
data = load_breast_cancer( )
rfc = RandomForestClassifier( n_estimators= 100 , random_state= 1 )
score_pre = cross_val_score( rfc, data. data, data. target, cv= 10 ) . mean( )
score_pre
0.9631265664160402
开始调参(n_estimators)
scorel = [ ]
for i in range ( 0 , 200 , 10 ) :
rfc = RandomForestClassifier( n_estimators= i+ 1 , n_jobs= - 1 , random_state= 90 )
score = cross_val_score( rfc, data. data, data. target, cv= 10 ) . mean( )
scorel. append( score)
print ( max ( scorel) , '使分数最高的n_estimators值为:' , scorel. index( max ( scorel) ) * 10 + 1 )
plt. figure( )
plt. plot( range ( 1 , 201 , 10 ) , scorel)
plt. show( )
0.9631265664160402 使分数最高的n_estimators值为: 71
网格搜索
param_grid = { 'max_depth' : np. arange( 1 , 20 , 1 ) }
rfc = RandomForestClassifier( n_estimators= 71 , random_state= 90 )
GS = GridSearchCV( rfc, param_grid, cv= 10 )
GS. fit( data. data, data. target)
GS. best_params_
GS. best_score_
0.9666353383458647