The Basic Idea of Hyperparameter Tuning for Random Forests in Machine Learning

Latest recommended article published on 2024-07-01 18:30:25

深夜起床

Tags: machine learning, random forest, algorithm

Article link: https://blog.csdn.net/qq_47630787/article/details/133612745

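1. Identify the goal:
Improve the model's accuracy on unseen data, as measured by score and oob_score.
The quantity that measures accuracy on unseen data is the **generalization error**.
When a model performs poorly on unseen data, its generalization error is large.
Generalization error depends on model complexity: as complexity increases, the error first falls to a minimum and then rises again.
To the left of the minimum the model underfits; to the right it overfits.
A random forest is inherently a high-complexity model, so tuning mostly means adjusting parameters to reduce complexity.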
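(1) A model that is too complex or too simple has a high generalization error; the goal is the balance point in between.
(2) A model that is too complex overfits; one that is too simple underfits.
(3) For tree models and tree ensembles, the deeper the trees and the more leaves they have, the more complex the model.
(4) For trees and tree ensembles, tuning therefore aims at reducing model complexity, which moves the model to the left.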
[Figure: generalization error versus model complexity]
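
The relationship between complexity and error described above can be observed on a single decision tree, where max_depth controls complexity directly. The following is a minimal sketch and not part of the original walkthrough: it assumes scikit-learn's validation_curve, the breast cancer data used later in this post, and an arbitrary depth grid with cv=5.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Assumed depth grid: small depths are simple models, large depths complex ones.
depths = np.arange(1, 11)
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Average over the folds: one training and one validation score per depth.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
for d, tr, cv in zip(depths, train_mean, cv_mean):
    print(f"max_depth={d:2d}  train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```

Training accuracy typically keeps rising with depth while the cross-validated accuracy stops improving; a growing gap between the two is the usual sign that the model has moved to the overfitting side.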

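Parameters in order of influence
n_estimators: raises accuracy until it plateaus; does not affect the complexity of individual trees; effect is monotonic (↑)
max_depth: effect can go either way; the default is unlimited depth (maximum complexity), so lowering it monotonically reduces complexity
min_samples_leaf: effect can go either way; the default is the minimum, 1 (maximum complexity), so raising it monotonically reduces complexity
min_samples_split: effect can go either way; the default is the minimum, 2 (maximum complexity), so raising it monotonically reduces complexity
max_features: effect can go either way; the default is the square root of the total number of features ("sqrt", called "auto" in older scikit-learn versions), which sits at intermediate complexity, so it can be moved in either direction
criterion: effect can go either way; the default, gini, is generally used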
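
The defaults listed above can be checked against the installed scikit-learn version. This is a small sketch, assuming only the public get_params() method; the printed value of max_features depends on the version ("sqrt" in recent releases, "auto" in older ones).

```python
import math
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Read the defaults from an untouched estimator.
params = RandomForestClassifier().get_params()
for name in ["n_estimators", "max_depth", "min_samples_leaf",
             "min_samples_split", "max_features", "criterion"]:
    print(f"{name:18s} default = {params[name]!r}")

# With the square-root rule, the breast cancer data (30 features)
# gives int(sqrt(30)) = 5 candidate features per split.
n_features = load_breast_cancer().data.shape[1]
sqrt_features = int(math.sqrt(n_features))
print(n_features, sqrt_features)
```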

1. Tune n_estimators first

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# Load the data
data = load_breast_cancer()
# print(data.data.shape, data.target)

rfc = RandomForestClassifier(n_estimators=100,random_state=90) # instantiate the model
score_pre = cross_val_score(rfc, data.data, data.target, cv=10).mean()
print("score_pre=", score_pre)

# Step 1 of random forest tuning: tune n_estimators
scorel = []
for i in range(0,200,10):
    rfc = RandomForestClassifier(n_estimators=i+1,n_jobs=-1,random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)
print(max(scorel), (scorel.index(max(scorel))*10)+1) # best score and its n_estimators (candidates are 1, 11, ..., 191)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()
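
The index arithmetic in the print call above is easy to get wrong when the step or the start of the range changes. One alternative, sketched here under assumptions, is to keep the candidate list explicit and look the best value up by index. The candidate list and cv=5 are deliberately smaller than in the text so that the example runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical, deliberately short candidate list.
candidates = [1, 11, 21, 31]
scores = []
for n in candidates:
    rfc = RandomForestClassifier(n_estimators=n, random_state=90)
    scores.append(cross_val_score(rfc, X, y, cv=5).mean())

# Look the best candidate up by position instead of recomputing it
# from the loop step.
best_idx = scores.index(max(scores))
best_n, best_score = candidates[best_idx], scores[best_idx]
print(best_n, round(best_score, 4))
```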

Fine-tuning

scorel = []
for i in range(65,75):
    rfc = RandomForestClassifier(n_estimators=i,n_jobs=-1,random_state=90)
    score = cross_val_score(rfc, data.data, data.target, cv=10).mean()
    scorel.append(score)
print(max(scorel), [*range(65,75)][scorel.index(max(scorel))]) # best score and the n_estimators value that produced it
plt.figure(figsize=[20,5])
plt.plot(range(65,75),scorel)
plt.show()
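
The fine-tuning range is a narrow window around the best value from the coarse pass. The helper below is a hypothetical illustration of that step; refine_window and its arguments are not from the original post, and the centre value 70 is only an input chosen so that the window matches range(65, 75).

```python
def refine_window(center, half_width, lower=1):
    """Return the integer range [center - half_width, center + half_width),
    clipped so that it never starts below `lower`."""
    start = max(lower, center - half_width)
    return range(start, center + half_width)

print(list(refine_window(70, 5)))  # same values as range(65, 75)
print(list(refine_window(3, 5)))   # window clipped at the lower bound 1
```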

2. Grid search
First pick a wide range, then narrow it down.
2.1 Tune max_depth

# max_depth
param_grid = {"max_depth":np.arange(1,20,1)}
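# The range is usually found by trial, based on the size of the data: 1-10 for small data sets, 30-50 for large ones
# It is best to plot a learning curve as well
rfc = RandomForestClassifier(n_estimators=73 # value found in the previous step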
                             ,random_state=90
                             )
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)
print( GS.best_params_, GS.best_score_)
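
GridSearchCV reports only the best point by default, but the scores for every grid point are kept in cv_results_ and can be used to draw the learning curve mentioned in the comments. This is a sketch under assumptions: the forest size, the grid, and cv are reduced from the values in the text so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Smaller forest, grid and cv than in the text, only to keep the sketch fast.
depth_grid = [1, 2, 4, 8, 16]
gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=90),
    {"max_depth": depth_grid},
    cv=5,
)
gs.fit(X, y)

# cv_results_ holds one parameter dict and one mean CV score per grid point.
mean_scores = gs.cv_results_["mean_test_score"]
for params, score in zip(gs.cv_results_["params"], mean_scores):
    print(f"max_depth={params['max_depth']:2d}  mean cv accuracy={score:.4f}")
print(gs.best_params_, gs.best_score_)
```

The (max_depth, score) pairs printed here are what one would pass to plt.plot to see whether the score is still rising, flat, or falling at the edge of the grid.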

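With max_depth restricted, the score drops: the model is pushed to the left (made simpler) and the generalization error goes up.
Likewise, min_samples_leaf and min_samples_split also push the model to the left, since they reduce complexity too.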
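
When choosing a range for max_depth, it helps to know how deep the unrestricted trees actually grow, because values above that depth change nothing. A minimal sketch, assuming the get_depth() method of the fitted trees in estimators_ and a small forest of 20 trees for speed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Unrestricted forest: each tree grows until it cannot be split further.
full = RandomForestClassifier(n_estimators=20, random_state=90).fit(X, y)
full_depths = [tree.get_depth() for tree in full.estimators_]
print("unrestricted depths: min", min(full_depths), "max", max(full_depths))

# Restricted forest: no tree may exceed the requested depth.
capped = RandomForestClassifier(n_estimators=20, max_depth=3,
                                random_state=90).fit(X, y)
capped_depths = [tree.get_depth() for tree in capped.estimators_]
print("capped depths: max", max(capped_depths))
```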
2.2 Tune max_features
Only the parameter grid passed to the grid search needs to be replaced.

param_grid = {"max_features":np.arange(5,30,1)}
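# max_features defaults to the square root of the number of features; raising it pushes the model to the right and increases the generalization error
# It is best to plot a learning curve as well
rfc = RandomForestClassifier(n_estimators=73 # value found in the previous step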
                             ,random_state=90
                             )
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)
print(GS.best_params_, GS.best_score_)

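The result shows that accuracy drops once max_features is raised. This suggests that no remaining parameter can move performance in either direction and that the residual error comes from noise.
At that point, consider switching to a different algorithm or improving the data preprocessing.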
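
The lower end of the max_features grid, 5, is consistent with the square-root rule on this data set. The sketch below reads the resolved value from a fitted tree; it assumes the max_features_ attribute of scikit-learn's fitted decision trees and passes max_features="sqrt" explicitly so that the result does not depend on the installed version's default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small forest is enough here; only the resolved setting is inspected.
rfc = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                             random_state=90)
rfc.fit(X, y)

# Each fitted tree stores the resolved integer in max_features_.
resolved = rfc.estimators_[0].max_features_
print(X.shape[1], resolved)
```

With 30 features the rule resolves to 5, so a grid from 5 to 29 only explores values at or above the default, that is, moves toward higher complexity.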

2.3 Tune min_samples_leaf

param_grid = {"min_samples_leaf":np.arange(1,10,1)}

Increasing min_samples_leaf simplifies the model and pushes it to the left; here the generalization error increases.
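
Since sections 2.3 to 2.5 only change the parameter grid, the repeated search can be wrapped in one function. search_one below is a hypothetical helper, not from the original post, and it uses a smaller forest and cv than the text so that it runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

def search_one(param_name, values, n_estimators=20, cv=5):
    """Grid-search a single forest parameter; return (best value, best score).

    n_estimators and cv are reduced from the values used in the text.
    """
    rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=90)
    gs = GridSearchCV(rfc, {param_name: list(values)}, cv=cv)
    gs.fit(X, y)
    return gs.best_params_[param_name], gs.best_score_

best_leaf, leaf_score = search_one("min_samples_leaf", range(1, 6))
print("min_samples_leaf:", best_leaf, round(leaf_score, 4))
```

The same helper could be called with "min_samples_split" and range(2, 20), or with "criterion" and ["gini", "entropy"].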
2.4 Tune min_samples_split

param_grid = {"min_samples_split":np.arange(2,20,1)}

Increasing min_samples_split simplifies the model and pushes it to the left; here the generalization error increases.

2.5 Tune criterion

param_grid = {"criterion":["gini","entropy"]}
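
After the last search, it is worth checking whether a tuned setting actually beats the starting score (score_pre in the first code listing). The following sketch compares two configurations with the same cross-validation; cv_accuracy is a hypothetical helper, and the candidate settings are placeholders, not results from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def cv_accuracy(**params):
    """Mean cross-validated accuracy of a forest built with `params`."""
    rfc = RandomForestClassifier(random_state=90, **params)
    return cross_val_score(rfc, X, y, cv=5).mean()

baseline = cv_accuracy(n_estimators=20)
# Placeholder settings: substitute whatever the searches above returned.
candidate = cv_accuracy(n_estimators=20, criterion="entropy")
print(f"baseline={baseline:.4f}  candidate={candidate:.4f}  "
      f"diff={candidate - baseline:+.4f}")
```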
