Task 4: Modeling and Hyperparameter Tuning

一梦花海

Article link: https://blog.csdn.net/qq_34844201/article/details/105230056

4.1 Learning objectives

4.2 Content overview

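Linear regression model:
- What linear regression requires of the features;
- Handling long-tailed distributions;
- Understanding the linear regression model;
Model performance validation:
- Evaluation functions and objective functions;
- Cross validation;
- Leave-one-out validation;
- Validation for time series problems;
- Plotting learning curves;
- Plotting validation curves;
Embedded feature selection:
- Lasso regression;
- Ridge regression;
- Decision trees;
Model comparison:
- Common linear models;
- Common nonlinear models;
Hyperparameter tuning:
- Greedy tuning;
- Grid search tuning;
- Bayesian tuning;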

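4.3 Related theory and recommended reading

The theory behind these algorithms is too long to cover here, so this article points beginners to blogs and textbooks instead.

4.4 Code examples

4.4.1 Simple modeling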

from sklearn.linear_model import LinearRegression

# The `normalize` argument was removed in scikit-learn 1.2; older versions still accept normalize=True.
model = LinearRegression()
model = model.fit(train_X, train_y)

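Plot feature v_9 against the label as a scatter plot. The model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions fall below 0, which indicates that the model has problems.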

import matplotlib.pyplot as plt

# subsample_index is assumed to hold a small set of row indices chosen for plotting.
plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price', 'Predicted Price'], loc='upper right')
print('The predicted price is obviously different from the true price')
plt.show()

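The plot shows that the label (price) follows a long-tailed distribution, which hurts modeling. Many models assume that the error term is normally distributed, and long-tailed data violates that assumption. Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829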

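import numpy as np
import seaborn as sns

print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it.
sns.histplot(train_y, kde=True)
plt.subplot(1, 2, 2)
sns.histplot(train_y[train_y < np.quantile(train_y, 0.9)], kde=True)
plt.show()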
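A common remedy for a long-tailed label is to model its logarithm and map predictions back afterwards. The sketch below illustrates this on synthetic data; the `prices` array and the `skewness` helper are illustrative stand-ins and not part of the original notebook, and the log transform is one possible choice rather than the only one.

```python
import numpy as np

# Synthetic long-tailed "price" data; a stand-in for train_y, not the real dataset.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)


def skewness(x):
    """Sample skewness (third standardized moment); 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# log1p(y) = log(1 + y) compresses the long right tail.
prices_ln = np.log1p(prices)

# expm1 is the inverse of log1p; use it to bring predictions back to the price scale.
prices_back = np.expm1(prices_ln)

print("skewness before:", skewness(prices))
print("skewness after: ", skewness(prices_ln))
```

A model trained on the transformed label predicts log prices, so its output should go through `np.expm1` before being compared with real prices.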

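When training a model, the full data set is commonly split into three parts (as with the MNIST handwritten digit data): a training set (train_set), a validation set (valid_set), and a test set (test_set). The split exists so that training quality can be judged reliably. The test set is easy to understand: it never takes part in training and is used only to measure final performance. The training and validation sets relate to the idea described next.

In practice, a trained model usually fits its own training set well (the result is sensitive to initial conditions), but fits data outside the training set less well. So we usually do not train on all of the data. We hold out a part that takes no part in training and use it to test the fitted parameters, which gives a more objective estimate of how well they match unseen data. Repeating this so that each part of the data takes a turn as the held-out part is called cross validation.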
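The hold-out idea above can be sketched as a 5-fold cross validation loop. The data here is synthetic and the choice of LinearRegression and MAE is an assumption made for illustration; it is not the article's dataset or final model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for (train_X, train_y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each of the 5 folds is held out once while the other 4 are used for fitting.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mae = []
for train_idx, valid_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_mae.append(mean_absolute_error(y[valid_idx], pred))

print("MAE per fold:", np.round(fold_mae, 4))
print("mean MAE:", np.mean(fold_mae))
```

Averaging the per-fold errors gives a steadier estimate than a single split, at the cost of fitting the model once per fold.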

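4.4.2 Random forest modeling and tuning

The parameters of GradientBoostingRegressor are listed below. The signature comes from an older scikit-learn release; newer releases rename loss='ls' to loss='squared_error' and no longer accept presort.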

sklearn.ensemble.GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, init=None, random_state=None, max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')

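Grid search code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Random forest model
rf = RandomForestRegressor()
n_estimators = [300, 500]  # number of trees; the list holds the candidate values
max_depth = [6, 8]         # tree depth; the list holds the candidate values
param_grid = {"max_depth": max_depth, "n_estimators": n_estimators}

gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
gs = gs.fit(train_x, train_y)
print('best_score_:', gs.best_score_)
print('best_params_:', gs.best_params_)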

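best_params_ holds the best parameter combination. The principle of grid search is simple: train the model once for every combination of the candidate parameter values.

Drawback: it is slow. If the grid is n×n rather than 2×2, and every combination is further multiplied by the k folds of cross validation, the wait becomes very long.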
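To make the cost concrete, the number of model fits can be counted before starting a search. This sketch uses scikit-learn's ParameterGrid with the same 2×2 grid and cv=5 as the grid search code; the `grid_fits` helper is a hypothetical name introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"max_depth": [6, 8], "n_estimators": [300, 500]}
cv = 5

# One candidate per combination of values, and one fit per candidate per fold.
n_candidates = len(ParameterGrid(param_grid))
total_fits = n_candidates * cv
print(n_candidates, "candidates,", total_fits, "fits")


def grid_fits(values_per_param, n_params, folds):
    """Fits needed when every one of n_params parameters has values_per_param candidates."""
    return values_per_param ** n_params * folds


# Going from 2 to 10 candidate values for each of the two parameters:
print(grid_fits(2, 2, cv), "->", grid_fits(10, 2, cv))
```

With the default refit=True, GridSearchCV also refits the best candidate once on the full training data after the search.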

Advice: grid search is fine for small data sets. For larger data sets it takes too long, and Bayesian optimization is usually the better choice.

Bayesian optimization code:

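from bayes_opt import BayesianOptimization
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

# Bayesian optimization for the random forest.
# train_y_ln is assumed to be the log-transformed label, e.g. np.log1p(train_y).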
def rf_cv(n_estimators, max_depth, min_samples_split,max_features, min_samples_leaf):
    val = cross_val_score(
        RandomForestRegressor(
            min_samples_leaf=int(min_samples_leaf),
            min_samples_split=int(min_samples_split),
            n_estimators=int(n_estimators),
            max_depth=int(max_depth),
            max_features=int(max_features)
        ),
        train_x, train_y_ln, scoring=make_scorer(mean_absolute_error), cv=4,verbose=0
    ).mean()
    return 1 - val  # maximizing 1 - MAE is equivalent to minimizing MAE
rf_bo = BayesianOptimization(
        rf_cv,
        {'n_estimators': (60, 250),
        'min_samples_leaf': (1, 80),
        'min_samples_split': (20, 150),
        'max_depth':(5, 30),
        'max_features':(4,10)}
    )

rf_bo.maximize()

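The run prints every parameter combination it tried. Pick the one with the largest target, which is the one with the smallest MAE because the objective returns 1 - MAE, and use it to build the model.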
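Picking the best combination can also be done in code instead of by reading the log. To my knowledge, bayes_opt exposes the best result as `rf_bo.max`, a dict with `target` and `params` keys. The sketch below mimics that shape with hypothetical values so that it runs without the library; the numbers are made up and `to_rf_kwargs` is an illustrative helper.

```python
def to_rf_kwargs(params):
    """Truncate the optimizer's float-valued parameters to the ints the forest expects."""
    return {name: int(value) for name, value in params.items()}


# Hypothetical optimization results in the {'target': ..., 'params': {...}} shape.
results = [
    {"target": 0.81, "params": {"max_depth": 12.7, "n_estimators": 180.2}},
    {"target": 0.86, "params": {"max_depth": 21.3, "n_estimators": 95.9}},
    {"target": 0.79, "params": {"max_depth": 6.1, "n_estimators": 240.0}},
]

# Largest target wins; since target = 1 - MAE, this is also the smallest MAE.
best = max(results, key=lambda r: r["target"])
best_mae = 1 - best["target"]
best_kwargs = to_rf_kwargs(best["params"])

print(best_kwargs, "MAE:", round(best_mae, 2))
```

The cast uses int() to match the objective function, which also truncates each value before building the forest.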

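Advantage: each parameter only needs a reasonable range rather than exact candidate values, and training time drops considerably. I personally recommend Bayesian optimization.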

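Finally, build the model with the chosen parameters and predict:

forest = RandomForestRegressor(max_depth=20, n_estimators=63, min_samples_split=26, min_samples_leaf=5, max_features=9)
forest_2 = forest.fit(train_x, train_y)
# If the model is fit on a log-transformed label instead, map predictions back with np.expm1.
y = forest_2.predict(test_x)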

4.4.3 LightGBM model and tuning

## Candidate LightGBM parameter values:

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

Grid search tuning:

from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
clf = GridSearchCV(model, parameters, cv=5)
clf = clf.fit(train_X, train_y)

clf.best_params_

The best parameters found:

{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}

Build the model with the best parameters found above:

model = LGBMRegressor(objective='regression',
                      num_leaves=55,
                      max_depth=15)
