Data Processing (2.1): Click Data Processing - LightGBM Training in Practice

Sober-C

Article link: https://blog.csdn.net/xm961217/article/details/106615914

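This article lists the LightGBM (lgb) training function used in the previous article, which focused on preprocessing and postprocessing in detail.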

import lightgbm as lgb
import numpy as np

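1. Input parameters

The function takes the following inputs:

x_train: feature columns of the training set

y_train: label column of the training set

x_test: feature columns of the validation set

y_test: label column of the validation set

cate_cols: the categorical features

job: the task type, 'classification' by default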

def base_train(x_train, y_train, x_test, y_test, cate_cols=None, job='classification'):

2. Check whether cate_cols was provided; if not, set it to 'auto'

    if not cate_cols:
        cate_cols = 'auto'

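3. Convert to Dataset objects and build the validation set

The validation Dataset must be given the training Dataset through reference, so that it is built with the same feature binning as the training data.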

    lgb_train = lgb.Dataset(x_train, y_train, categorical_feature=cate_cols)
    lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train, categorical_feature=cate_cols)

4. Choose the training parameters according to job

In this walkthrough we use the classification task.

Official documentation: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html

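boosting_type

Selects the boosting algorithm. 'gbdt' is the traditional Gradient Boosting Decision Tree, 'dart' is Dropouts meet Multiple Additive Regression Trees (gradient boosting with dropout applied to the trees), 'goss' is Gradient-based One-Side Sampling, and 'rf' is random forest.

objective

Specifies the learning task and the corresponding learning objective, or a custom objective function. In the scikit-learn wrappers the defaults are 'regression' for LGBMRegressor, 'binary' or 'multiclass' for LGBMClassifier, and 'lambdarank' for LGBMRanker.

num_leaves

Maximum number of leaves in each base learner (tree)

learning_rate

Learning rate (the shrinkage applied to each boosting step)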

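feature_fraction

Fraction of features randomly selected for each tree; 0.9 means each tree is trained on 90% of the features

bagging_fraction

Fraction of the training rows randomly sampled (without resampling) when bagging is enabled

bagging_freq

Bagging frequency: 5 means a new sample is drawn every 5 iterations, and 0 disables bagging

verbose

Verbosity of LightGBM's logging; higher values print more, and values above 1 enable debug output

use_missing

Set to False to disable LightGBM's special handling of missing values

boost_from_average

Whether to initialize the score from the label average for faster convergence; it is set to False here

(These descriptions follow the Parameters page of the LightGBM documentation.)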

n_jobs

Number of parallel threads

    if job == 'classification':
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'binary_logloss',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 2,
            'use_missing': False,
            'boost_from_average': False,
            'n_jobs': -1
        }
    elif job == 'regression':
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 2,
            "n_jobs": -1
        }
    else:
        raise ValueError("job must be 'classification' or 'regression', got {!r}".format(job))
    print('Starting training...')
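The branching above can also be factored into a small helper, which makes the parameter choice checkable without any training data. This is only a minimal sketch: the name `build_params` and the split into shared and job-specific settings are assumptions made for illustration, not part of the original function.

```python
def build_params(job="classification"):
    """Return a LightGBM-style parameter dict for the given job.

    Hypothetical helper: it only repackages the values used in this article.
    """
    # Settings shared by both jobs.
    common = {
        "boosting_type": "gbdt",
        "num_leaves": 31,
        "learning_rate": 0.05,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": 2,
        "n_jobs": -1,
    }
    if job == "classification":
        specific = {
            "objective": "binary",
            "metric": "binary_logloss",
            "use_missing": False,
            "boost_from_average": False,
        }
    elif job == "regression":
        specific = {"objective": "regression", "metric": ["l2", "l1"]}
    else:
        # Fail early with a message that names the offending value.
        raise ValueError("unknown job: {!r}".format(job))
    # Merge the two dicts; job-specific keys win on conflict.
    return {**common, **specific}


clf_params = build_params("classification")
reg_params = build_params("regression")
print(clf_params["objective"], reg_params["objective"])
```

Keeping the shared values in one place avoids the two dictionaries drifting apart when a setting such as learning_rate is tuned.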

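5. Calling the training function

lgb_train

Training data

num_boost_round=1000

Maximum number of boosting iterations

valid_sets

Validation dataset(s)

early_stopping_rounds

Stop training when the validation metric has not improved for this many consecutive rounds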

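    # train; early stopping monitors the metric on valid_sets
    # Note: LightGBM 4.0 removed the early_stopping_rounds argument of lgb.train;
    # on 4.0+ pass callbacks=[lgb.early_stopping(stopping_rounds=5)] instead.
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=1000,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)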
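To make the early-stopping rule concrete, the following pure-Python sketch applies the same patience idea to a made-up list of validation losses. It is a simplified illustration, not LightGBM's implementation; the function name `best_round_with_patience` and the loss values are hypothetical.

```python
def best_round_with_patience(valid_losses, patience):
    """Return (best_round, stopped_round), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive rounds.
    """
    best_loss = float("inf")
    best_round = 0
    for round_no, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss = loss
            best_round = round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_no
    # Ran out of rounds without triggering early stopping.
    return best_round, len(valid_losses)


# Made-up validation losses, one per boosting round.
losses = [0.69, 0.60, 0.55, 0.56, 0.57, 0.56, 0.58, 0.59, 0.54]
best, stopped = best_round_with_patience(losses, patience=5)
print(best, stopped)
```

With a patience of 5 this run stops at round 8 and keeps round 3, so it never sees the lower loss at round 9. A small patience such as the 5 used above saves training time but can stop on a temporary plateau.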

6. Save the model

    print('Saving model...')
    gbm.save_model("./model.txt")

7. Predict on the test set (here the same data used for validation)

num_iteration=gbm.best_iteration

Predict with the best iteration found by early stopping

    y_pred_prob = gbm.predict(x_test, num_iteration=gbm.best_iteration)

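8. Model evaluation

Required imports

from sklearn.metrics import precision_score, recall_score, roc_auc_score

Call roc_auc_score

Pass in the validation labels and the predicted probabilities for the validation set; comparing them yields the AUC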

    if job == 'classification':
        res_auc = roc_auc_score(y_test, y_pred_prob)
        print("AUC: {}".format(res_auc))
        # if res_auc < 0.75:
        #     logging.error("AUC too low, training may have gone wrong; please recheck. Terminated!")
        #     sys.exit(3)
        for i in np.arange(0.1, 1, 0.1):
            print("threshold is {}: ".format(i))
            evaluation(y_test, y_pred_prob, threshold=i)
    elif job == 'regression':
        pass
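AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs to illustrate that definition. `pairwise_auc` is a hypothetical helper and is much slower than `roc_auc_score`, so it is meant for intuition only.

```python
def pairwise_auc(y_true, y_score):
    """Brute-force AUC over all (positive, negative) pairs of samples."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Small made-up example: two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. Because AUC depends only on the ranking of the scores, it does not change with the threshold, unlike precision and recall.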

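The evaluation function

It takes the validation labels and the predicted probabilities for the validation set

It converts the probabilities into 0/1 predictions with the given threshold and compares them against the labels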

def evaluation(y_true, y_pred_prob, threshold=0.5):
    # # eval
    # print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
    # lightgbm
    y_pred = np.where(y_pred_prob > threshold, 1, 0)

    res = precision_score(y_true, y_pred)
    print("precision_score : {}".format(res))
    res = recall_score(y_true, y_pred)
    print("recall_score : {}".format(res))
    res = roc_auc_score(y_true, y_pred_prob)
    print("roc_auc_score : {}".format(res))

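precision_score = ``tp / (tp + fp)``

recall_score = ``tp / (tp + fn)``

tp: a positive sample predicted as positive (true positive)

fn: a positive sample predicted as negative (false negative)

fp: a negative sample predicted as positive (false positive)

tn: a negative sample predicted as negative (true negative)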
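As a worked example of these definitions, the sketch below counts tp, fn, fp and tn for a small made-up set of labels and probabilities, then derives precision and recall from the counts. The helper name `confusion_counts` and the data are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, fp, tn) for binary 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


# Made-up labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.45]

# Same rule as in the evaluation function: strictly above the threshold means 1.
threshold = 0.5
y_pred = [1 if p > threshold else 0 for p in y_prob]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
print(tp, fn, fp, tn, precision, recall)
```

With these numbers tp=2, fn=2, fp=1 and tn=3, so precision is 2/3 and recall is 0.5. Lowering the threshold turns more samples into predicted positives, which tends to raise recall at the cost of precision.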

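9. Feature importance

feature_importance

Computes the importance of each feature (by total gain) and prints the features in descending order; features with no weight are listed separately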

def feature_importance(gbm):
    importance = gbm.feature_importance(importance_type='gain')
    names = gbm.feature_name()
    print("-" * 10 + 'feature_importance:')
    no_weight_cols = []
    for name, score in sorted(zip(names, importance), key=lambda x: x[1], reverse=True):
        if score <= 1e-8:
            no_weight_cols.append(name)
        else:
            print('{}: {}'.format(name, score))
    print("no weight columns: {}".format(no_weight_cols))
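For downstream use it can be handier to return the ranking instead of printing it. The following sketch works on plain name and score lists, so it runs without a trained model; `split_by_importance` and the sample feature names are hypothetical.

```python
def split_by_importance(names, scores, eps=1e-8):
    """Split features into ranked useful ones and unused ones.

    Returns (useful, unused): `useful` is a list of (name, score) pairs
    sorted by descending score, `unused` lists names with score <= eps.
    """
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    useful = [(name, score) for name, score in ranked if score > eps]
    unused = [name for name, score in ranked if score <= eps]
    return useful, unused


# Made-up feature names and gain scores.
names = ["age", "city", "clicks", "device"]
scores = [12.5, 0.0, 40.0, 3.2]
useful, unused = split_by_importance(names, scores)
print(useful)
print(unused)
```

With a trained booster, the inputs would be `gbm.feature_name()` and `gbm.feature_importance(importance_type='gain')`, the same two calls used in the function above.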

10. Return the gbm model

Training is finished; this is the last line of base_train

    return gbm
