LGB+LR in Practice

1 Background

Many readers will have heard of the famous GBDT+LR combination: the combined model usually predicts better than either single model. I had never tried it myself until a recent project at my company used it, so here is a timely write-up~

2 How It Works

In short: first use a tree model (GBDT, XGBoost, or LightGBM) to score the samples, then convert the tree model's output into standard one-hot variables and feed them into an LR, which makes the final prediction~

  • It is a two-stage binary classifier in the spirit of stacking: the GBDT extracts features from the training set to build a new training input, and the LR is the classifier trained on that new input.
  • GBDT is naturally good at discovering discriminative features and feature combinations, cutting the manual cost of feature engineering, while LR is quick to implement and train.

A concrete demo of the transformation is shown below: the tree model's leaf outputs are converted into standard variable form and fed into the LR~

[Figure: each sample's per-tree leaf assignment is one-hot encoded and concatenated into the LR input vector]
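To make the encoding concrete, here is a minimal sketch with made-up numbers, assuming a toy model of just 2 trees with 3 leaves each: every tree contributes one block of one-hot slots, and the blocks are concatenated into the LR input vector.

import numpy as np

# Hypothetical toy case: 2 trees, 3 leaves each; the sample falls
# in leaf 1 of tree 0 and leaf 2 of tree 1.
leaf_idx = np.array([1, 2])
num_leaves = 3

# One block of num_leaves slots per tree; flip the slot the sample hit.
one_hot = np.zeros(len(leaf_idx) * num_leaves, dtype=np.int64)
one_hot[np.arange(len(leaf_idx)) * num_leaves + leaf_idx] = 1
print(one_hot)  # [0 1 0 0 0 1] -> this vector becomes the LR input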

Now let's take a concrete dataset and see how GBDT+LR performs compared with the other models.

3 Data Preparation

3.1 Loading the data

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the data and map the churn label from the strings 'True'/'False' to 1/0
df = pd.read_csv('telecom_churn.csv')
df['churn'] = df['churn'].map(str)
churn_dic = {'True':1, 'False':0}
df['churn'] = df['churn'].map(churn_dic)
print(df.shape)
df.head()
(3333, 21)
  state  account length  area code phone number international plan voice mail plan  number vmail messages  total day minutes  total day calls  total day charge  ...  total eve calls  total eve charge  total night minutes  total night calls  total night charge  total intl minutes  total intl calls  total intl charge  customer service calls  churn
0    KS             128        415     382-4657                 no             yes                     25              265.1              110             45.07  ...               99             16.78                244.7                 91               11.01                10.0                 3               2.70                       1      0
1    OH             107        415     371-7191                 no             yes                     26              161.6              123             27.47  ...              103             16.62                254.4                103               11.45                13.7                 3               3.70                       1      0
2    NJ             137        415     358-1921                 no              no                      0              243.4              114             41.38  ...              110             10.30                162.6                104                7.32                12.2                 5               3.29                       0      0
3    OH              84        408     375-9999                yes              no                      0              299.4               71             50.90  ...               88              5.26                196.9                 89                8.86                 6.6                 7               1.78                       2      0
4    OK              75        415     330-6626                yes              no                      0              166.7              113             28.34  ...              122             12.61                186.9                121                8.41                10.1                 3               2.73                       3      0

5 rows × 21 columns

df['churn'].value_counts()
0    2850
1     483
Name: churn, dtype: int64

3.2 Splitting into training and test sets

X = df[['total day calls', 'total night charge', 'number vmail messages', 'total intl charge', 'total eve calls']]
y = df['churn'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(2333, 5) (1000, 5) (2333,) (1000,)
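One caveat before modeling: the classes are imbalanced (483 positives out of 3333, roughly 14.5%). A stratified split keeps that ratio identical in train and test; a minimal sketch of the variant (the rest of this post sticks with the plain split above):

# Optional variant: preserve the class ratio across the two sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=23)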

4 LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit a baseline model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(X_train, y_train)

# Compute AUC: y_test holds the true labels, scores the predicted probability of class 1
scores = lr.predict_proba(X_test)[:,1]
LR_auc = metrics.roc_auc_score(y_test, scores)
LR_auc
LogisticRegression(random_state=23)

0.5834069949026194
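An AUC of 0.583 is barely above random. One common tweak, sketched here as an assumption rather than something this experiment did, is to standardize the features first, since plain LR is sensitive to feature scale (section 7 still compares against the unscaled LR above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical variant: standardize the 5 raw features, then fit the same LR
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(random_state=23))
lr_scaled.fit(X_train, y_train)
print(metrics.roc_auc_score(y_test, lr_scaled.predict_proba(X_test)[:, 1]))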

5 LGB

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper: it lets us use grid search and parallelism just as with sklearn's GBM

# Build the model
model_lgb = lgb.LGBMClassifier(boosting_type='gbdt',
                               objective='binary',
                               metric='auc',
                               verbose=0,
                               learning_rate=0.01,
                               num_leaves=35,
                               feature_fraction=0.8,
                               bagging_fraction=0.9,
                               bagging_freq=8,
                               lambda_l1=0.6,
                               lambda_l2=0)

# Fit the model
model_lgb.fit(X_train, y_train)

# Compute AUC: y_test holds the true labels, scores the predicted probability of class 1
scores = model_lgb.predict_proba(X_test)[:,1]
LGB_auc = metrics.roc_auc_score(y_test, scores)
LGB_auc
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000353 seconds.
You can set `force_col_wise=true` to remove the overhead.

0.601792922596423
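A side note on the warning spam: it appears because native LightGBM parameter names (feature_fraction, bagging_fraction, lambda_l1, ...) were passed through the sklearn wrapper, which has its own aliases. A sketch of the same model expressed with the sklearn-style names, which keeps the log clean:

# Same hyper-parameters via the sklearn-wrapper aliases
model_lgb2 = lgb.LGBMClassifier(boosting_type='gbdt',
                                objective='binary',
                                learning_rate=0.01,
                                num_leaves=35,
                                colsample_bytree=0.8,  # == feature_fraction
                                subsample=0.9,         # == bagging_fraction
                                subsample_freq=8,      # == bagging_freq
                                reg_alpha=0.6,         # == lambda_l1
                                reg_lambda=0.0)        # == lambda_l2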

6 LGB+LR

6.1 The LGB stage

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's sklearn wrapper: it lets us use grid search and parallelism just as with sklearn's GBM
# Build the model
lgb_param = {'boosting_type': 'gbdt',
             'objective': 'binary',
             'metric': 'auc',
             'verbose': 0,
             'learning_rate': 0.01,
             'num_leaves': 4,
             'feature_fraction': 0.8,
             'bagging_fraction': 0.9,
             'bagging_freq': 8,
             'lambda_l1': 0.6,
             'lambda_l2': 0,
             'n_estimators': 200}

'''
num_leaves: the number of leaves in each tree.
n_estimators: the number of trees; the default is 100, but here we use 200.
So: 4 leaves per tree, 200 trees.
'''

model = lgb.LGBMClassifier(boosting_type='gbdt',
                           objective='binary',
                           metric='auc',
                           verbose=0,
                           learning_rate=0.01,
                           num_leaves=4,
                           feature_fraction=0.8,
                           bagging_fraction=0.9,
                           bagging_freq=8,
                           lambda_l1=0.6,
                           lambda_l2=0,
                           n_estimators=200)

# Fit the model
model.fit(X_train, y_train)
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_col_wise=true` to remove the overhead.

LGBMClassifier(bagging_fraction=0.9, bagging_freq=8, feature_fraction=0.8,
               lambda_l1=0.6, lambda_l2=0, learning_rate=0.01, metric='auc',
               n_estimators=200, num_leaves=4, objective='binary', verbose=0)
model.get_params()
{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.01,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 200,
 'n_jobs': -1,
 'num_leaves': 4,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'metric': 'auc',
 'verbose': 0,
 'feature_fraction': 0.8,
 'bagging_fraction': 0.9,
 'bagging_freq': 8,
 'lambda_l1': 0.6,
 'lambda_l2': 0}

6.2 Exporting the LGB leaf vectors

6.2.1 Training set

import numpy as np

y_pred = model.predict(X_train, pred_leaf=True)
# With pred_leaf=True, each prediction is the leaf index the sample lands in for every tree;
# with 'num_leaves': 4 the possible positions are 0, 1, 2, 3
train_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']], dtype=np.int64)
print(train_matrix.shape)  # 2333 rows, 800 columns: 2333 samples, 200 trees x 4 leaves each = 800 variables
train_matrix
(2333, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # For each sample, offset each tree's leaf index by num_leaves * tree_index,
    # so every tree gets its own block of 4 one-hot slots
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1
lgb_output_vec_train = pd.DataFrame(train_matrix)
lgb_output_vec_train.columns = ['leaf_' + str(i) for i in lgb_output_vec_train.columns]
lgb_output_vec_train
      leaf_0  leaf_1  leaf_2  leaf_3  leaf_4  leaf_5  leaf_6  leaf_7  leaf_8  leaf_9  ...  leaf_790  leaf_791  leaf_792  leaf_793  leaf_794  leaf_795  leaf_796  leaf_797  leaf_798  leaf_799
0          1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         1         0         0         0         1         0
1          1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         0         1         0         0         1         0
2          1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         1         0         0         0         1         0
3          1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         1         0         0         0         1         0
4          1       0       0       0       1       0       0       0       1       0  ...         0         0         0         0         1         0         0         1         0         0
...      ...     ...     ...     ...     ...     ...     ...     ...     ...     ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
2328       1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0
2329       1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0
2330       1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         1         0         0         0         1         0
2331       1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         1         0         0         0         1         0
2332       1       0       0       0       0       0       1       0       0       0  ...         0         0         0         0         0         1         0         1         0         0

2333 rows × 800 columns

6.2.2 Test set

import numpy as np

y_pred = model.predict(X_test, pred_leaf=True)
# With pred_leaf=True, each prediction is the leaf index the sample lands in for every tree;
# with 'num_leaves': 4 the possible positions are 0, 1, 2, 3
test_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']], dtype=np.int64)
print(test_matrix.shape)  # 1000 rows, 800 columns: 1000 samples, 200 trees x 4 leaves each = 800 variables
test_matrix
(1000, 800)

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # For each sample, offset each tree's leaf index by num_leaves * tree_index,
    # so every tree gets its own block of 4 one-hot slots
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    test_matrix[i][temp] += 1
lgb_output_vec = pd.DataFrame(test_matrix)
lgb_output_vec.columns = ['leaf_' + str(i) for i in lgb_output_vec.columns]
lgb_output_vec
     leaf_0  leaf_1  leaf_2  leaf_3  leaf_4  leaf_5  leaf_6  leaf_7  leaf_8  leaf_9  ...  leaf_790  leaf_791  leaf_792  leaf_793  leaf_794  leaf_795  leaf_796  leaf_797  leaf_798  leaf_799
0         1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0
1         0       1       0       0       0       1       0       0       0       1  ...         0         1         0         0         1         0         0         0         0         1
2         1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         1         0         0         0         1         0
3         1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0
4         1       0       0       0       0       0       1       0       0       0  ...         1         0         0         0         1         0         0         0         1         0
..      ...     ...     ...     ...     ...     ...     ...     ...     ...     ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
995       1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         1         0         0         0         1         0
996       1       0       0       0       1       0       0       0       1       0  ...         0         0         1         0         0         0         1         0         0         0
997       1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0
998       0       0       0       1       0       1       0       0       0       1  ...         0         1         0         0         1         0         0         0         0         1
999       1       0       0       0       1       0       0       0       1       0  ...         1         0         0         0         0         1         0         0         1         0

1000 rows × 800 columns

y_pred[0] # the leaf each of the 200 trees assigns to the first test sample
array([0, 0, 0, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 0, 0, 3, 3, 3, 2, 0, 3, 0, 2, 2, 2, 3, 0, 2, 0, 2, 0,
       0, 0, 0, 3, 3, 3, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 3, 0, 3, 3, 0, 2, 0, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
       3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 3,
       0, 2, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
       3, 3, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2,
       3, 2])
len(y_pred) # 1000 test samples
1000
len(y_pred[0]) # 200 trees
200
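As an aside, the two per-sample loops above can be replaced with a vectorized one-hot encoding, e.g. via sklearn's OneHotEncoder. A sketch: fixing categories to range(num_leaves) for every tree guarantees that train and test produce the same 800 columns even if some leaf never appears in one of the sets.

from sklearn.preprocessing import OneHotEncoder

n_trees = lgb_param['n_estimators']
enc = OneHotEncoder(categories=[np.arange(lgb_param['num_leaves'])] * n_trees)
train_sparse = enc.fit_transform(model.predict(X_train, pred_leaf=True))  # (2333, 800), sparse
test_sparse = enc.transform(model.predict(X_test, pred_leaf=True))        # (1000, 800), sparse

The sparse matrices can be fed to LogisticRegression directly, which matters once the number of trees grows.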

6.3 LGB+LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit the LR on the leaf one-hot features
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(lgb_output_vec_train, y_train)

# Compute AUC: y_test holds the true labels, scores the predicted probability of class 1
scores = lr.predict_proba(lgb_output_vec)[:,1]
LR_LGB_auc = metrics.roc_auc_score(y_test, scores)
LR_LGB_auc
LogisticRegression(random_state=23)

0.58792613217832
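Before reading too much into this number, note that the LR now sees 800 sparse binary inputs rather than 5 raw features, so its regularization strength C is worth tuning. A hypothetical sketch (the grid values are assumptions for illustration; this was not run for the comparison below):

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(LogisticRegression(random_state=23, max_iter=1000),
                    {'C': [0.01, 0.1, 1, 10]}, scoring='roc_auc', cv=5)
grid.fit(lgb_output_vec_train, y_train)
print(grid.best_params_, grid.best_score_)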

7 Comparing the Results

df = pd.DataFrame({'model':['LR', 'LGB', 'LGB+LR'], 'AUC':[LR_auc, LGB_auc, LR_LGB_auc]})
df
    model       AUC
0      LR  0.583407
1     LGB  0.601793
2  LGB+LR  0.587926

Conclusion: on this particular dataset, LGB+LR does not beat LGB on its own, so no single model can be declared absolutely better; choose the best model for each data scenario. That said, in CTR-prediction settings LGB+LR generally performs quite well.

