LGB+LR in Practice

qq_27782503

Article link: https://blog.csdn.net/qq_27782503/article/details/109016913

Contents

1 Background
2 How it works
3 Preparing the data
- 3.1 Loading the data
- 3.2 Splitting into training and test sets
4 LR
5 LGB
6 LGB+LR
7 Comparing the results

1 Background

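Many readers have probably heard of the well-known GBDT+LR combination, which is often reported to predict better than a single model. I had never tried it myself, but a recent project at work happened to use it, so here is a quick write-up.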

2 How it works

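In short, a tree model (GBDT, XGBoost, LightGBM) is first fitted to the samples. Its output, namely the leaf each sample lands in for every tree, is then converted into standard one-hot variables and fed into LR, which makes the final prediction.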

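This is a binary classifier built on the stacking idea: GBDT extracts features from the training set to form the new training input, and LR is the classifier trained on that new input.
GBDT is well suited to discovering discriminative features and feature combinations, which reduces the manual effort of feature engineering, while LR is quick to train on the resulting features.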

A concrete demo follows: the tree model's output is converted into standard variables and fed into the model.

[Figure: demo of converting tree-model output into standard variables]
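As a minimal sketch of that conversion, assume a hypothetical ensemble of 3 trees with 4 leaves each, and a sample that lands in leaves 2, 0 and 3 of those trees. The function name `one_hot_leaves` and the numbers are illustrative assumptions, not values from the dataset used below.

```python
import numpy as np

def one_hot_leaves(leaf_idx, num_leaves):
    """Turn per-tree leaf indices into one concatenated one-hot vector.

    leaf_idx[t] is the index of the leaf the sample lands in for tree t.
    """
    leaf_idx = np.asarray(leaf_idx)
    n_trees = leaf_idx.shape[0]
    vec = np.zeros(n_trees * num_leaves, dtype=np.int64)
    # Tree t owns the slice [t * num_leaves, (t + 1) * num_leaves).
    vec[np.arange(n_trees) * num_leaves + leaf_idx] = 1
    return vec

# Hypothetical sample: leaf 2 of tree 0, leaf 0 of tree 1, leaf 3 of tree 2.
features = one_hot_leaves([2, 0, 3], num_leaves=4)
print(features)  # [0 0 1 0 1 0 0 0 0 0 0 1]
```

Each tree contributes exactly one 1, so the vector has as many ones as there are trees. This vector is what LR receives as input.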

Next, we use a concrete dataset to see how GBDT+LR performs and how it compares with the other models.

3 Preparing the data

3.1 Loading the data

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load the data
df = pd.read_csv('telecom_churn.csv')
df['churn'] = df['churn'].map(str)
churn_dic = {'True':1, 'False':0}
df['churn'] = df['churn'].map(churn_dic)
print(df.shape)
df.head()

(3333, 21)

	state	account length	area code	phone number	international plan	voice mail plan	number vmail messages	total day minutes	total day calls	total day charge	...	total eve calls	total eve charge	total night minutes	total night calls	total night charge	total intl minutes	total intl calls	total intl charge	customer service calls
0	KS	128	415	382-4657	no	yes	25	265.1	110	45.07	...	99	16.78	244.7	91	11.01	10.0	3	2.70	1
1	OH	107	415	371-7191	no	yes	26	161.6	123	27.47	...	103	16.62	254.4	103	11.45	13.7	3	3.70	1
2	NJ	137	415	358-1921	no	no	0	243.4	114	41.38	...	110	10.30	162.6	104	7.32	12.2	5	3.29	0
3	OH	84	408	375-9999	yes	no	0	299.4	71	50.90	...	88	5.26	196.9	89	8.86	6.6	7	1.78	2
4	OK	75	415	330-6626	yes	no	0	166.7	113	28.34	...	122	12.61	186.9	121	8.41	10.1	3	2.73	3

5 rows × 21 columns

df['churn'].value_counts()

0    2850
1     483
Name: churn, dtype: int64

3.2 Splitting into training and test sets

X = df[['total day calls', 'total night charge', 'number vmail messages', 'total intl charge', 'total eve calls']]
y = df['churn'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2333, 5) (1000, 5) (2333,) (1000,)

4 LR

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Baseline model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(X_train, y_train)

# Compute AUC
scores = lr.predict_proba(X_test)[:,1]
LR_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_auc

LogisticRegression(random_state=23)





0.5834069949026194
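For intuition about what `roc_auc_score` measures, the sketch below computes AUC directly as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, counting ties as one half. The function `pairwise_auc` and the toy labels and scores are hypothetical. This is an O(n²) illustration of the definition, not a description of how scikit-learn computes the value internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy data: 2 negatives and 2 positives, with one positive ranked below a negative.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to a random ranking, so the value of about 0.583 obtained above is only slightly better than chance.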

5 LGB

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's scikit-learn wrapper; it lets us use grid search and parallel processing just as with sklearn's GBM

# Build the model
model_lgb = lgb.LGBMClassifier(
                                 boosting_type='gbdt',
                                 objective = 'binary',
                                 metric = 'auc',
                                 verbose = 0,
                                 learning_rate = 0.01,
                                 num_leaves = 35,
                                 feature_fraction=0.8,
                                 bagging_fraction= 0.9,
                                 bagging_freq= 8,
                                 lambda_l1= 0.6,
                                 lambda_l2= 0
                               )

# Fit the model
model_lgb.fit(X_train, y_train)

# Compute AUC
scores = model_lgb.predict_proba(X_test)[:,1]
LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LGB_auc

[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000353 seconds.
You can set `force_col_wise=true` to remove the overhead.





0.601792922596423
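The warnings above appear because the model was configured with LightGBM's native parameter names, which take precedence over the scikit-learn style names of `LGBMClassifier`. The name pairs in the sketch below are read directly from those warning messages. The helper `to_sklearn_names` is a hypothetical convenience, not part of LightGBM. Passing the scikit-learn style names instead would presumably avoid the "will be ignored" warnings.

```python
# Native LightGBM name -> scikit-learn style name, as reported in the warnings.
ALIAS_TO_SKLEARN = {
    "feature_fraction": "colsample_bytree",
    "bagging_fraction": "subsample",
    "bagging_freq": "subsample_freq",
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
}

def to_sklearn_names(params):
    """Return a copy of params with native aliases renamed (hypothetical helper)."""
    return {ALIAS_TO_SKLEARN.get(name, name): value for name, value in params.items()}

# The tuning values used for the LGB model in this section.
native_params = {
    "learning_rate": 0.01,
    "num_leaves": 35,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.9,
    "bagging_freq": 8,
    "lambda_l1": 0.6,
    "lambda_l2": 0,
}
sklearn_params = to_sklearn_names(native_params)
print(sklearn_params)
```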

6 LGB+LR

6.1 LGB implementation

import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's scikit-learn wrapper; it lets us use grid search and parallel processing just as with sklearn's GBM

# Build the model
lgb_param = {'boosting_type': 'gbdt',
             'objective': 'binary',
             'metric': 'auc',
             'verbose': 0,
             'learning_rate': 0.01,
             'num_leaves': 4,
             'feature_fraction': 0.8,
             'bagging_fraction': 0.9,
             'bagging_freq': 8,
             'lambda_l1': 0.6,
             'lambda_l2': 0,
             'n_estimators': 200}

'''
num_leaves: the number of leaves per tree
n_estimators: the number of trees
- Each tree has 4 leaves; the default is 100 trees, but this example uses 200.
'''

# Reuse lgb_param so the model and the later one-hot step share the same settings
model = lgb.LGBMClassifier(**lgb_param)

# Fit the model
model.fit(X_train, y_train)

[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_col_wise=true` to remove the overhead.





LGBMClassifier(bagging_fraction=0.9, bagging_freq=8, feature_fraction=0.8,
               lambda_l1=0.6, lambda_l2=0, learning_rate=0.01, metric='auc',
               n_estimators=200, num_leaves=4, objective='binary', verbose=0)

model.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.01,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 200,
 'n_jobs': -1,
 'num_leaves': 4,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'metric': 'auc',
 'verbose': 0,
 'feature_fraction': 0.8,
 'bagging_fraction': 0.9,
 'bagging_freq': 8,
 'lambda_l1': 0.6,
 'lambda_l2': 0}
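With `num_leaves = 4` and `n_estimators = 200` as configured above, one-hot encoding the leaves gives 4 × 200 = 800 binary columns. The sketch below shows the column arithmetic that the next subsection relies on: tree `t` owns columns `4*t` to `4*t + 3`. The helper names `column_of` and `tree_and_leaf` are hypothetical and introduced only for illustration.

```python
NUM_LEAVES = 4       # leaves per tree, as in lgb_param
N_ESTIMATORS = 200   # number of trees, as in lgb_param

def column_of(tree, leaf, num_leaves=NUM_LEAVES):
    """Column index of a given leaf of a given tree in the one-hot matrix."""
    return tree * num_leaves + leaf

def tree_and_leaf(column, num_leaves=NUM_LEAVES):
    """Inverse mapping: which (tree, leaf) a column index refers to."""
    return divmod(column, num_leaves)

n_features = NUM_LEAVES * N_ESTIMATORS
print(n_features)         # 800
print(column_of(199, 3))  # 799, the last column
print(tree_and_leaf(6))   # (1, 2): column 6 is leaf 2 of the second tree
```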

6.2 Exporting the LGB leaf vectors

6.2.1 Training set

import numpy as np

y_pred = model.predict(X_train,pred_leaf=True)
# The prediction is the index of the leaf each sample lands in for every tree; with 'num_leaves': 4 the possible values are 0, 1, 2, 3
train_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(train_matrix.shape) # 2333 rows, 800 columns: 2333 samples, and 200 trees with 4 leaves each give 800 variables
train_matrix

(2333, 800)





array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

for i in range(len(y_pred)):
    # Loop over the samples: each tree owns a block of 4 columns, and we mark the column of the leaf the sample lands in
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1

lgb_output_vec_train = pd.DataFrame(train_matrix)
lgb_output_vec_train.columns = ['leaf_' + str(i) for i in lgb_output_vec_train.columns]
lgb_output_vec_train

	leaf_0	leaf_1	leaf_2	leaf_3	leaf_4	leaf_5	leaf_6	leaf_7	leaf_8	leaf_9	...	leaf_790	leaf_791	leaf_792	leaf_793	leaf_794	leaf_795	leaf_796	leaf_797	leaf_798	leaf_799
0	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	1	0	0	0	1	0
1	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	0	1	0	0	1	0
2	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	1	0	0	0	1	0
3	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	1	0	0	0	1	0
4	1	0	0	0	1	0	0	0	1	0	...	0	0	0	0	1	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2328	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0
2329	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0
2330	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	1	0	0	0	1	0
2331	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	1	0	0	0	1	0
2332	1	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	1	0	1	0	0

2333 rows × 800 columns
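The per-sample loop above can also be written without a Python loop. A minimal sketch, assuming `leaves` has the same layout as the `pred_leaf=True` output used here (one row per sample, one column per tree); the toy array and the function name `leaves_to_onehot` are hypothetical.

```python
import numpy as np

def leaves_to_onehot(leaves, num_leaves):
    """Vectorised one-hot encoding of a (n_samples, n_trees) leaf-index array."""
    leaves = np.asarray(leaves)
    n_samples, n_trees = leaves.shape
    matrix = np.zeros((n_samples, n_trees * num_leaves), dtype=np.int64)
    # Column of each (sample, tree) entry: block offset of the tree plus the leaf index.
    cols = np.arange(n_trees) * num_leaves + leaves
    # Row indices as a column vector so they broadcast against cols.
    rows = np.arange(n_samples)[:, None]
    matrix[rows, cols] = 1
    return matrix

# Toy input: 2 samples, 3 trees, 4 leaves per tree.
toy_leaves = np.array([[0, 2, 1],
                       [3, 3, 0]])
toy_matrix = leaves_to_onehot(toy_leaves, num_leaves=4)
print(toy_matrix)
```

The same function could be applied to both the training and the test leaf indices, which would avoid repeating the loop for each set.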

6.2.2 Test set

import numpy as np

y_pred = model.predict(X_test,pred_leaf=True)
# The prediction is the index of the leaf each sample lands in for every tree; with 'num_leaves': 4 the possible values are 0, 1, 2, 3
test_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(test_matrix.shape) # 1000 rows, 800 columns: 1000 samples, and 200 trees with 4 leaves each give 800 variables
test_matrix

(1000, 800)





array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

for i in range(len(y_pred)):
    # Loop over the samples: each tree owns a block of 4 columns, and we mark the column of the leaf the sample lands in
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    test_matrix[i][temp] += 1

lgb_output_vec = pd.DataFrame(test_matrix)
lgb_output_vec.columns = ['leaf_' + str(i) for i in lgb_output_vec.columns]
lgb_output_vec

	leaf_0	leaf_1	leaf_2	leaf_3	leaf_4	leaf_5	leaf_6	leaf_7	leaf_8	leaf_9	...	leaf_790	leaf_791	leaf_792	leaf_793	leaf_794	leaf_795	leaf_796	leaf_797	leaf_798	leaf_799
0	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0
1	0	1	0	0	0	1	0	0	0	1	...	0	1	0	0	1	0	0	0	0	1
2	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	1	0	0	0	1	0
3	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0
4	1	0	0	0	0	0	1	0	0	0	...	1	0	0	0	1	0	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	1	0	0	0	1	0
996	1	0	0	0	1	0	0	0	1	0	...	0	0	1	0	0	0	1	0	0	0
997	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0
998	0	0	0	1	0	1	0	0	0	1	...	0	1	0	0	1	0	0	0	0	1
999	1	0	0	0	1	0	0	0	1	0	...	1	0	0	0	0	1	0	0	1	0

1000 rows × 800 columns

y_pred[0] # Leaf index of the first sample in each of the 200 trees

array([0, 0, 0, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 0, 0, 3, 3, 3, 2, 0, 3, 0, 2, 2, 2, 3, 0, 2, 0, 2, 0,
       0, 0, 0, 3, 3, 3, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
       3, 3, 0, 3, 3, 0, 2, 0, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
       3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 3,
       0, 2, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
       3, 3, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2,
       3, 2])

len(y_pred) # 1000 samples

len(y_pred[0]) # 200 trees

6.3 LR+LGB

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Baseline model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(lgb_output_vec_train, y_train)

# Compute AUC
scores = lr.predict_proba(lgb_output_vec)[:,1]
LR_LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_LGB_auc

LogisticRegression(random_state=23)





0.58792613217832

7 Comparing the results

df = pd.DataFrame({'model':['LR', 'LGB', 'LGB+LR'], 'AUC':[LR_auc, LGB_auc, LR_LGB_auc]})
df

	model	AUC
0	LR	0.583407
1	LGB	0.601793
2	LGB+LR	0.587926

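Conclusion: in this case LGB+LR does not beat LGB alone, so no model can be called best in absolute terms; the right model should be chosen for the data and scenario at hand. Generally speaking, LGB+LR still performs well in CTR prediction.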
