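1 Background
Most readers have probably heard of the well-known GBDT+LR combination, which is often reported to predict better than either model alone. I had never tried it myself, but a recent project at work happened to use it, so here is a quick write-up.
2 How it works
In short, a tree model (GBDT, XGBoost, LightGBM) is first fitted on the samples; the tree model's output is then converted into standard variable form and fed into LR, which makes the final prediction.
- It is a binary classifier in the spirit of stacking: the GBDT extracts features from the training set to serve as new training input, and LR is the classifier trained on that new input.
- GBDT is well suited to discovering discriminative features and feature combinations, which reduces the manual cost of feature engineering. LR, in turn, is quick to implement and train.
Concretely, the leaf that a sample lands in within each tree is converted into standard 0/1 variables, and these variables are fed into the model.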
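To make the conversion step concrete, here is a minimal sketch in plain Python. The leaf indices are made up for illustration (3 samples, 2 trees, 4 leaves per tree), and the helper name `one_hot_leaves` is hypothetical, not part of any library.

```python
def one_hot_leaves(leaf_idx, num_leaves):
    """Turn per-tree leaf indices into one 0/1 row per sample."""
    rows = []
    for sample in leaf_idx:
        # One block of `num_leaves` columns per tree.
        row = [0] * (len(sample) * num_leaves)
        for tree, leaf in enumerate(sample):
            # Column = start of this tree's block + leaf index inside the tree.
            row[tree * num_leaves + leaf] = 1
        rows.append(row)
    return rows


# Made-up leaf indices: each inner list is one sample, each entry one tree.
leaf_idx = [[0, 3], [2, 1], [0, 1]]
encoded = one_hot_leaves(leaf_idx, num_leaves=4)
for row in encoded:
    print(row)
```

Each row contains exactly one 1 per tree, so a model with 200 trees of 4 leaves each yields 800 indicator columns, which is the shape that appears later in this post.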
Next, let's take a real dataset and see how GBDT+LR performs compared with the other models.
3 Data preparation
3.1 Reading the data
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Read the data
df = pd.read_csv('telecom_churn.csv')
df['churn'] = df['churn'].map(str)
churn_dic = {'True':1, 'False':0}
df['churn'] = df['churn'].map(churn_dic)
print(df.shape)
df.head()
(3333, 21)
 | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | 0 |
1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | 0 |
2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | 0 |
3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | 0 |
4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | 0 |
5 rows × 21 columns
df['churn'].value_counts()
0 2850
1 483
Name: churn, dtype: int64
3.2 Splitting into training and test sets
X = df[['total day calls', 'total night charge', 'number vmail messages', 'total intl charge', 'total eve calls']]
y = df['churn'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
random_state = 23)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(2333, 5) (1000, 5) (2333,) (1000,)
4 LR
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Initial model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(X_train, y_train)
# Compute AUC
scores = lr.predict_proba(X_test)[:,1]
LR_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_auc
LogisticRegression(random_state=23)
0.5834069949026194
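As a reminder of what the AUC figure means: it can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up samples; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores_demo)
print(auc)  # 0.75
```

On this reading, an AUC of about 0.58 means the model orders a churn/non-churn pair correctly only a little more often than a coin flip would.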
5 LGB
import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's scikit-learn wrapper; it lets us use Grid Search and parallel processing, as with GBM.
# Build the model
model_lgb = lgb.LGBMClassifier(
    boosting_type='gbdt',
    objective = 'binary',
    metric = 'auc',
    verbose = 0,
    learning_rate = 0.01,
    num_leaves = 35,
    feature_fraction=0.8,
    bagging_fraction= 0.9,
    bagging_freq= 8,
    lambda_l1= 0.6,
    lambda_l2= 0
)
# Fit the model
model_lgb.fit(X_train, y_train)
# Compute AUC
scores = model_lgb.predict_proba(X_test)[:,1]
LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LGB_auc
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000353 seconds.
You can set `force_col_wise=true` to remove the overhead.
0.601792922596423
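The alias warnings in the output above appear because native LightGBM parameter names were passed to the scikit-learn wrapper, which has its own names for the same settings. One way to tidy this up, sketched below, is to rename the keys before building the classifier. The mapping is read directly from the warning messages; the names `ALIASES` and `to_sklearn_names` are hypothetical helpers introduced here.

```python
# Native LightGBM name -> scikit-learn wrapper name, as reported in the warnings.
ALIASES = {
    'feature_fraction': 'colsample_bytree',
    'bagging_fraction': 'subsample',
    'bagging_freq': 'subsample_freq',
    'lambda_l1': 'reg_alpha',
    'lambda_l2': 'reg_lambda',
}


def to_sklearn_names(params):
    """Return a copy of `params` with native names replaced by wrapper names."""
    return {ALIASES.get(key, key): value for key, value in params.items()}


native_params = {
    'learning_rate': 0.01,
    'num_leaves': 35,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 8,
    'lambda_l1': 0.6,
    'lambda_l2': 0,
}
sk_params = to_sklearn_names(native_params)
print(sk_params)
```

The renamed dictionary could then be passed as `lgb.LGBMClassifier(**sk_params)`. The expectation is that this avoids the alias warnings while leaving the fitted model unchanged, but that has not been verified against the run shown in this post.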
6 LGB+LR
6.1 LGB implementation
import lightgbm as lgb
from lightgbm.sklearn import LGBMClassifier # lightgbm's scikit-learn wrapper; it lets us use Grid Search and parallel processing, as with GBM.
# Build the model
lgb_param = {'boosting_type':'gbdt',
             'objective' : 'binary',
             'metric' : 'auc',
             'verbose' : 0,
             'learning_rate' : 0.01,
             'num_leaves' : 4,
             'feature_fraction':0.8,
             'bagging_fraction': 0.9,
             'bagging_freq': 8,
             'lambda_l1': 0.6,
             'lambda_l2': 0,
             'n_estimators' : 200}
'''
num_leaves: the maximum number of leaves in one tree
n_estimators: the number of trees
- Each tree has 4 leaves. The default is 100 trees; this example uses 200.
'''
# Same settings as lgb_param above, passed in one go
model = lgb.LGBMClassifier(**lgb_param)
# Fit the model
model.fit(X_train, y_train)
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] lambda_l1 is set=0.6, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.6
[LightGBM] [Warning] bagging_fraction is set=0.9, subsample=1.0 will be ignored. Current value: bagging_fraction=0.9
[LightGBM] [Warning] lambda_l2 is set=0, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] Find whitespaces in feature_names, replace with underlines
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000282 seconds.
You can set `force_col_wise=true` to remove the overhead.
LGBMClassifier(bagging_fraction=0.9, bagging_freq=8, feature_fraction=0.8,
lambda_l1=0.6, lambda_l2=0, learning_rate=0.01, metric='auc',
n_estimators=200, num_leaves=4, objective='binary', verbose=0)
model.get_params()
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.01,
'max_depth': -1,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 200,
'n_jobs': -1,
'num_leaves': 4,
'objective': 'binary',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': True,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'metric': 'auc',
'verbose': 0,
'feature_fraction': 0.8,
'bagging_fraction': 0.9,
'bagging_freq': 8,
'lambda_l1': 0.6,
'lambda_l2': 0}
6.2 Exporting the LGB leaf vectors
6.2.1 Training set
import numpy as np
y_pred = model.predict(X_train,pred_leaf=True)
# Each prediction is the index of the leaf the sample ends up in, one per tree. With 'num_leaves': 4 the possible values are 0, 1, 2, 3.
train_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(train_matrix.shape) # 2333 rows x 800 columns: 2333 samples, 200 trees with 4 leaves each, hence 800 variables
train_matrix
(2333, 800)
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # Loop over the samples. Each tree owns a block of 4 columns, so the column is the block start plus the leaf index.
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    train_matrix[i][temp] += 1
lgb_output_vec_train = pd.DataFrame(train_matrix)
lgb_output_vec_train.columns = ['leaf_' + str(i) for i in lgb_output_vec_train.columns]
lgb_output_vec_train
 | leaf_0 | leaf_1 | leaf_2 | leaf_3 | leaf_4 | leaf_5 | leaf_6 | leaf_7 | leaf_8 | leaf_9 | ... | leaf_790 | leaf_791 | leaf_792 | leaf_793 | leaf_794 | leaf_795 | leaf_796 | leaf_797 | leaf_798 | leaf_799 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2328 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2329 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2330 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2331 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2332 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
2333 rows × 800 columns
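The per-sample loop above can also be written as a single NumPy indexing step. The sketch below is one possible vectorized version, checked against the loop on made-up leaf indices. The function name `leaves_to_one_hot` is hypothetical, and the code assumes every leaf index is smaller than `num_leaves`.

```python
import numpy as np


def leaves_to_one_hot(leaf_idx, num_leaves):
    """One-hot encode an (n_samples, n_trees) array of leaf indices."""
    leaf_idx = np.asarray(leaf_idx)
    n_samples, n_trees = leaf_idx.shape
    # Column for each (sample, tree): start of the tree's block + leaf index.
    cols = np.arange(n_trees) * num_leaves + leaf_idx
    out = np.zeros((n_samples, n_trees * num_leaves), dtype=np.int64)
    # Row indices broadcast against `cols`, setting one cell per tree per sample.
    out[np.arange(n_samples)[:, None], cols] = 1
    return out


# Made-up leaf indices: 2 samples, 3 trees, 4 leaves per tree.
demo = np.array([[0, 3, 1],
                 [2, 2, 0]])
vec = leaves_to_one_hot(demo, num_leaves=4)

# Reference: the same loop as used in this post.
loop = np.zeros_like(vec)
for i in range(len(demo)):
    loop[i][np.arange(demo.shape[1]) * 4 + demo[i]] += 1

print((vec == loop).all())
```

With the variables from this post, the call would be `leaves_to_one_hot(y_pred, lgb_param['num_leaves'])`, and the same function can be reused for the test set.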
6.2.2 Test set
import numpy as np
y_pred = model.predict(X_test,pred_leaf=True)
# Each prediction is the index of the leaf the sample ends up in, one per tree. With 'num_leaves': 4 the possible values are 0, 1, 2, 3.
test_matrix = np.zeros([len(y_pred), len(y_pred[0])*lgb_param['num_leaves']],dtype=np.int64)
print(test_matrix.shape) # 1000 rows x 800 columns: 1000 samples, 200 trees with 4 leaves each, hence 800 variables
test_matrix
(1000, 800)
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
for i in range(len(y_pred)):
    # Loop over the samples. Each tree owns a block of 4 columns, so the column is the block start plus the leaf index.
    temp = np.arange(len(y_pred[0]))*lgb_param['num_leaves'] + np.array(y_pred[i])
    test_matrix[i][temp] += 1
lgb_output_vec = pd.DataFrame(test_matrix)
lgb_output_vec.columns = ['leaf_' + str(i) for i in lgb_output_vec.columns]
lgb_output_vec
 | leaf_0 | leaf_1 | leaf_2 | leaf_3 | leaf_4 | leaf_5 | leaf_6 | leaf_7 | leaf_8 | leaf_9 | ... | leaf_790 | leaf_791 | leaf_792 | leaf_793 | leaf_794 | leaf_795 | leaf_796 | leaf_797 | leaf_798 | leaf_799 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
996 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
997 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
998 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
999 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1000 rows × 800 columns
y_pred[0] # the leaf the first sample lands in, for each of the 200 trees
array([0, 0, 0, 3, 3, 0, 3, 0, 3, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 3, 0, 0, 3, 3, 3, 2, 0, 3, 0, 2, 2, 2, 3, 0, 2, 0, 2, 0,
0, 0, 0, 3, 3, 3, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
3, 3, 0, 3, 3, 0, 2, 0, 2, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 3,
0, 2, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
3, 3, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 0, 3, 3, 0, 3, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 3, 2, 2, 2,
3, 2])
len(y_pred) # 1000 samples
1000
len(y_pred[0]) # 200 trees
200
6.3 LR+LGB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# Initial model
lr = LogisticRegression(random_state = 23)
print(lr)
lr.fit(lgb_output_vec_train, y_train)
# Compute AUC
scores = lr.predict_proba(lgb_output_vec)[:,1]
LR_LGB_auc = metrics.roc_auc_score(y_test, scores) # y_test: true labels; scores: predicted probability of class 1
LR_LGB_auc
LogisticRegression(random_state=23)
0.58792613217832
7 Comparing the results
df = pd.DataFrame({'model':['LR', 'LGB', 'LGB+LR'], 'AUC':[LR_auc, LGB_auc, LR_LGB_auc]})
df
 | model | AUC |
---|---|---|
0 | LR | 0.583407 |
1 | LGB | 0.601793 |
2 | LGB+LR | 0.587926 |
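Conclusion: in this case LGB+LR (AUC 0.588) does not do as well as LGB alone (AUC 0.602), so no single model can be called best in absolute terms; the model should be chosen to suit the data and the scenario. Generally speaking, LGB+LR still performs quite well in CTR prediction.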