Recommender System Learning -- GBDT+LR: Principles and Implementation

Series contents


GBDT+LR paper walkthrough and code implementation

Below I walk through my study of this paper while learning recommender systems, share some answers I found well written, and implement the corresponding algorithm.

I. Paper walkthrough and download

First, the paper download: I have uploaded the recommender-system papers to Baidu Netdisk; the link is 论文 and the extraction code is kzq4.
For a translation of this paper, see GBDT+LR论文翻译.

For a deeper reading of the paper, see GBDT+LR论文深入理解.

Once the theory is clear, we can move on to implementation; the code is walked through below.

II. Implementing GBDT+LR

1. Imports

Here we use LightGBM, which grows trees leaf-wise (vertically) rather than level-wise. For details on parameter tuning, see 如何对lightgbm调参.

import lightgbm as lgb
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

The dataset and accompanying code have also been uploaded to Baidu Netdisk; the link is 代码和数据集 and the extraction code is nbrx.

2. Data preprocessing


print('load data....')
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

NUMERIC_COLS = [
    "ps_reg_01", "ps_reg_02", "ps_reg_03",
    "ps_car_12", "ps_car_13", "ps_car_14", "ps_car_15"
]

print(df_test.head(10))

# training labels
y_train = df_train['target'].values
print(y_train)
# testing labels
y_test = df_test['target'].values
# training features
x_train = df_train[NUMERIC_COLS].values
print(x_train)
# testing features
x_test = df_test[NUMERIC_COLS].values

# create dataset for lightgbm
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_test, y_test, reference=lgb_train)

# number of leaves per tree; also used in the feature transformation below
num_leaf = 64
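Before training, it helps to see how a leaf index becomes a position in the one-hot feature vector: tree t's leaf l maps to slot t * num_leaf + l, so each tree owns its own block of num_leaf columns. A minimal sketch with made-up leaf indices for a single sample:

```python
import numpy as np

num_leaf = 64   # must match the 'num_leaves' used for training
num_trees = 3   # tiny made-up example (the post trains ~100 trees)

# pred_leaf=True returns, per sample, the index of the leaf reached in each tree.
# Made-up leaf indices for one sample across 3 trees:
leaf_idx = np.array([5, 0, 63])

# Each tree gets its own block of num_leaf slots; the reached leaf is set to 1.
one_hot = np.zeros(num_trees * num_leaf, dtype=np.int64)
one_hot[np.arange(num_trees) * num_leaf + leaf_idx] = 1

print(one_hot.sum())         # exactly one active slot per tree -> 3
print(np.where(one_hot)[0])  # [  5  64 191]
```

This is exactly the indexing the transformation loop below performs for every sample at once.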

3. Training the GBDT and transforming features


params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': {'binary_logloss'},
    'num_leaves': num_leaf,  # must match num_leaf used in the transformation
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}
print('Start training...')
# train; num_boost_round controls the number of trees
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=100,
                valid_sets=lgb_train)

print('Save model...')
# save model to file
gbm.save_model('model.txt')

print('Start predicting...')
# predict and get data on leaves, training data
y_pred = gbm.predict(x_train, pred_leaf=True)
print(y_pred.shape)
print(np.array(y_pred).shape)
print(y_pred[:10])

print('Writing transformed training data')
transformed_training_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf],
                                       dtype=np.int64)  # shape: N * (num_trees * num_leaf)
for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_training_matrix[i][temp] += 1

y_pred = gbm.predict(x_test, pred_leaf=True)
print('Writing transformed testing data')
transformed_testing_matrix = np.zeros([len(y_pred), len(y_pred[0]) * num_leaf], dtype=np.int64)
for i in range(0, len(y_pred)):
    temp = np.arange(len(y_pred[0])) * num_leaf + np.array(y_pred[i])
    transformed_testing_matrix[i][temp] += 1
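With roughly 100 trees and 64 leaves each, the dense transformed matrix has 6,400 columns per row, almost all of them zero. If memory becomes a problem, the same transformation can be built as a scipy sparse matrix instead (a sketch; `leaves_to_sparse` is a hypothetical helper, and sklearn's LogisticRegression accepts sparse input directly):

```python
import numpy as np
from scipy.sparse import csr_matrix

def leaves_to_sparse(leaf_preds, num_leaf):
    """Convert an (n_samples, n_trees) matrix of leaf indices (the
    pred_leaf=True output) into a sparse one-hot matrix of shape
    (n_samples, n_trees * num_leaf)."""
    leaf_preds = np.asarray(leaf_preds)
    n_samples, n_trees = leaf_preds.shape
    # Column index of the active slot for every (sample, tree) pair.
    cols = (np.arange(n_trees) * num_leaf + leaf_preds).ravel()
    rows = np.repeat(np.arange(n_samples), n_trees)
    data = np.ones(len(cols), dtype=np.int64)
    return csr_matrix((data, (rows, cols)),
                      shape=(n_samples, n_trees * num_leaf))

# Tiny made-up example: 2 samples, 2 trees, 4 leaves per tree.
X = leaves_to_sparse([[1, 3], [0, 2]], num_leaf=4)
print(X.toarray())
# [[0 1 0 0 0 0 0 1]
#  [1 0 0 0 0 0 1 0]]
```

The resulting matrix can be passed to `lm.fit` in place of `transformed_training_matrix` without other changes.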

lm = LogisticRegression(penalty='l2', C=0.05)  # logistic model construction

lm.fit(transformed_training_matrix, y_train)  # fit on the transformed features
y_pred_test = lm.predict_proba(transformed_testing_matrix)  # probability for each class
y_pred_ = lm.predict(transformed_testing_matrix)  # hard class predictions

4. Tuning the regularization strength


l1_train = []
l2_train = []

l1_test = []
l2_test = []
for i in np.linspace(0.05, 1.5, 20):
    lm1 = LogisticRegression(penalty='l1', solver="liblinear", C=i, max_iter=100)
    lm2 = LogisticRegression(penalty='l2', solver="liblinear", C=i, max_iter=100)

    l1_fit = lm1.fit(transformed_training_matrix, y_train)
    l1_train.append(accuracy_score(lm1.predict(transformed_training_matrix), y_train))
    l1_test.append(accuracy_score(lm1.predict(transformed_testing_matrix), y_test))

    l2_fit = lm2.fit(transformed_training_matrix, y_train)
    l2_train.append(accuracy_score(lm2.predict(transformed_training_matrix), y_train))
    l2_test.append(accuracy_score(lm2.predict(transformed_testing_matrix), y_test))

graph = [l1_train, l1_test, l2_train, l2_test]
color = ['green', 'gray', 'black', 'red']
label = ['l1_train', 'l1_test', 'l2_train', 'l2_test']

plt.figure(figsize=(6, 6))
for i in range(len(graph)):
    plt.plot(np.linspace(0.05, 1.5, 20), graph[i], color=color[i], label=label[i])

plt.legend(loc=0)
plt.show()
print(max(l1_test), l1_test.index(max(l1_test)))
print(accuracy_score(y_pred_, y_test))
# cross entropy, following the paper's formula; the paper assumes labels in
# {-1, +1}, so map the 0/1 labels first
yy = 2 * y_test - 1
NE = (-1) / len(y_pred_test) * sum(
    ((1 + yy) / 2 * np.log(y_pred_test[:, 1]) + (1 - yy) / 2 * np.log(1 - y_pred_test[:, 1])))
print("Normalized Cross Entropy " + str(NE))
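Note that the Normalized Entropy in the Facebook paper additionally divides this cross entropy by the entropy of the background CTR, so that a model always predicting the average rate scores exactly 1. A self-contained sketch with 0/1 labels (the function name and test data are my own, not from the original code):

```python
import numpy as np

def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE as defined in the GBDT+LR paper: average log loss divided by the
    entropy of the background CTR. Lower is better; 1.0 means the model is
    no better than always predicting the average rate."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    # standard cross entropy for 0/1 labels
    ce = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    p = y_true.mean()  # background CTR
    background = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return ce / background

# Made-up sanity check: predicting the base rate itself gives NE = 1.
y = np.array([1, 0, 0, 0])
print(normalized_entropy(y, np.full(4, y.mean())))  # 1.0
```

Applied to the code above, this would be `normalized_entropy(y_test, y_pred_test[:, 1])`.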

Summary

That covers today's content. Written from my own learning experience, this post gave a brief walkthrough of the full GBDT+LR pipeline; I hope it is helpful.