Product Category Recommendation System: A LightGBM Model

Latest recommended article published on 2024-08-07 14:48:28

Goodsta

Tags: machine learning

Article link: https://blog.csdn.net/wong2016/article/details/89288531

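This article shows how to use a LightGBM model to build a personalized recommendation system for Elo, the largest credit card payment institution in Brazil. It processes historical transaction records and merchant information, performs feature engineering, and then trains a model to identify customer loyalty signals and interests. Finally, it examines the model's feature importances.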

Summary generated by CSDN using intelligent technology.

Machine Learning Training Camp: a free exchange space for machine learning enthusiasts (to join the group, contact QQ: 2279055353)

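Case Introduction

Elo is the largest credit card payment institution in Brazil. It has built partnerships with many Brazilian merchants and delivers their promotions and discount offers to cardholders. But do these promotions actually work for consumers or merchants? Do they catch consumers' interest? Do merchants gain repeat customers? Personalized recommendation is the key to achieving these goals. Elo has already built machine learning models to understand the most important aspects and preferences of customers' daily lives, from food to shopping. This case study develops an algorithm that uncovers signals of customer loyalty, in order to identify and serve the opportunities most relevant to each individual.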

Implementation: Python
Model: LightGBM

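Data Description

File Overview

train.csv: the training set
test.csv: the test set
historical_transactions.csv: up to 3 months of historical transactions for each card_id.
merchants.csv: merchant information
new_merchant_transactions.csv: all transactions each card_id made at new merchants over a 2-month period.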
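
The file list implies a one-to-many relationship: train.csv and test.csv hold one row per card_id, while the transaction files hold many rows per card_id. A minimal sketch of how such files are typically combined, on toy data, is shown here. Only card_id comes from the file descriptions; the "amount" column and all values are hypothetical, and this is not necessarily the feature set the article goes on to build.

```python
import pandas as pd

# Toy stand-ins for train.csv and historical_transactions.csv.
# Only card_id is taken from the file descriptions; "amount" is hypothetical.
train = pd.DataFrame({"card_id": ["C_1", "C_2", "C_3"]})
transactions = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2"],
    "amount": [10.0, 30.0, 5.0],
})

# Collapse the many transactions of each card into one row of summary features.
card_features = (
    transactions.groupby("card_id")["amount"]
    .agg(["count", "sum", "mean"])
    .add_prefix("hist_amount_")
    .reset_index()
)

# A left join keeps every card in train, even one with no transactions (C_3),
# whose aggregated features are then missing (NaN).
train_features = train.merge(card_features, on="card_id", how="left")
print(train_features)
```

The left join is a deliberate choice in this sketch: cards without any transaction history stay in the training table with missing feature values instead of being dropped.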

Loading the Data

Import the required libraries.

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import warnings
import time
import sys
import datetime
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 500)

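Because the dataset is fairly large, we save memory by storing each numeric variable in the smallest type that covers its actual range of values. We define a function, reduce_mem_usage, for this purpose.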

def reduce_mem_usage(df, verbose=True):
    """Downcast each numeric column to the smallest dtype that holds its values."""
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2  # size in MB before downcasting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Integer column: try the integer types from narrowest to widest.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Float column. Note that float16 keeps only about 3 significant
                # decimal digits, so values stored in it are rounded.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2  # size in MB after downcasting
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
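
The function above walks the dtype ladder by hand. For comparison, the following sketch does a similar downcast with pandas' built-in pd.to_numeric and its downcast argument. It is an alternative illustration on a toy frame, not the article's method, and it behaves differently in one respect: downcast="float" never goes below float32, whereas reduce_mem_usage may choose float16. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: these columns start out as int64 and float64.
df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),         # values fit in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,    # values need int32
    "value": np.arange(100) / 4.0,                       # float64
})
before = df.memory_usage(deep=True).sum()

for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        # Picks the smallest signed integer dtype that holds every value.
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        # Picks float32 when the values survive the conversion.
        df[col] = pd.to_numeric(df[col], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print("bytes before: {}, after: {}".format(before, after))
```

Staying at float32 trades some of the memory saving for precision, which may matter for monetary amounts.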

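We first load new_merchant_transactions.csv and historical_transactions.csv. The two files contain the same variables; they differ only in the time period they cover relative to a reference date. We also convert the logical features to numeric values.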
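
Before loading the full files, the two ideas in this step (parsing dates while reading and mapping logical values to numbers) can be tried on a tiny in-memory CSV. This is a sketch under assumptions: purchase_date is the column parsed in the loading code, while the "flag" column, its Y/N encoding, and all rows are hypothetical.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a transactions file.
# "flag" and its Y/N values are hypothetical, used only to show the mapping.
csv_text = """card_id,purchase_date,flag
C_1,2018-03-01 10:15:00,Y
C_1,2018-04-02 18:30:00,N
C_2,2018-03-15 09:00:00,Y
"""

# parse_dates turns the text column into datetime64 while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["purchase_date"])

# Map the logical Y/N values to 1/0 so a model can consume them.
df["flag"] = df["flag"].map({"Y": 1, "N": 0})

print(df.dtypes)
```
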

new_transactions = pd.read_csv('../input/new_merchant_transactions.csv',
                               parse_dates=['purchase_date'])

historical_transactions = pd.read_csv('../input/historical_transactions.csv',