LightFM Recommender System Framework: Study Notes

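The first thing to note about LightFM is that it is not a plain implementation of the FM (factorization machine) algorithm. It is a recommender framework that can use implicit feedback together with user and item metadata. Without any user-side or item-side features, it reduces to collaborative filtering (CF) that borrows from matrix factorization (MF) the idea of factorizing the interaction matrix into user and item latent vectors.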

Features

  • Implements the BPR and WARP ranking losses
  • Multithreaded training
  • Incorporates both item and user metadata, which helps with user and item cold start

Notes on issues

1. Input data

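Judging from the LightFM API, there are three main kinds of data: interactions, user_features and item_features. These are also the standard inputs of any recommender system.
Note that the data must be in csr_matrix or coo_matrix format, so some conversion is needed. The reason is that one-hot encoded categorical variables are huge and sparse, and a sparse format saves memory.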

  • csr_matrix(compressed sparse row matrix)
  • coo_matrix(sparse matrix in coordinate format)
import pandas as pd
from scipy.sparse import csr_matrix, coo_matrix

fm_data = pd.pivot_table(train_data, 
                         index="user_id", 
                         columns="yewu_id", 
                         values='label', aggfunc="sum").fillna(0)
interactions = csr_matrix(fm_data)

# interactions
# <609885x28 sparse matrix of type '<class 'numpy.float64'>'
#	with 673172 stored elements in Compressed Sparse Row format>
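
The pivot table above builds a dense user-by-item frame before converting it. As a minimal sketch of an alternative, assuming the log is just (user, item, label) triples, the sparse matrix can be assembled directly from index arrays. The toy ids and labels below are made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy log of (user_id, item_id, label) rows.
user_ids = np.array(["u1", "u2", "u1", "u3", "u1"])
item_ids = np.array(["a", "b", "a", "c", "b"])
labels = np.array([1.0, 1.0, 1.0, 1.0, 1.0])

# Map raw ids to contiguous row/column indices.
users, row_idx = np.unique(user_ids, return_inverse=True)
items, col_idx = np.unique(item_ids, return_inverse=True)

# Duplicate (row, col) entries are summed when converting to CSR,
# which mirrors aggfunc="sum" in the pivot table.
interactions = coo_matrix(
    (labels, (row_idx, col_idx)),
    shape=(len(users), len(items)),
).tocsr()

print(interactions.shape)
print(interactions.toarray())
```

This never materializes the dense table, which matters once the item dimension grows well beyond the 28 columns seen here.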

Besides the scipy interface, LightFM ships its own helper for this, lightfm.data.Dataset:

from lightfm.data import Dataset

dataset = Dataset()
dataset.fit((x[1]['user_id'] for x in train.iterrows()),
            (x[1]['yewu_id'] for x in train.iterrows()))

num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))
# Num users: 609885, num_items 28.

(interactions, weights) = dataset.build_interactions(
                                ((x[1]['user_id'], x[1]['yewu_id'])
                                for x in train.iterrows()))

Or, iterating over the raw values instead of iterrows():
(interactions, weights) = dataset.build_interactions(
                           (x[0], x[1]) for x in train[['user_id', 'yewu_id']].values)

2. fit_partial()

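Start with the description from the source code:
Unlike fit, repeated calls to this method will cause training to resume from the current model state.
Here "resume" means picking up where the last call stopped, i.e. continued training. Reading the source directly makes this clear as well.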

## In the source, fit simply calls fit_partial
 def fit(
        self,
        interactions,
        user_features=None,
        item_features=None,
        sample_weight=None,
        epochs=1,
        num_threads=1,
        verbose=False,):
        
        self._reset_state() # Discard old results, if any
        
        return self.fit_partial(
            interactions,
            user_features=user_features,
            item_features=item_features,
            sample_weight=sample_weight,
            epochs=epochs,
            num_threads=num_threads,
            verbose=verbose,
        )

As shown, fit delegates to fit_partial, but first calls _reset_state, which discards any previously learned parameters.
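
The reset-then-delegate pattern can be sketched with a toy class. This is a stand-in for illustration only, not LightFM itself: ToyModel and its epochs_seen counter are hypothetical names, and the "training" is just a counter.

```python
class ToyModel:
    """Toy stand-in (not LightFM) that mimics the fit / fit_partial split."""

    def __init__(self):
        self._reset_state()

    def _reset_state(self):
        # Forget everything learned so far.
        self.epochs_seen = 0

    def fit_partial(self, epochs=1):
        # Continue from whatever state the model is currently in.
        self.epochs_seen += epochs
        return self

    def fit(self, epochs=1):
        # Start from scratch, then delegate to fit_partial.
        self._reset_state()
        return self.fit_partial(epochs=epochs)


model = ToyModel()
model.fit_partial(epochs=2)
model.fit_partial(epochs=3)
resumed = model.epochs_seen    # the second call continued from the first

model.fit(epochs=3)
restarted = model.epochs_seen  # fit discarded the earlier progress
```

In practice this suggests using fit_partial in a loop when you want to evaluate after every epoch, and fit when you want a clean run.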

3. Item cold start

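The official docs already provide an example for items [4].
In it, the test set holds 10% of the interactions, and it includes items that have no interactions at all in the training set.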

import numpy as np

from lightfm.datasets import fetch_stackexchange

data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)

train = data['train']
test = data['test']

train.toarray().shape
test.toarray().shape
# (3213, 72360)

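Both matrices have the same shape, which means items without any interactions are still included in the training matrix. This is an interesting point: even if an item has no interactions, it still needs a column in the interaction matrix at training time.
So what does item cold start actually require? According to [5], an ID alone is certainly not enough; what is needed is item-side features.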
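
To see why item-side features make cold start workable, here is a minimal numpy sketch of the idea that an item's vector is the sum of the embeddings of its features. The embeddings are random placeholders rather than trained values, bias terms are left out, and all names are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_tags, n_components = 4, 3

# Hypothetical "learned" tag embeddings, one row per tag.
tag_embeddings = rng.normal(size=(n_tags, n_components))

# Item-by-tag indicator matrix. Item 2 is cold: it has no training
# interactions, but it still carries tags 1 and 3.
item_features = csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
], dtype=np.float64))

# Each item's vector is the sum of the embeddings of its tags.
item_vectors = item_features @ tag_embeddings

# Scoring a user against all items is then a dot product.
user_vector = rng.normal(size=n_components)
scores = item_vectors @ user_vector
```

Because tags 1 and 3 also appear on items that do have interactions, their embeddings get trained, and the cold item inherits a usable vector from them.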

item_features = data['item_features']
tag_labels = data['item_feature_labels']

print('There are %s distinct tags, with values like %s.' % (item_features.shape[1], tag_labels[:3].tolist()))
# There are 1246 distinct tags, with values like [u'bayesian', u'prior', u'elicitation'].

item_features.toarray().shape
# (72360, 1246)

As shown, item_features has one row per item, matching the item dimension of the interaction matrix.
What follows is the usual training flow.

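from lightfm import LightFM
from lightfm.evaluation import auc_score

# Hyperparameters: assumed values, not given in these notes; see the official example [4]
NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance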
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                item_features=item_features,
                epochs=NUM_EPOCHS,
                num_threads=NUM_THREADS)
                
test_auc = auc_score(model,
                    test,
                    train_interactions=train,
                    item_features=item_features,
                    num_threads=NUM_THREADS,
                    check_intersections=False).mean()
print('Hybrid test set AUC: %s' % test_auc) 

# Hybrid test set AUC: 0.703039          
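
For intuition about the number above: AUC for one user is the probability that a randomly chosen positive item is scored higher than a randomly chosen negative one, and the reported figure is the mean over users. A rough sketch of the per-user computation follows; user_auc is a hypothetical helper, and counting ties as one half is my assumption, not something taken from the LightFM source.

```python
import numpy as np


def user_auc(scores, positives):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    scores = np.asarray(scores, dtype=float)
    pos_mask = np.zeros(len(scores), dtype=bool)
    pos_mask[list(positives)] = True
    pos, neg = scores[pos_mask], scores[~pos_mask]
    # Compare every positive with every negative; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Items 0 and 3 are the user's positives; item 3 is ranked last.
example_auc = user_auc([0.9, 0.2, 0.7, 0.1], positives=[0, 3])
perfect_auc = user_auc([0.9, 0.8, 0.1], positives=[0, 1])
```

A value of 0.5 is what random scoring would give, so 0.70 on cold-start items means the tag features carry real signal.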

4. User cold start and user_features

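In my project, user-side information is fairly rich but interactions are extremely sparse; some users in the training set have no interactions at all. This can be treated as a user cold-start problem. Based on the item cold-start example above, two points need to be settled:
1. Users to be scored at inference time must be represented in training even if they have no interactions.
2. The official docs [2] include a data-assembly example under "Building datasets", but there item_features holds only a single feature,
whereas my user_features have several. How to combine multiple features into user_features turned out to be a real pitfall; I eventually found the answer in [6].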

feature_columns = ['user_id', 'yewu_id','term_brand','price_range', 'arpu_chrg_last3_avg', 'total_flow']

# Copy the slices so the column assignments below do not hit SettingWithCopyWarning
train, test = train_data[feature_columns].copy(), test_data[feature_columns].copy()

# total flow & arpu_chrg_last3_avg bin cut
total_flow_b = [-1000, 1024, 3072, 5120, 10240, 20480, 30720, 50000, 100000, 30000000]
total_flow_l = [x for x in range(len(total_flow_b) - 1)]
train.loc[:, 'total_flow'] = pd.cut(train['total_flow'], bins=total_flow_b)
test.loc[:, 'total_flow'] = pd.cut(test['total_flow'], bins=total_flow_b)

arpu_b = [-10000, 30, 60, 80, 100, 120, 150, 200, 400, 100000]
arpu_l = [x for x in range(len(arpu_b) - 1)]
train.loc[:, 'arpu_chrg_last3_avg'] = pd.cut(train['arpu_chrg_last3_avg'], bins=arpu_b)
test.loc[:, 'arpu_chrg_last3_avg'] = pd.cut(test['arpu_chrg_last3_avg'], bins=arpu_b)

train.loc[:, 'price_range'] = train['price_range'].astype(int)
test.loc[:, 'price_range'] = test['price_range'].astype(int)

train_test = pd.concat([train, test], axis=0)

pairs = train_test[['user_id', 'yewu_id']].drop_duplicates()
user_features = train_test[['user_id', 'term_brand', 'price_range', 'arpu_chrg_last3_avg', 'total_flow']].drop_duplicates()

from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=(x[1]['user_id'] for x in pairs.iterrows()),
            items=(x[1]['yewu_id'] for x in pairs.iterrows())
           )
(interactions, weights) = dataset.build_interactions(
                                   ((x[1]['user_id'], x[1]['yewu_id'])
                                   for x in train.iterrows()))

# user_features_matrix generation
user_features_list = list()
for tag in ['term_brand', 'price_range', 'total_flow', 'arpu_chrg_last3_avg']:
    user_features_list += list(user_features[tag].unique())

dataset.fit_partial(users=(x[1]['user_id'] for x in user_features.iterrows()),
                   user_features=user_features_list)
                   
user_features_matrix = dataset.build_user_features([(x[0], list(x[1:]))
                                                   for x in user_features[['user_id', 'term_brand', 'price_range', 'total_flow', 'arpu_chrg_last3_avg']].values])

# model fit&validate
from lightfm import LightFM

model = LightFM(loss='bpr')
model.fit(interactions, user_features=user_features_matrix)

from lightfm.evaluation import auc_score

train_auc = auc_score(model,
                      interactions,
                      user_features=user_features_matrix).mean()

print(train_auc)
# 0.775

(test_interactions, test_weights) = dataset.build_interactions(((x[1]['user_id'], x[1]['yewu_id'])
                                                      for x in test.iterrows()))

test_auc = auc_score(model,
                     test_interactions,
                     train_interactions=interactions,
                     user_features=user_features_matrix).mean()

print(test_auc)
# 0.774

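Earlier, building user_features_matrix kept raising an error saying the dataset had to be fit first. It turned out the fit call was written incorrectly; once that was fixed, everything ran fine. Reaching an AUC of 0.77 with four features is quite good.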
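
One thing to watch when merging several columns into a single feature vocabulary is that equal values from different columns collapse into the same feature. A small sketch of one way around this, prefixing each value with its column name, is given below. The rows and the tag helper are hypothetical and only illustrate the shape of the data.

```python
# Hypothetical user rows: (user_id, term_brand, price_range, total_flow bin).
rows = [
    ("u1", "apple", 3, "(1024, 3072]"),
    ("u2", "huawei", 3, "(3072, 5120]"),
    ("u3", "apple", 5, "(1024, 3072]"),
]
columns = ["term_brand", "price_range", "total_flow"]


def tag(column, value):
    # Prefix with the column name so equal values in different columns differ.
    return "{}:{}".format(column, value)


# Vocabulary of all distinct feature names.
feature_vocab = sorted(
    {tag(c, v) for row in rows for c, v in zip(columns, row[1:])}
)

# (user_id, [feature, ...]) tuples, one per user.
user_feature_tuples = [
    (row[0], [tag(c, v) for c, v in zip(columns, row[1:])]) for row in rows
]
```

With a Dataset, feature_vocab would go to dataset.fit_partial(user_features=feature_vocab) and user_feature_tuples to dataset.build_user_features(user_feature_tuples).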

Reference:

  1. github
  2. documents
  3. Recommendation System in Python: LightFM
  4. item cold-start
  5. handling user item cold-start
  6. error build user features