First, note that LightFM is not simply an implementation of the FM algorithm. It is a recommendation framework that can exploit implicit feedback as well as user/item side information. Without user/item side information, it essentially reduces to collaborative filtering, incorporating the matrix-factorization idea of decomposing the user-item interaction matrix.
Features
- Implements the BPR and WARP ranking losses
- Multithreaded training
- Incorporates both item and user metadata, which helps with user/item cold start
Notes
1. Input data
According to the LightFM API, there are three main types of input data: interactions, user_features, and item_features, which are the standard inputs for a recommender system.
Note that the data must be supplied as a csr_matrix or coo_matrix, so some conversion is needed. The reason is that one-hot-encoded categorical variables are large and sparse, and sparse formats save memory.
- csr_matrix (compressed sparse row matrix)
- coo_matrix (sparse matrix in coordinate format)
import pandas as pd
from scipy.sparse import csr_matrix, coo_matrix

# Pivot the training data into a user x item matrix of summed labels
fm_data = pd.pivot_table(train_data,
                         index="user_id",
                         columns="yewu_id",
                         values="label", aggfunc="sum").fillna(0)
interactions = csr_matrix(fm_data)
# interactions
# <609885x28 sparse matrix of type '<class 'numpy.float64'>'
#  with 673172 stored elements in Compressed Sparse Row format>
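For comparison, a COO matrix can be built directly from (row, col, value) triplets and then converted to CSR; a minimal sketch with toy data (not the project data above):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interactions: user 0 liked item 2; user 1 liked items 0 and 2
rows = np.array([0, 1, 1])
cols = np.array([2, 0, 2])
vals = np.array([1.0, 1.0, 1.0])

interactions = coo_matrix((vals, (rows, cols)), shape=(3, 4))
print(interactions.shape)  # (3, 4)
print(interactions.nnz)    # 3 stored elements

# COO is convenient for construction; convert to CSR for fast row access
interactions_csr = interactions.tocsr()
```

COO is the natural format when interactions arrive as (user, item, value) records; CSR is better once the matrix is being consumed row by row.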
Besides scipy's interface, LightFM provides its own helper, lightfm.data.Dataset:
from lightfm.data import Dataset
dataset = Dataset()
dataset.fit((x[1]['user_id'] for x in train.iterrows()),
(x[1]['yewu_id'] for x in train.iterrows()))
num_users, num_items = dataset.interactions_shape()
print('Num users: {}, num_items {}.'.format(num_users, num_items))
# Num users: 609885, num_items 28.
(interactions, weights) = dataset.build_interactions(
((x[1]['user_id'], x[1]['yewu_id'])
for x in train.iterrows()))
Or, iterating over the underlying numpy array (note that .values is a property, not a method, and this assumes user_id and yewu_id are the first two columns):
(interactions, weights) = dataset.build_interactions(
    (x[0], x[1]) for x in train[['user_id', 'yewu_id']].values)
2. fit_partial()
Quoting the docstring:
Unlike fit, repeated calls to this method will cause training to resume from the current model state.
Here "resume" means continue from where training last stopped, i.e. incremental training; reading the source makes this clear.
# In the source, fit actually delegates to fit_partial
def fit(
    self,
    interactions,
    user_features=None,
    item_features=None,
    sample_weight=None,
    epochs=1,
    num_threads=1,
    verbose=False,
):
    self._reset_state()  # Discard old results, if any

    return self.fit_partial(
        interactions,
        user_features=user_features,
        item_features=item_features,
        sample_weight=sample_weight,
        epochs=epochs,
        num_threads=num_threads,
        verbose=verbose,
    )
As you can see, fit does call fit_partial, but it first calls _reset_state, which clears any previously learned parameters.
3. Item cold start
The official docs give a relevant example [4]: the test set contains 10% of the interactions, and it includes items that have no interactions at all in the training set.
import numpy as np
from lightfm.datasets import fetch_stackexchange
data = fetch_stackexchange('crossvalidated',
test_set_fraction=0.1,
indicator_features=False,
tag_features=True)
train = data['train']
test = data['test']
train.toarray().shape  # (3213, 72360)
test.toarray().shape   # (3213, 72360)
The test set and training set have the same shape, which means items with no interaction data are still placed in the training set. This is interesting: even if an item has no interactions at all, it must still occupy a column in the interaction matrix at training time.
So what does item cold start actually require? As [5] explains, item IDs alone are not enough; item-side features are needed:
item_features = data['item_features']
tag_labels = data['item_feature_labels']
print('There are %s distinct tags, with values like %s.' % (item_features.shape[1], tag_labels[:3].tolist()))
# There are 1246 distinct tags, with values like [u'bayesian', u'prior', u'elicitation'].
item_features.toarray().shape
# (72360, 1246)
As you can see, item_features must also match the item dimension at training time (one row per item).
What follows is the normal training flow:
from lightfm import LightFM
from lightfm.evaluation import auc_score

# Hyperparameters (ITEM_ALPHA, NUM_COMPONENTS, NUM_EPOCHS, NUM_THREADS)
# are defined earlier in the official example [4]
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)
# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                  item_features=item_features,
                  epochs=NUM_EPOCHS,
                  num_threads=NUM_THREADS)
test_auc = auc_score(model,
test,
train_interactions=train,
item_features=item_features,
num_threads=NUM_THREADS,
check_intersections=False).mean()
print('Hybrid test set AUC: %s' % test_auc)
# Hybrid test set AUC: 0.703039
4. User cold start and user_features
In my project, user-side information is rich but interaction data is extremely sparse; some users in the training set have zero interactions, so this is effectively a user cold-start problem. Following the item cold-start example above, two points need to be settled:
1. Even users with no interactions must be represented during training.
2. The official documentation [2] shows how to assemble data under "Building datasets", but its item_features has only a single feature, while my user_features has several. How to combine multiple features into user_features turned out to be a major pitfall; I eventually found the answer in [6].
feature_columns = ['user_id', 'yewu_id','term_brand','price_range', 'arpu_chrg_last3_avg', 'total_flow']
train, test = train_data[feature_columns], test_data[feature_columns]
# total flow & arpu_chrg_last3_avg bin cut
total_flow_b = [-1000, 1024, 3072, 5120, 10240, 20480, 30720, 50000, 100000, 30000000]
total_flow_l = [x for x in range(len(total_flow_b) - 1)]
train.loc[:, 'total_flow'] = pd.cut(train['total_flow'], bins=total_flow_b)
test.loc[:, 'total_flow'] = pd.cut(test['total_flow'], bins=total_flow_b)
arpu_b = [-10000, 30, 60, 80, 100, 120, 150, 200, 400, 100000]
arpu_l = [x for x in range(len(arpu_b) - 1)]
train.loc[:, 'arpu_chrg_last3_avg'] = pd.cut(train['arpu_chrg_last3_avg'], bins=arpu_b)
test.loc[:, 'arpu_chrg_last3_avg'] = pd.cut(test['arpu_chrg_last3_avg'], bins=arpu_b)
train.loc[:, 'price_range'] = train['price_range'].astype(int)
test.loc[:, 'price_range'] = test['price_range'].astype(int)
train_test = pd.concat([train, test], axis=0)
pairs = train_test[['user_id', 'yewu_id']].drop_duplicates()
user_features = train_test[['user_id', 'term_brand', 'price_range', 'arpu_chrg_last3_avg', 'total_flow']].drop_duplicates()
from lightfm.data import Dataset
dataset = Dataset()
dataset.fit(users=(x[1]['user_id'] for x in pairs.iterrows()),
items=(x[1]['yewu_id'] for x in pairs.iterrows())
)
(interactions, weights) = dataset.build_interactions(
((x[1]['user_id'], x[1]['yewu_id'])
for x in train.iterrows()))
# user_features_matrix generation
user_features_list = list()
for tag in ['term_brand', 'price_range', 'total_flow', 'arpu_chrg_last3_avg']:
user_features_list += list(user_features[tag].unique())
dataset.fit_partial(users=(x[1]['user_id'] for x in user_features.iterrows()),
user_features=user_features_list)
user_features_matrix = dataset.build_user_features([(x[0], list(x[1:]))
for x in user_features[['user_id', 'term_brand', 'total_flow', 'arpu_chrg_last3_avg']].values])
# model fit&validate
from lightfm import LightFM
model = LightFM(loss='bpr')
model.fit(interactions, user_features=user_features_matrix)
from lightfm.evaluation import auc_score
train_auc = auc_score(model,
interactions,
user_features=user_features_matrix).mean()
print(train_auc)
# 0.775
(test_interactions, test_weights) = dataset.build_interactions(((x[1]['user_id'], x[1]['yewu_id'])
for x in test.iterrows()))
test_auc = auc_score(model,
test_interactions,
train_interactions=interactions,
user_features=user_features_matrix).mean()
print(test_auc)
# 0.774
Earlier, building user_features_matrix kept raising an error telling me to fit first; it turned out my fit call was written incorrectly. After fixing it, everything ran fine. An AUC of 0.77 from just 4 features is quite decent.
Reference: