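LightFM is a powerful recommender-system framework that can handle user and item cold start. Earlier posts covered data preparation and the cold-start approach; this one fills in the underlying principle and the API.
1. Principle
- The model learns embedding representations for users and items from the interactions.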
- At prediction time it takes the dot product of the user and item representations, then adjusts the result with the user and item biases. The higher the score, the better the user and item match: $\hat{r}_{ui} = f(q_u \cdot p_i + b_u + b_i)$
- If user_features and item_features are supplied, the user and item representations are composites. As the documentation puts it: "The user and item representations are expressed in terms of representations of their features: an embedding is estimated for every feature, and these features are then summed together to arrive at representations for users and items." Note the phrase "every feature": each individual feature gets its own embedding vector.
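The scoring rule above can be sketched in a few lines of NumPy. This is a toy illustration, not LightFM's internal code: the array names, sizes and random values are made up, and f is assumed here to be a sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 4

# Hypothetical toy parameters: one embedding row and one bias per feature.
user_feature_emb = rng.normal(size=(3, n_components))  # 3 user features
item_feature_emb = rng.normal(size=(5, n_components))  # 5 item features
user_feature_bias = rng.normal(size=3)
item_feature_bias = rng.normal(size=5)

# Feature weights for one user and one item (one row of each feature matrix).
user_features = np.array([1.0, 0.0, 1.0])
item_features = np.array([0.0, 1.0, 0.0, 0.0, 1.0])


def represent(features, emb, bias):
    """Sum the embeddings and the biases of the active features."""
    return features @ emb, features @ bias


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


q_u, b_u = represent(user_features, user_feature_emb, user_feature_bias)
p_i, b_i = represent(item_features, item_feature_emb, item_feature_bias)

raw_score = q_u @ p_i + b_u + b_i  # dot product corrected by both biases
squashed = sigmoid(raw_score)      # f assumed to be a sigmoid in this sketch
```

Because the user has features 0 and 2 switched on, q_u is simply the sum of those two embedding rows, which is the "every feature has its own vector" point in practice.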
2. API
LightFM class
lightfm.LightFM(no_components=10, k=5, n=10, learning_schedule='adagrad', loss='logistic', learning_rate=0.05, rho=0.95, epsilon=1e-06, item_alpha=0.0, user_alpha=0.0, max_sampled=10, random_state=None)
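Parameters:
- no_components: dimensionality of the embeddings, default 10.
- k, n: control how samples are drawn in k-OS training (the 'warp-kos' loss); not needed for now.
- loss: the loss function, and the key choice here. The default is 'logistic'; for ranking problems 'bpr' or 'warp' are options, and 'warp' in particular tends to give a clear improvement in top-k precision.
- learning_rate: the initial learning rate for the adagrad learning schedule, default 0.05.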
- item_alpha (float, optional) – L2 penalty on item features. Tip: setting this number too high can slow down training. One good way to check is if the final weights in the embeddings turned out to be mostly zero. The same idea applies to the user_alpha parameter.
- user_alpha (float, optional) – L2 penalty on user features.
- max_sampled: maximum number of negative samples used during WARP fitting. default 10.
- random_state: default None
Choosing a loss:
- logistic: useful when both positive (1) and negative (-1) interactions are present.
- bpr: useful when only positive interactions are present and optimising ROC AUC is desired.
- warp: useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
fit method
fit(interactions, user_features=None, item_features=None, sample_weight=None, epochs=1, num_threads=1, verbose=False)
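- interactions: the user-item interaction matrix, of shape [n_users, n_items].
- user_features: the user feature matrix.
- item_features: the item feature matrix.
- sample_weight: a matrix of per-interaction weights, so that different behaviours can count differently, for example 1, 3 and 5 for click, favourite and purchase. This lets one model learn from several behaviour types at once.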
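As a rough sketch of the sample_weight idea, the snippet below turns a small event log into an interactions matrix and a matching weight matrix. The event log, the behaviour names and the "keep the strongest behaviour per pair" rule are assumptions made for illustration; the documentation does require the weight matrix to share its row and col arrays with the interactions matrix, which is why both are built from the same index arrays.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical event log: (user_index, item_index, behaviour).
events = [
    (0, 1, "click"),
    (0, 1, "purchase"),
    (1, 2, "favourite"),
    (2, 0, "click"),
]
behaviour_weight = {"click": 1.0, "favourite": 3.0, "purchase": 5.0}
n_users, n_items = 3, 4

# Keep the strongest behaviour per (user, item) so each pair appears once.
pair_weight = {}
for user, item, behaviour in events:
    w = behaviour_weight[behaviour]
    pair_weight[(user, item)] = max(w, pair_weight.get((user, item), 0.0))

pairs = sorted(pair_weight)
rows = np.array([u for u, _ in pairs], dtype=np.int32)
cols = np.array([i for _, i in pairs], dtype=np.int32)
weights = np.array([pair_weight[p] for p in pairs], dtype=np.float32)

# Both matrices are built from the same row/col index arrays.
interactions = coo_matrix(
    (np.ones_like(weights), (rows, cols)), shape=(n_users, n_items)
)
sample_weight = coo_matrix((weights, (rows, cols)), shape=(n_users, n_items))

# With a LightFM model in scope one would then call (not executed here):
# model.fit(interactions, sample_weight=sample_weight, epochs=10)
```

Taking the maximum per pair is only one possible policy; summing the weights of repeated events would be an equally reasonable choice.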
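get_item_representations
Returns the latent representation of the items. According to the documentation, the return value is a tuple (item biases, item embeddings) of NumPy arrays, with the embeddings having shape (n_items, no_components).
get_user_representations
Returns the latent representation of the users, likewise as a tuple (user biases, user embeddings), with the embeddings having shape (n_users, no_components).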
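One common use of these latent vectors is finding similar items. The sketch below assumes item_embeddings is the (n_items, no_components) array of item vectors; random values stand in for a trained model, and most_similar_items is a hypothetical helper, not part of LightFM.

```python
import numpy as np

# Stand-in for the item embedding array; random values, for illustration only.
rng = np.random.default_rng(42)
item_embeddings = rng.normal(size=(6, 10)).astype(np.float32)


def most_similar_items(embeddings, item_id, top_n=3):
    """Return indices of the top_n items closest to item_id by cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    sims = normalized @ normalized[item_id]
    sims[item_id] = -np.inf  # exclude the query item itself
    return np.argsort(-sims)[:top_n]


neighbours = most_similar_items(item_embeddings, item_id=0)
```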
predict
predict(user_ids, item_ids, item_features=None, user_features=None, num_threads=1)
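One thing to watch: the source asserts len(user_ids) == len(item_ids), so the two arrays must pair up element by element.
The result is an unbounded user-item score rather than a probability in [0, 1], so it cannot be used directly as a CTR or CVR prediction.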
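To score one user against every item, the two id arrays have to be built with equal length. A minimal sketch, where pairs_for_user is a hypothetical helper and the model call is left commented out because no fitted model is constructed here:

```python
import numpy as np


def pairs_for_user(user_id, n_items):
    """Build equal-length id arrays pairing one user with every item."""
    item_ids = np.arange(n_items, dtype=np.int32)
    user_ids = np.full(n_items, user_id, dtype=np.int32)
    return user_ids, item_ids


user_ids, item_ids = pairs_for_user(user_id=3, n_items=5)

# With a fitted LightFM model in scope (not constructed here):
# scores = model.predict(user_ids, item_ids)
# top_items = item_ids[np.argsort(-scores)]  # highest score first
```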
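predict_rank method
pred = model.predict_rank(test_interactions,
                          train_interactions=train_interactions)
Most of the predicted values come out as 0, which looks odd at first:
array([0., 0., 0., 0., 0., 0., 0., 0., 4., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
A look at the source shows that the ranks matrix is initialised to all zeros. Note also that, per the documentation, ranks are 0-based, with 0 meaning the top of the list, so a 0 is also what a test interaction ranked first would produce.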
ranks = sp.csr_matrix(
    (np.zeros_like(test_interactions.data),
     test_interactions.indices,
     test_interactions.indptr,
    ),
    shape=test_interactions.shape,
)
# ranks starts out with every entry set to 0
lightfm_data = self._get_lightfm_data()
predict_ranks(
    CSRMatrix(item_features),
    CSRMatrix(user_features),
    CSRMatrix(test_interactions),
    CSRMatrix(train_interactions),
    ranks.data,
    lightfm_data,
    num_thread
The documentation notes: "Performs best when only a handful of interactions need to be evaluated per user. If you need to compute predictions for many items for every user, use the predict method instead." In other words, for evaluation over the full item set the official docs recommend predict.
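To make the meaning of a rank concrete, here is a simplified NumPy sketch of a 0-based rank computed from one user's scores. It is an illustration of the idea only: the scores are made up, and the library's own handling of ties and of train_interactions is not reproduced.

```python
import numpy as np

# Stand-in scores for one user over 5 items (in practice, from model.predict).
scores = np.array([0.2, 1.7, -0.4, 0.9, 1.1], dtype=np.float32)


def zero_based_rank(scores, item_id):
    """Count the items that score strictly higher than item_id."""
    return int(np.sum(scores > scores[item_id]))


best = zero_based_rank(scores, 1)    # item 1 has the highest score
worst = zero_based_rank(scores, 2)   # item 2 has the lowest score
middle = zero_based_rank(scores, 3)  # two items score above item 3

# A 0-based rank converts to a reciprocal rank as 1 / (rank + 1).
reciprocal = 1.0 / (middle + 1)
```

Under this convention a rank of 0 is the best possible result, so a vector of mostly zeros means most test items were ranked first among the candidates considered.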
Saving the model
The model can be saved and loaded with pickle:
import pickle

# Save the fitted model to disk
with open('savefile.pickle', 'wb') as fle:
    pickle.dump(model, fle, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back
with open('savefile.pickle', 'rb') as fle:
    model_loaded = pickle.load(fle)

# The reloaded model is used exactly like the original
test_rank = model_loaded.predict_rank(
    test_interactions,
    train_interactions=interactions,
    user_features=user_features_matrix)