[lightfm] Recommendation Algorithms (Collaborative Filtering) in Practice: Video Preference Recommendation

duanlianvip

License: CC 4.0 BY-SA

Categories: lightfm, machine learning. Tags: recommendation algorithms, collaborative filtering, lightfm

Article link: https://blog.csdn.net/duanlianvip/article/details/101195332

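Purpose

This experiment uses the open-source MovieLens dataset. Its structure is similar to that of real data in the telecom video domain, so it can be used to train a recommendation model for video preferences.

A collaborative filtering model is trained on this data and then used to recommend movies to each user.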

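Dataset

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota during 1997-1998. It contains 100,000 ratings from 943 users on 1,682 movies. Each user rated at least 20 movies, and every rating is an integer from 1 to 5. The loader used below returns the data as a dictionary. Simple user features include age, gender, occupation and zip code.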
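To get a feel for how sparse this rating matrix is, the counts quoted above can be plugged into a quick calculation. This is only a sketch based on the numbers in the description; the variable names are illustrative.

```python
# Counts quoted in the dataset description above
n_users = 943
n_items = 1682
n_ratings = 100_000

# Fraction of all user-movie pairs that actually carry a rating
density = n_ratings / (n_users * n_items)
print(f"density: {density:.2%}, sparsity: {1 - density:.2%}")
```

Only about 6% of the possible user-movie pairs are rated, which is why the interactions are handled as sparse matrices in the code that follows.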

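Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from lightfm.datasets import fetch_movielens  # MovieLens dataset loader
from lightfm import LightFM
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline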

# Load only the ratings of 4.0 or higher
data = fetch_movielens(min_rating=4.0)

# Print the type and shape of every entry in the returned dictionary
for key, value in data.items():
    print(key, type(value), value.shape)

Output:

train <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
test <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
item_features <class 'scipy.sparse.csr.csr_matrix'> (1682, 1682)
item_feature_labels <class 'numpy.ndarray'> (1682,)
item_labels <class 'numpy.ndarray'> (1682,)
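The `min_rating=4.0` argument used when loading the data can be pictured as a simple filter on (user, item, rating) triples. The sketch below uses a handful of made-up ratings rather than MovieLens data, and the helper name `filter_by_min_rating` is hypothetical; it is not part of LightFM.

```python
def filter_by_min_rating(triples, min_rating):
    """Keep only (user, item, rating) triples whose rating is >= min_rating."""
    return [(u, i, r) for (u, i, r) in triples if r >= min_rating]


# Made-up ratings, for illustration only
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 2), (2, 1, 1)]
kept = filter_by_min_rating(ratings, min_rating=4.0)
print(kept)
```

After such a filter, a zero in the interaction matrix can mean either "not rated" or "rated below the threshold", so the remaining entries behave like positive-only feedback.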

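# Show part of the test data.
# The train and test elements are the most important: they contain the raw rating data, split into a train and a test set.
# Each row is a user and each column is a movie (item).
# Ratings range from 1 to 5; with min_rating=4.0 only 4s and 5s are stored, and 0 means no stored rating.
# Show the first row and the first 100 columns of the test data.
data['test'].todense()[:1, :100]  # todense() returns a numpy matrix; toarray() returns an ndarray

Output: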

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
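The difference between `todense()` and `toarray()` mentioned in the comment above is easiest to see on a tiny matrix. This is a minimal sketch with made-up values, assuming SciPy's `coo_matrix` class (the same type reported for `train` and `test`).

```python
import numpy as np
from scipy.sparse import coo_matrix

# A tiny 2x3 sparse matrix holding two made-up ratings
rows = np.array([0, 1])
cols = np.array([2, 0])
vals = np.array([4, 5])
sparse = coo_matrix((vals, (rows, cols)), shape=(2, 3))

as_matrix = sparse.todense()  # numpy.matrix: slices always stay 2-D
as_array = sparse.toarray()   # numpy.ndarray: indexing a row gives a 1-D array

print(type(as_matrix).__name__, as_matrix[0].shape)
print(type(as_array).__name__, as_array[0].shape)
```

Because a `numpy.matrix` row keeps two dimensions, the output above is printed as `matrix([[...]])` rather than as a flat array.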

# Show the movie titles
print(data['item_labels'])
# print(data['item_features'])
# print(data['item_feature_labels'])

Output:

['Toy Story (1995)' 'GoldenEye (1995)' 'Four Rooms (1995)' ...
 'Sliding Doors (1998)' 'You So Crazy (1994)'
 'Scream of Stone (Schrei aus Stein) (1991)']

# Data exploration
# Convert to a DataFrame: one row per user, one column per movie
train_df = pd.DataFrame(data['train'].todense(), columns=data['item_labels'])
print(train_df.head())
# transpose() swaps rows and columns: one row per movie, one column per user
train_df = train_df.transpose()
print('---------')
print(train_df.head())

Output:

   Toy Story (1995)  GoldenEye (1995)  Four Rooms (1995)  Get Shorty (1995)  \
0                 5                 0                  4                  0   
1                 4                 0                  0                  0   
2                 0                 0                  0                  0   
3                 0                 0                  0                  0   
4                 0                 0                  0                  0   

   Copycat (1995)  Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)  \
0               0                                                  5      
1               0                                                  0      
2               0                                                  0      
3               0                                                  0      
4               0                                                  0      

   Twelve Monkeys (1995)  Babe (1995)  Dead Man Walking (1995)  \
0                      4            0                        5   
1                      0            0                        0   
2                      0            0                        0   
3                      0            0                        0   
4                      0            0                        0   

   Richard III (1995)  ...  Mirage (1995)  Mamma Roma (1962)  \
0                   0  ...              0                  0   
1                   0  ...              0                  0   
2                   0  ...              0                  0   
3                   0  ...              0                  0   
4                   0  ...              0                  0   

   Sunchaser, The (1996)  War at Home, The (1996)  Sweet Nothing (1995)  \
0                      0                        0                     0   
1                      0                        0                     0   
2                      0                        0                     0   
3                      0                        0                     0   
4                      0                        0                     0   

   Mat' i syn (1997)  B. Monkey (1998)  Sliding Doors (1998)  \
0                  0                 0                     0   
1                  0                 0                     0   
2                  0                 0                     0   
3                  0                 0                     0   
4                  0                 0                     0   

   You So Crazy (1994)  Scream of Stone (Schrei aus Stein) (1991)  
0                    0                                          0  
1                    0                                          0  
2                    0                                          0  
3                    0                                          0  
4                    0                                          0  

[5 rows x 1682 columns]
---------
                   0    1    2    3    4    5    6    7    8    9    ...  933  \
Toy Story (1995)     5    4    0    0    0    4    0    0    0    4  ...    0   
GoldenEye (1995)     0    0    0    0    0    0    0    0    0    0  ...    4   
Four Rooms (1995)    4    0    0    0    0    0    0    0    0    0  ...    0   
Get Shorty (1995)    0    0    0    0    0    0    5    0    0    4  ...    5   
Copycat (1995)       0    0    0    0    0    0    0    0    0    0  ...    0   

                   934  935  936  937  938  939  940  941  942  
Toy Story (1995)     0    4    0    4    0    0    5    0    0  
GoldenEye (1995)     0    0    0    0    0    0    0    0    5  
Four Rooms (1995)    0    4    0    0    0    0    0    0    0  
Get Shorty (1995)    0    0    0    0    0    0    0    0    0  
Copycat (1995)       0    0    0    0    0    0    0    0    0  

[5 rows x 943 columns]

# Show the movies with the most ratings
# For each movie (row), count how many users gave it a rating greater than 0
item_count = (train_df > 0).sum(axis=1)
# Sort in descending order, keep the top 20, then reverse so the largest bar is drawn at the top
item_count.sort_values(ascending=False)[0:20][::-1].plot.barh()
plt.xlabel('count of item')

Output (horizontal bar chart of the 20 most-rated movies; image not preserved):
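The per-movie counts above are computed on a dense DataFrame. As a sketch of an alternative that works directly on sparse data, the same counts can be derived from the column indices of a COO matrix. The arrays below are made-up stand-ins for the `.col` and `.data` attributes of such a matrix, not values from MovieLens.

```python
import numpy as np

# Made-up stand-ins for the .col and .data arrays of a COO interaction matrix
n_items = 5
col = np.array([0, 2, 0, 3, 0, 2])
vals = np.array([5, 4, 4, 5, 4, 5])

# Count the stored positive ratings in each item column
item_count = np.bincount(col[vals > 0], minlength=n_items)

# Indices of the three most-rated items, most-rated first
top = np.argsort(-item_count, kind="stable")[:3]
print(item_count, top)
```

This avoids building the full users-by-movies dense table, which matters more as the catalogue grows.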

# Train the recommendation model
# Import the evaluation metrics
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score

# Use the train/test split provided by fetch_movielens
train = data['train']
test = data['test']

# Train a model with the BPR loss (loss='bpr')
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, epochs=10)

train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Output:

Precision: train 0.49, test 0.07.
AUC: train 0.89, test 0.84.
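To make the precision figures easier to interpret, here is a minimal sketch of precision at k for a single user: the fraction of the k highest-scored items that are known positives. The function name `precision_at_k_single` and the scores are made up for illustration; this is not LightFM's implementation.

```python
import numpy as np


def precision_at_k_single(scores, positives, k):
    """Fraction of the k highest-scored items that are known positives."""
    top_k = np.argsort(-scores)[:k]          # indices of the k best scores
    hits = np.isin(top_k, list(positives)).sum()
    return hits / k


scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # made-up scores for 5 items
positives = {0, 3}                             # items the user actually liked
p = precision_at_k_single(scores, positives, k=3)
print(p)
```

Note that a user with fewer than k positives in the evaluated set can never reach a precision of 1.0, which is one plausible reason the test figure is much lower than the training figure.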

# Train a model with the WARP loss (loss='warp')
# The WARP model, on the other hand, optimises for precision,
# so we should expect it to perform better on precision.
model = LightFM(learning_rate=0.05, loss='warp')
# On a freshly created model, fit_partial starts from the initial state, just like fit
model.fit_partial(train, epochs=10)

train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()

# AUC stands for Area Under the ROC Curve.
# It is a common metric for evaluating binary classifiers; a higher AUC generally indicates a better model.
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Output:

Precision: train 0.49, test 0.08.
AUC: train 0.94, test 0.91.
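In a ranking setting, AUC can be read as the probability that a randomly chosen positive item is scored higher than a randomly chosen negative item. The sketch below computes that quantity by brute force for one user; `auc_single` and the scores are illustrative and are not taken from LightFM.

```python
def auc_single(scores, positives):
    """Share of (positive, negative) item pairs where the positive scores higher."""
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.1, 0.8, 0.3, 0.7]  # made-up scores for 5 items
positives = {0, 3}                  # items the user actually liked
auc = auc_single(scores, positives)
print(auc)
```

Unlike precision at k, this measure looks at the whole ranking, so it can be high even when few positives land in the top 10.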

# Retrain using the better-performing loss
# Create a model with the Weighted Approximate-Rank Pairwise (WARP) loss
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

<lightfm.lightfm.LightFM at 0x12b67b00>

# Recommend movies to users
# Define the recommendation function
def sample_recommendation(model, data, user_ids):
    n_users, n_movies = data['train'].shape
    # Convert once to CSR so that a single user's row can be sliced efficiently
    train_csr = data['train'].tocsr()
    for user_id in user_ids:
        # Find the movies this user rated positively in the training set
        known_positive = data['item_labels'][train_csr[user_id].indices]
        # Compute the recommendation score for every user-item pair
        scores = model.predict(user_id, np.arange(n_movies))
        # Sort the movies by score, highest first
        top_movies = data['item_labels'][np.argsort(-scores)]
        print('User %s' % user_id)
        print(' Known positives:')
        for x in known_positive[0: 3]:
            print('     %s' % x)
        print(' Top movies:')
        for x in top_movies[0: 3]:
            print('     %s' % x)

sample_recommendation(model, data, [0, 10, 450])

Output:

User 0
 Known positives:
     Toy Story (1995)
     Four Rooms (1995)
     Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
 Top movies:
     Usual Suspects, The (1995)
     Star Wars (1977)
     Pulp Fiction (1994)
User 10
 Known positives:
     Babe (1995)
     Dead Man Walking (1995)
     Mr. Holland's Opus (1995)
 Top movies:
     Star Wars (1977)
     Fargo (1996)
     English Patient, The (1996)
User 450
 Known positives:
     Contact (1997)
     George of the Jungle (1997)
     Event Horizon (1997)
 Top movies:
     Air Force One (1997)
     Conspiracy Theory (1997)
     Kiss the Girls (1997)
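The function above ranks every movie, including ones the user has already rated. One possible variant, sketched here with made-up scores, masks the known items before taking the top N. The helper name `top_n_unseen` is hypothetical and not part of LightFM.

```python
import numpy as np


def top_n_unseen(scores, known_items, n):
    """Return indices of the n best-scored items that are not in known_items."""
    masked = np.array(scores, dtype=float)  # copy so the caller's array is untouched
    masked[list(known_items)] = -np.inf     # push already-known items to the bottom
    return np.argsort(-masked)[:n]


scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # made-up scores for 5 items
known = {0, 4}                                 # items the user already rated
rec = top_n_unseen(scores, known, n=2)
print(rec)
```

With real data, `scores` would come from `model.predict` and `known` from the user's row of the training matrix, as in the function above.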