[lightfm] Recommendation Algorithms (Collaborative Filtering) in Practice: Video Preference Recommendation

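Types of Recommendation Algorithms

Recommendation algorithms are a broad family that covers many specific methods. They are classified as follows:

[Figure: classification of recommendation algorithms]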

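Collaborative filtering is currently the most mainstream family of recommendation algorithms. It comes in many variants and is widely used in industry. It needs little domain-specific knowledge: statistical machine learning on user-item interaction data alone can produce good recommendations. Its biggest advantage is that it is easy to implement and to integrate into a product. Most recommendation algorithms deployed in practice today are collaborative filtering algorithms.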
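To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a made-up rating matrix. It is only an illustration of the general principle (similar users like similar items); the data, the `cosine_similarity` helper, and the scoring rule are assumptions chosen for brevity, and this is not the factorization model that lightfm trains later in this post.

```python
import numpy as np

# Toy rating matrix (hypothetical data): rows are users, columns are items, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 5, 0],
    [0, 0, 4, 5],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = 0  # recommend for the first user
# Similarity of the target user to every user.
sims = np.array([cosine_similarity(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0  # ignore self-similarity

# Score each item as the similarity-weighted sum of the other users' ratings.
scores = sims @ ratings
scores[ratings[target] > 0] = -np.inf  # do not recommend items already rated

best_item = int(np.argmax(scores))
print(best_item)
```

User 0 overlaps only with user 1, so the item that user 1 liked and user 0 has not rated yet (column 2) is recommended. No item metadata is involved, which is why little domain knowledge is needed.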

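Goal

This experiment uses the open-source MovieLens dataset. Its structure is similar to that of real data in the telecom video domain, so it can be used to train a recommendation model for video preferences.

We train collaborative filtering models and use them to recommend movies to users.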

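Dataset

The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota between 1997 and 1998. It contains 100,000 ratings from 943 users on 1,682 movies. Each rating is a score from 1 to 5, and every user rated at least 20 movies. The lightfm loader returns the data as a dictionary. Basic user features include age, gender, occupation, and zip code.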
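The figures above imply that the user-item matrix is very sparse, which is why it is stored in a sparse format. A quick back-of-the-envelope check, using only the counts quoted in this section:

```python
# Counts quoted in the dataset description.
n_users, n_items, n_ratings = 943, 1682, 100_000

# Fraction of all possible user-movie pairs that actually have a rating.
density = n_ratings / (n_users * n_items)
print(f"{density:.2%}")
```

Only about 6% of the cells hold a rating, and the share drops further once low ratings are filtered out.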

Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from lightfm.datasets import fetch_movielens  # MovieLens 100k loader
from lightfm import LightFM
%matplotlib inline
# Load the data, keeping only ratings of 4 or higher
data = fetch_movielens(min_rating=4.0)
# Show the type and shape of each entry
for key, value in data.items():
    print(key, type(value), value.shape)

Output:

train <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
test <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
item_features <class 'scipy.sparse.csr.csr_matrix'> (1682, 1682)
item_feature_labels <class 'numpy.ndarray'> (1682,)
item_labels <class 'numpy.ndarray'> (1682,)
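# Show the first 100 columns of the first row of the test data
# The train and test elements are the most important: they contain the raw rating data, split into a train and a test set.
# Each row represents a user and each column represents a movie.
# Original ratings range from 1 to 5; with min_rating=4.0 only 4 and 5 remain, and 0 means no rating.
data['test'].todense()[:1, :100]  # todense() returns a numpy matrix; toarray() returns an ndarray

Output: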

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
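The difference between `todense()` and `toarray()` matters once you start indexing rows. The sketch below uses a small made-up COO matrix rather than the MovieLens data, so the values are purely illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical 2-user x 4-item interaction matrix in COO format.
rows = np.array([0, 0, 1])
cols = np.array([1, 3, 2])
vals = np.array([4, 5, 4])
toy = coo_matrix((vals, (rows, cols)), shape=(2, 4))

dense_matrix = toy.todense()  # numpy.matrix: every slice stays 2-D
dense_array = toy.toarray()   # plain numpy.ndarray

matrix_row = dense_matrix[0]  # still 2-D, shape (1, 4)
first_row = dense_array[0]    # 1-D, shape (4,)
print(matrix_row.shape, first_row.shape)
```

For most downstream numpy or pandas code the plain ndarray from `toarray()` is the more convenient choice.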
# Show the movie titles
print(data['item_labels'])
# print(data['item_features'])
# print(data['item_feature_labels'])

Output:

['Toy Story (1995)' 'GoldenEye (1995)' 'Four Rooms (1995)' ...
 'Sliding Doors (1998)' 'You So Crazy (1994)'
 'Scream of Stone (Schrei aus Stein) (1991)']
# Data exploration
# Convert the training matrix to a DataFrame
train_df = pd.DataFrame(data['train'].todense(), columns=data['item_labels'])  # rows: users; columns: movies
print(train_df.head())
# transpose() swaps rows and columns
train_df = train_df.transpose()  # rows: movies; columns: users
print('---------')
print(train_df.head())

Output:

   Toy Story (1995)  GoldenEye (1995)  Four Rooms (1995)  Get Shorty (1995)  \
0                 5                 0                  4                  0   
1                 4                 0                  0                  0   
2                 0                 0                  0                  0   
3                 0                 0                  0                  0   
4                 0                 0                  0                  0   

   Copycat (1995)  Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)  \
0               0                                                  5      
1               0                                                  0      
2               0                                                  0      
3               0                                                  0      
4               0                                                  0      

   Twelve Monkeys (1995)  Babe (1995)  Dead Man Walking (1995)  \
0                      4            0                        5   
1                      0            0                        0   
2                      0            0                        0   
3                      0            0                        0   
4                      0            0                        0   

   Richard III (1995)  ...  Mirage (1995)  Mamma Roma (1962)  \
0                   0  ...              0                  0   
1                   0  ...              0                  0   
2                   0  ...              0                  0   
3                   0  ...              0                  0   
4                   0  ...              0                  0   

   Sunchaser, The (1996)  War at Home, The (1996)  Sweet Nothing (1995)  \
0                      0                        0                     0   
1                      0                        0                     0   
2                      0                        0                     0   
3                      0                        0                     0   
4                      0                        0                     0   

   Mat' i syn (1997)  B. Monkey (1998)  Sliding Doors (1998)  \
0                  0                 0                     0   
1                  0                 0                     0   
2                  0                 0                     0   
3                  0                 0                     0   
4                  0                 0                     0   

   You So Crazy (1994)  Scream of Stone (Schrei aus Stein) (1991)  
0                    0                                          0  
1                    0                                          0  
2                    0                                          0  
3                    0                                          0  
4                    0                                          0  

[5 rows x 1682 columns]
---------
                   0    1    2    3    4    5    6    7    8    9    ...  933  \
Toy Story (1995)     5    4    0    0    0    4    0    0    0    4  ...    0   
GoldenEye (1995)     0    0    0    0    0    0    0    0    0    0  ...    4   
Four Rooms (1995)    4    0    0    0    0    0    0    0    0    0  ...    0   
Get Shorty (1995)    0    0    0    0    0    0    5    0    0    4  ...    5   
Copycat (1995)       0    0    0    0    0    0    0    0    0    0  ...    0   

                   934  935  936  937  938  939  940  941  942  
Toy Story (1995)     0    4    0    4    0    0    5    0    0  
GoldenEye (1995)     0    0    0    0    0    0    0    0    5  
Four Rooms (1995)    0    4    0    0    0    0    0    0    0  
Get Shorty (1995)    0    0    0    0    0    0    0    0    0  
Copycat (1995)       0    0    0    0    0    0    0    0    0  

[5 rows x 943 columns]
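# Show the movies with the most ratings
# For each movie (row), count how many users gave it a rating.
# A vectorized comparison replaces applymap(), which is deprecated in recent pandas versions.
item_count = (train_df > 0).sum(axis=1)
# Sort in descending order, take the top 20, then reverse so the largest bar is drawn at the top
item_count.sort_values(ascending=False)[0: 20][:: -1].plot.barh()
plt.xlabel('count of item')

Output:

[Figure: horizontal bar chart of the 20 movies with the most ratings]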
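The same per-movie count can be computed without building a dense DataFrame. The following sketch works on a small made-up matrix with invented titles; it only illustrates the counting and ranking step.

```python
import numpy as np

# Hypothetical 4-user x 3-item matrix of ratings (0 = no rating).
toy = np.array([
    [5, 0, 4],
    [4, 0, 0],
    [0, 0, 5],
    [4, 5, 4],
])
labels = np.array(["Movie A", "Movie B", "Movie C"])  # made-up titles

# Number of users who rated each item (one count per column).
item_count = np.count_nonzero(toy, axis=0)

# Item indices ordered by count, largest first; a stable sort keeps ties in column order.
order = np.argsort(-item_count, kind="stable")
for idx in order[:2]:
    print(labels[idx], item_count[idx])
```

Because `min_rating=4.0` was used when loading, a count here means "number of ratings of 4 or higher", not the total number of ratings a movie received.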

# Train the recommendation models
# Import the evaluation metrics
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score
# Get the train and test sets
train = data['train']
test = data['test']
# Train a model with the BPR loss (loss='bpr')
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, epochs=10)

train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Output:

Precision: train 0.49, test 0.07.
AUC: train 0.89, test 0.84.
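To clarify what the precision figure measures, here is a hand-rolled precision@k for a single user. The helper `precision_at_k_single` and the scores are hypothetical; this is a sketch of the metric's definition, not lightfm's implementation.

```python
import numpy as np

def precision_at_k_single(scores, positives, k):
    """Fraction of the k highest-scored items that are in `positives`."""
    top_k = np.argsort(-scores)[:k]
    hits = set(top_k.tolist()) & set(positives)
    return len(hits) / k

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # hypothetical model scores for 5 items
positives = {0, 3}                             # hypothetical held-out positives
p = precision_at_k_single(scores, positives, k=3)
print(p)
```

The top three items are 0, 2, and 4, and only item 0 is a positive, so the precision is 1/3. A user with few positives in the test set cannot reach a high precision@10, which may partly explain the gap between the train and test figures above. lightfm's `precision_at_k` also accepts a `train_interactions` argument that excludes training interactions from the ranked list before scoring the test set.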
# Train a model with the WARP loss (loss='warp')
# The WARP model, on the other hand, optimises for precision,
# so we should expect its performance to be better on precision.
model = LightFM(learning_rate=0.05, loss='warp')
model.fit(train, epochs=10)

train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()

# AUC stands for Area Under the ROC Curve
train_auc = auc_score(model, train).mean()  # AUC is a common metric for binary classifiers; a higher value usually means a better model
test_auc = auc_score(model, test).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Output:

Precision: train 0.49, test 0.08.
AUC: train 0.94, test 0.91.
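In a ranking setting, AUC can be read as the probability that a randomly chosen positive item is scored above a randomly chosen negative item. The sketch below computes it for one user by comparing all such pairs. The helper `auc_single` and the scores are hypothetical, and ties are simply counted as losses.

```python
import numpy as np

def auc_single(scores, positives):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos_mask = np.zeros(len(scores), dtype=bool)
    pos_mask[list(positives)] = True
    pos, neg = scores[pos_mask], scores[~pos_mask]
    # Compare every positive with every negative via broadcasting.
    wins = (pos[:, None] > neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # hypothetical model scores for 5 items
positives = [0, 3]                             # hypothetical positive items
auc = auc_single(scores, positives)
print(round(auc, 3))
```

Item 0 beats all three negatives and item 3 beats only one, giving 4 wins out of 6 pairs. Unlike precision@k, this metric looks at the whole ranking rather than just the top of the list, which is why the two metrics can move differently between BPR and WARP.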
# Retrain with the better-performing loss
# Create a model with the WARP (Weighted Approximate-Rank Pairwise) loss
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

Output:

<lightfm.lightfm.LightFM at 0x12b67b00>
# Recommend movies to users
# Define the recommendation function
def sample_recommendation(model, data, user_ids):
    n_users, n_movies = data['train'].shape
    # Convert to CSR once; CSR supports efficient row access
    train_csr = data['train'].tocsr()
    for user_id in user_ids:
        # Movies the user rated positively in the training set
        known_positive = data['item_labels'][train_csr[user_id].indices]
        # Predict a score for every movie for this user
        scores = model.predict(user_id, np.arange(n_movies))
        # Sort the movies by score, highest first
        top_movies = data['item_labels'][np.argsort(-scores)]
        print('User %s' % user_id)
        print(' Known positives:')
        for x in known_positive[0: 3]:
            print('     %s' % x)
        print(' Top movies:')
        for x in top_movies[0: 3]:
            print('     %s' % x)
sample_recommendation(model, data, [0, 10, 450])

Output:

User 0
 Known positives:
     Toy Story (1995)
     Four Rooms (1995)
     Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
 Top movies:
     Usual Suspects, The (1995)
     Star Wars (1977)
     Pulp Fiction (1994)
User 10
 Known positives:
     Babe (1995)
     Dead Man Walking (1995)
     Mr. Holland's Opus (1995)
 Top movies:
     Star Wars (1977)
     Fargo (1996)
     English Patient, The (1996)
User 450
 Known positives:
     Contact (1997)
     George of the Jungle (1997)
     Event Horizon (1997)
 Top movies:
     Air Force One (1997)
     Conspiracy Theory (1997)
     Kiss the Girls (1997)
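The recommendation function ranks every movie, including ones the user has already rated, so a top recommendation can be a movie the user already knows. One common refinement is to mask known items before ranking. The sketch below shows the masking step on made-up scores; `top_n_unseen` is a hypothetical helper, not part of lightfm.

```python
import numpy as np

def top_n_unseen(scores, known_items, n):
    """Return indices of the n best-scored items the user has not interacted with."""
    masked = scores.astype(float).copy()
    masked[list(known_items)] = -np.inf  # push known items to the bottom of the ranking
    return np.argsort(-masked)[:n]

scores = np.array([2.5, 0.3, 1.9, 3.1, 0.8])  # hypothetical predicted scores
known = [3, 0]                                 # items the user already rated
rec = top_n_unseen(scores, known, n=2)
print(rec)
```

In the function above, the known items for a user would be the `indices` of that user's row in the CSR training matrix, and `scores` would come from `model.predict`.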