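Types of Recommendation Algorithms
Recommendation algorithms form a broad family that covers many specific methods. This article focuses on one category, collaborative filtering.
Collaborative filtering is currently the most mainstream category of recommendation algorithms. It comes in many variants and is widely used in industry. It requires little domain-specific knowledge and can achieve good recommendation quality with statistics-based machine learning on user-item interaction data. Its biggest advantage is that it is easy to engineer and to integrate into products. Most recommendation algorithms deployed in practice today are collaborative filtering algorithms.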
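To make the idea concrete, here is a minimal sketch of one simple variant, user-based collaborative filtering with cosine similarity. The toy rating matrix and the helper names `cosine_similarity` and `recommend` are assumptions made for illustration; this is not the factorization model that LightFM trains later in the article.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# The values are invented for illustration.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 5, 0],
    [0, 0, 4, 5],
], dtype=float)

def cosine_similarity(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    normalized = matrix / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

def recommend(ratings, user, top_n=1):
    """Rank unseen items for `user` by similarity-weighted ratings of other users."""
    sims = cosine_similarity(ratings)[user].copy()
    sims[user] = 0.0                      # ignore the user's own row
    scores = sims @ ratings               # weighted sum of the other users' ratings
    scores[ratings[user] > 0] = -np.inf   # never recommend already-rated items
    return np.argsort(-scores)[:top_n]

print(recommend(ratings, user=0))
```

User 0 shares two highly rated items with user 1, so the item that user 1 liked and user 0 has not seen is ranked first. No item metadata is needed, which is the "little domain knowledge" property described above.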
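Purpose
This experiment uses the open-source MovieLens dataset. Its structure resembles that of real data in the telecom video domain, so it can be used to train a recommendation model for video preferences.
A collaborative filtering model is trained and then used to recommend movies to users.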
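Dataset
The MovieLens dataset was collected by the GroupLens Research Project at the University of Minnesota during 1997-1998. It contains 100,000 ratings (on a scale of 1 to 5) from 943 users on 1,682 movies, and every user rated at least 20 movies. The loader used in the code below returns the data as a dictionary. Simple user attributes include age, gender, occupation, and zip code.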
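As a rough illustration of how such rating records map onto a user-item matrix, the sketch below builds a small dense matrix from invented `(user_id, item_id, rating)` triples. The helper `build_interactions` and its inclusive `min_rating` threshold are assumptions for this example. The real loader returns sparse matrices instead of a dense array.

```python
import numpy as np

# Hypothetical (user_id, item_id, rating) triples with 0-based IDs.
triples = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 2), (2, 2, 5)]

def build_interactions(triples, n_users, n_items, min_rating=0):
    """Return a dense user-item matrix, keeping only ratings >= min_rating."""
    matrix = np.zeros((n_users, n_items), dtype=np.int32)
    for user, item, rating in triples:
        if rating >= min_rating:
            matrix[user, item] = rating
    return matrix

interactions = build_interactions(triples, n_users=3, n_items=3, min_rating=4)
print(interactions)
```

With a threshold of 4, the ratings of 3 and 2 are dropped and show up as zeros, which is why most cells in a matrix like this are empty.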
Code Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lightfm.datasets import fetch_movielens  # load the dataset
from lightfm import LightFM
%matplotlib inline
# Load only ratings of 4 or higher
data = fetch_movielens(min_rating=4.0)
# Show the type and shape of each entry
for key, value in data.items():
    print(key, type(value), value.shape)
Output:
train <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
test <class 'scipy.sparse.coo.coo_matrix'> (943, 1682)
item_features <class 'scipy.sparse.csr.csr_matrix'> (1682, 1682)
item_feature_labels <class 'numpy.ndarray'> (1682,)
item_labels <class 'numpy.ndarray'> (1682,)
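# The train and test elements are the most important: they contain the raw rating data, split into a train and a test set.
# Each row represents a user and each column represents a movie.
# Original ratings range from 1 to 5; with min_rating=4.0 only ratings of 4 and 5 remain.
# Show the first row and the first 100 columns of the test data
data['test'].todense()[:1, :100]  # todense() returns a matrix, toarray() returns an ndarray
Output: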
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
# Show the movie titles
print(data['item_labels'])
# print(data['item_features'])
# print(data['item_feature_labels'])
Output:
['Toy Story (1995)' 'GoldenEye (1995)' 'Four Rooms (1995)' ...
'Sliding Doors (1998)' 'You So Crazy (1994)'
'Scream of Stone (Schrei aus Stein) (1991)']
# Data exploration
# Convert the training matrix to a DataFrame
train_df = pd.DataFrame(data['train'].todense(), columns=data['item_labels'])  # rows: users; columns: movies
print(train_df.head())
# transpose() swaps rows and columns
train_df = train_df.transpose()  # rows: movies; columns: users
print('---------')
print(train_df.head())
Output:
Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) \
0 5 0 4 0
1 4 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
Copycat (1995) Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) \
0 0 5
1 0 0
2 0 0
3 0 0
4 0 0
Twelve Monkeys (1995) Babe (1995) Dead Man Walking (1995) \
0 4 0 5
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Richard III (1995) ... Mirage (1995) Mamma Roma (1962) \
0 0 ... 0 0
1 0 ... 0 0
2 0 ... 0 0
3 0 ... 0 0
4 0 ... 0 0
Sunchaser, The (1996) War at Home, The (1996) Sweet Nothing (1995) \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Mat' i syn (1997) B. Monkey (1998) Sliding Doors (1998) \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
You So Crazy (1994) Scream of Stone (Schrei aus Stein) (1991)
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
[5 rows x 1682 columns]
---------
0 1 2 3 4 5 6 7 8 9 ... 933 \
Toy Story (1995) 5 4 0 0 0 4 0 0 0 4 ... 0
GoldenEye (1995) 0 0 0 0 0 0 0 0 0 0 ... 4
Four Rooms (1995) 4 0 0 0 0 0 0 0 0 0 ... 0
Get Shorty (1995) 0 0 0 0 0 0 5 0 0 4 ... 5
Copycat (1995) 0 0 0 0 0 0 0 0 0 0 ... 0
934 935 936 937 938 939 940 941 942
Toy Story (1995) 0 4 0 4 0 0 5 0 0
GoldenEye (1995) 0 0 0 0 0 0 0 0 5
Four Rooms (1995) 0 4 0 0 0 0 0 0 0
Get Shorty (1995) 0 0 0 0 0 0 0 0 0
Copycat (1995) 0 0 0 0 0 0 0 0 0
[5 rows x 943 columns]
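# Show the movies with the most ratings
def count_label(x):
    # True if the cell holds a rating (non-zero)
    return x > 0
# applymap(func) is a DataFrame method that applies func to every element
# (deprecated since pandas 2.1 in favor of DataFrame.map)
item_count = train_df.applymap(count_label).sum(axis=1)
# Sort in descending order, take the top 20, and draw a horizontal bar chart
item_count.sort_values(ascending=False)[0: 20][:: -1].plot.barh()
plt.xlabel('count of item')
Output: (horizontal bar chart of the 20 most-rated movies; x-axis: count of item)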
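The element-wise counting above can also be written as a vectorized comparison. The sketch below shows the idea with plain NumPy on an invented item-by-user matrix; the titles in `labels` are hypothetical placeholders, not MovieLens entries.

```python
import numpy as np

# Toy item-by-user matrix (rows: movies, columns: users); values are invented.
item_user = np.array([
    [5, 4, 0, 4],
    [0, 0, 0, 5],
    [4, 0, 4, 0],
])
labels = np.array(['Movie A', 'Movie B', 'Movie C'])  # hypothetical titles

# A boolean mask marks every rated cell; summing each row counts ratings per movie.
counts = (item_user > 0).sum(axis=1)

# Order the movies from most to least rated.
order = np.argsort(-counts, kind='stable')
for title, count in zip(labels[order], counts[order]):
    print(title, count)
```

The same comparison works on a DataFrame, so `(train_df > 0).sum(axis=1)` would be one way to avoid the per-element function call.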
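# Train the recommendation model
# Import the evaluation metrics
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score
# Get the train and test sets (the data is already split)
train = data['train']
test = data['test']
# Train a model with the BPR loss (loss='bpr')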
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, epochs=10)
train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()
print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
Output:
Precision: train 0.49, test 0.07.
AUC: train 0.89, test 0.84.
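For intuition about what the BPR (Bayesian Personalised Ranking) loss optimises, the sketch below computes the pairwise loss for a single (user, positive item, negative item) triplet with dot-product scores. The embedding vectors are invented and the function `bpr_loss` is a hypothetical helper; this illustrates the textbook formula, not LightFM's internal implementation.

```python
import numpy as np

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """BPR loss for one (user, positive item, negative item) triplet."""
    diff = user_vec @ pos_item_vec - user_vec @ neg_item_vec
    # -log(sigmoid(diff)) written as log(1 + exp(-diff)) for numerical stability
    return np.logaddexp(0.0, -diff)

user = np.array([1.0, 0.5])
liked = np.array([0.9, 0.8])     # item the user interacted with
unseen = np.array([-0.2, 0.1])   # sampled item without an interaction

good = bpr_loss(user, liked, unseen)   # positive scored above negative: small loss
bad = bpr_loss(user, unseen, liked)    # order reversed: larger loss
print(good, bad)
```

The loss shrinks as the positive item is scored further above the negative one, so minimising it pushes interacted items up the ranking for each user.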
# Train a model with the WARP loss (loss='warp')
# The WARP model, on the other hand, optimises for precision,
# so we should expect its performance to be better on precision.
model = LightFM(learning_rate=0.05, loss='warp')
model.fit(train, epochs=10)
train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10).mean()
# AUC stands for Area Under the ROC Curve.
# It is a common metric for evaluating binary classifiers; a higher AUC usually indicates a better model.
train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test).mean()
print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))
Output:
Precision: train 0.49, test 0.08.
AUC: train 0.94, test 0.91.
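The two metrics reported above can be illustrated for a single user with a few lines of NumPy. The scores and positive items below are invented, and `precision_at_k_single` and `auc_single` are hypothetical helpers written for this sketch; the LightFM functions compute comparable per-user values, which the code above averages with `.mean()`.

```python
import numpy as np

def precision_at_k_single(scores, positives, k):
    """Fraction of the k highest-scored items that are known positives."""
    top_k = np.argsort(-scores)[:k]
    return np.isin(top_k, positives).mean()

def auc_single(scores, positives):
    """Probability that a random positive item outscores a random negative item."""
    mask = np.zeros(len(scores), dtype=bool)
    mask[positives] = True
    pos, neg = scores[mask], scores[~mask]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.3])  # hypothetical model scores for 5 items
positives = np.array([0, 3])                   # items the user actually liked

print(precision_at_k_single(scores, positives, k=2))
print(auc_single(scores, positives))
```

Precision at k only looks at the top of the list, while AUC looks at the whole ranking, which is one reason a model can have a high AUC and still a low precision on the test set.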
# Retrain with the better-performing loss
# Create a model with the WARP (Weighted Approximate-Rank Pairwise) loss
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)
<lightfm.lightfm.LightFM at 0x12b67b00>
# Recommend movies to users
# Define the recommendation function
def sample_recommendation(model, data, user_ids):
    n_users, n_movies = data['train'].shape
    for user_id in user_ids:
        # find the positively rated items for user_id
        known_positive = data['item_labels'][data['train'].tocsr()[user_id].indices]
        # compute the recommendation scores for all user-item pairs
        scores = model.predict(user_id, np.arange(n_movies))
        # sort the movies by score, highest first
        top_movies = data['item_labels'][np.argsort(-scores)]
        print('User %s' % user_id)
        print(' Known positives:')
        for x in known_positive[0: 3]:
            print(' %s' % x)
        print(' Top movies:')
        for x in top_movies[0: 3]:
            print(' %s' % x)
sample_recommendation(model, data, [0, 10, 450])
Output:
User 0
Known positives:
Toy Story (1995)
Four Rooms (1995)
Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
Top movies:
Usual Suspects, The (1995)
Star Wars (1977)
Pulp Fiction (1994)
User 10
Known positives:
Babe (1995)
Dead Man Walking (1995)
Mr. Holland's Opus (1995)
Top movies:
Star Wars (1977)
Fargo (1996)
English Patient, The (1996)
User 450
Known positives:
Contact (1997)
George of the Jungle (1997)
Event Horizon (1997)
Top movies:
Air Force One (1997)
Conspiracy Theory (1997)
Kiss the Girls (1997)
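The recommendation function above ranks every movie, so titles the user has already rated can in principle appear among the top results. A minimal sketch of one way to exclude them is shown below; the scores and item indices are invented, and `top_unseen` is a hypothetical helper, not part of LightFM.

```python
import numpy as np

def top_unseen(scores, known_items, n=3):
    """Indices of the n highest-scored items the user has not interacted with."""
    masked = scores.astype(float)   # astype returns a copy, so the input stays intact
    masked[known_items] = -np.inf   # push known items to the bottom of the ranking
    return np.argsort(-masked)[:n]

scores = np.array([2.0, 0.5, 1.5, 3.0, 1.0])  # hypothetical predicted scores
known = np.array([3, 0])                       # items already rated by the user

print(top_unseen(scores, known, n=2))
```

In the article's setting, `known_items` could be taken from the `.indices` of the user's row in the training matrix, the same lookup used for the known positives.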