Recommender System Case Study

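Abstract

This article walks through the following recommendation algorithms and how each is tuned:

1. Baseline algorithm

2. Item-based collaborative filtering

3. Item-based collaborative filtering combined with the baseline algorithm

4. Item-based collaborative filtering (top-K + baseline)


Movie dataset URL: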

http://files.grouplens.org/datasets/movielens/ml-100k.zip

Baseline algorithm

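How the baseline algorithm works: the NaN entries of the user rating matrix are filled with item_mean[item] + user_mean[user] - all_mean, which serves as the predicted rating of a user for an item they have not rated. Here item_mean is the mean of all users' ratings for a given item, user_mean is the mean of a given user's ratings over all the movies they rated, and all_mean is the mean of all ratings across all items.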
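To make the formula concrete, here is a minimal sketch on a made-up 3x3 rating matrix. The matrix `toy` and the helper name `baseline_matrix` are illustrative and not part of the original data or code; 0 marks a missing rating, and every user and item in the toy matrix has at least one rating so no mean is undefined.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 = not rated.
toy = np.array([[5, 3, 0],
                [4, 0, 1],
                [0, 2, 2]], dtype=float)

def baseline_matrix(r):
    """Return item_mean + user_mean - all_mean for every (user, item) pair."""
    rated = r != 0
    all_mean = r[rated].mean()
    user_mean = r.sum(axis=1) / rated.sum(axis=1)   # one value per user
    item_mean = r.sum(axis=0) / rated.sum(axis=0)   # one value per item
    # Broadcast the per-item row against the per-user column.
    return item_mean[np.newaxis, :] + user_mean[:, np.newaxis] - all_mean

pred = baseline_matrix(toy)
print(np.round(pred, 3))
```

For user 0 and the unrated item 2, the item mean is 1.5, the user mean is 4 and the global mean is 17/6, so the baseline estimate is 1.5 + 4 - 17/6.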

First, look at the structure of the data: [user_id, movie_id, rating, timestamp]

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960

Read the data with pandas

import numpy as np
import pandas as pd
title=['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u1.base",sep='\t',names = title)

Check the largest user and item IDs, which determine the dimensions of the rating matrix

print(np.max(df['user_id']), np.max(df['item_id']))
943 1682

Build the rating matrix ratings

ratings = np.zeros((943, 1682))
for row in df.itertuples():
    ratings[row[1]-1,row[2]-1] = row[3]
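The loop above can also be written without Python-level iteration by using NumPy index arrays. The following is a minimal sketch on a made-up three-row frame (its contents are illustrative only), assuming each (user, item) pair appears at most once and IDs start at 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: three ratings from two users on three items.
df = pd.DataFrame({'user_id': [1, 1, 2],
                   'item_id': [1, 3, 2],
                   'rating':  [5, 4, 3]})

# Matrix dimensions come from the largest IDs, as in the article.
ratings = np.zeros((df['user_id'].max(), df['item_id'].max()))

# IDs are 1-based, matrix indices are 0-based.
ratings[df['user_id'].values - 1, df['item_id'].values - 1] = df['rating'].values
print(ratings)
```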

Check the density of the rating matrix

density = float(len(ratings.nonzero()[0]))
density /= (ratings.shape[0] * ratings.shape[1])
density *= 100
print('Training matrix density: {:4.2f}%'.format(density))
Training matrix density: 5.04%

The rating matrix is very sparse: about 95% of its entries are empty.

To start the baseline algorithm, first compute item_mean, user_mean and all_mean

all_mean = np.mean(ratings[ratings != 0])
# Means over rated entries only; a user or item with no ratings gives 0/0 = NaN
user_mean = ratings.sum(axis=1) / (ratings != 0).sum(axis=1)
item_mean = ratings.sum(axis=0) / (ratings != 0).sum(axis=0)
# Fill any NaN in user_mean and item_mean with all_mean
user_mean = np.where(np.isnan(user_mean), all_mean, user_mean)
item_mean = np.where(np.isnan(item_mean), all_mean, item_mean)
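The NaN fallback matters whenever an item (or user) has no ratings in the training split. A minimal sketch of that case, using a made-up matrix in which the last item is unrated; the `np.errstate` block is an addition here that only silences the expected 0/0 warning:

```python
import numpy as np

# Hypothetical toy matrix in which item 2 has no ratings at all.
toy = np.array([[5.0, 3.0, 0.0],
                [4.0, 0.0, 0.0]])

rated = toy != 0
all_mean = toy[rated].mean()

# 0/0 yields NaN; the warning is silenced because the NaN is handled just below.
with np.errstate(divide='ignore', invalid='ignore'):
    item_mean = toy.sum(axis=0) / rated.sum(axis=0)

# Items without ratings fall back to the global mean.
item_mean = np.where(np.isnan(item_mean), all_mean, item_mean)
print(item_mean)
```

Here the global mean is (5 + 3 + 4) / 3 = 4, so the unrated item receives 4 as its mean.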

Predict the rating of a user for an item

def predict_naive(user, item):
    prediction = item_mean[item] + user_mean[user] - all_mean
    return prediction

Use the root mean squared error (RMSE) to measure accuracy

def rmse(pred, actual):
    '''Compute the RMSE of the predictions over entries with a non-zero actual rating'''
    from sklearn.metrics import mean_squared_error
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return np.sqrt(mean_squared_error(actual, pred))
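For readers who want to check the metric by hand, the same quantity can be computed with NumPy alone. This is a minimal sketch; the helper name `rmse_numpy` and the sample values are illustrative and not from the original.

```python
import numpy as np

def rmse_numpy(pred, actual):
    """RMSE over the entries whose actual rating is non-zero (illustrative helper)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mask = actual != 0
    return np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2))

# Errors are -1, 0 and 2, so the mean squared error is (1 + 0 + 4) / 3.
value = rmse_numpy([3.0, 4.0, 5.0], [4.0, 4.0, 3.0])
print(value)
```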

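Evaluate the algorithm on the test set

# Evaluate on the test set
# test_df is the test-set DataFrame, read with pd.read_csv in the same way as df
# (the original does not show which file was loaded for this run)
predictions = []
actuals = []
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_naive(user, item))
    actuals.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(actuals)))

Test RMSE: 0.9344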

Item-based collaborative filtering

# Compute the item and user cosine-similarity matrices
user_s = ratings.dot(ratings.T)
item_s = ratings.T.dot(ratings)
user_norm = np.array([np.sqrt(np.diagonal(user_s))])
item_norm = np.array([np.sqrt(np.diagonal(item_s))])
user_sim = (user_s/user_norm/user_norm.T)
item_sim = (item_s/item_norm/item_norm.T)
print(np.round(item_sim[:10, :10], 3))

[[ 1.     0.296  0.279  0.388  0.252  0.114  0.518  0.41   0.416  0.199]
 [ 0.296  1.     0.177  0.405  0.211  0.099  0.331  0.31   0.207  0.152]
 [ 0.279  0.177  1.     0.275  0.118  0.104  0.311  0.125  0.207  0.121]
 [ 0.388  0.405  0.275  1.     0.265  0.091  0.411  0.391  0.357  0.219]
 [ 0.252  0.211  0.118  0.265  1.     0.016  0.28   0.214  0.202  0.031]
 [ 0.114  0.099  0.104  0.091  0.016  1.     0.128  0.065  0.164  0.139]
 [ 0.518  0.331  0.311  0.411  0.28   0.128  1.     0.342  0.43   0.279]
 [ 0.41   0.31   0.125  0.391  0.214  0.065  0.342  1.     0.364  0.166]
 [ 0.416  0.207  0.207  0.357  0.202  0.164  0.43   0.364  1.     0.25 ]
 [ 0.199  0.152  0.121  0.219  0.031  0.139  0.279  0.166  0.25   1.   ]]
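The matrix expression above is cosine similarity computed for all item pairs at once: the Gram matrix holds the dot products, and its diagonal holds the squared column norms. A minimal sketch on a made-up 3x3 matrix (the values are illustrative) that cross-checks one entry against the textbook definition:

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items.
toy = np.array([[5.0, 3.0, 0.0],
                [4.0, 0.0, 1.0],
                [0.0, 2.0, 2.0]])

gram = toy.T.dot(toy)                              # item-by-item dot products
norm = np.sqrt(np.diagonal(gram))[np.newaxis, :]   # 1 x n_items row of column norms
sim = gram / norm / norm.T                         # divide entry (i, j) by |i| * |j|

# Cross-check entry (0, 1) against cos(a, b) = a.b / (|a| |b|).
a, b = toy[:, 0], toy[:, 1]
direct = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim[0, 1], direct)
```

Because missing ratings are stored as 0, they contribute nothing to the dot products but they do count as zeros in the vectors, which is the convention the article's code follows.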

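Rating prediction method

def predict_itemCF(user, item):
    '''Item-based collaborative filtering: predict a rating as the
    similarity-weighted mean of the ratings the user has already given'''
    nzero = ratings[user].nonzero()[0]
    prediction = ratings[user, nzero].dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero])
    return prediction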
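The prediction is a weighted average: each rating the user has given is weighted by how similar that item is to the target item. A minimal worked sketch with made-up numbers (both arrays are hypothetical and stand in for one row of ratings and one row of item_sim):

```python
import numpy as np

# Hypothetical inputs for a single user and a single target item (index 1).
user_ratings = np.array([5.0, 0.0, 3.0, 4.0])    # 0 = not rated
sim_to_target = np.array([0.5, 1.0, 0.2, 0.3])   # similarity of each item to item 1

rated = user_ratings.nonzero()[0]                # indices 0, 2, 3
weights = sim_to_target[rated]

# (5*0.5 + 3*0.2 + 4*0.3) / (0.5 + 0.2 + 0.3)
prediction = user_ratings[rated].dot(weights) / weights.sum()
print(prediction)
```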

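Test the predictions. Note that the code below evaluates on u3.test although the model was built from u1.base. In ml-100k the five test files u1.test to u5.test are disjoint, so every rating in u3.test is also present in u1.base; the held-out counterpart of u1.base is u1.test. The figures reported in the rest of this article come from the code as written.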

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-based collaborative filtering...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF(user, item))
    targets.append(actual)

print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

Test set size: 20000
Predicting with item-based collaborative filtering...
Test RMSE: 0.9534

Item-based collaborative filtering combined with the baseline algorithm

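def predict_itemCF_baseline(user, item):
    '''Item-based CF combined with the baseline: predict a rating'''
    nzero = ratings[user].nonzero()[0]
    # Baseline estimate of this user's rating for every item
    baseline = item_mean + user_mean[user] - all_mean
    # Weighted average of the deviations from the baseline, added back to the target's baseline
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero]) + baseline[item]
    return prediction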
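The difference from plain item-based filtering is that the weighted average is taken over deviations from the baseline rather than over raw ratings, and the target item's baseline is added back at the end. A minimal worked sketch with made-up numbers (all three arrays are hypothetical):

```python
import numpy as np

# Hypothetical inputs for a single user and target item 1.
user_ratings = np.array([5.0, 0.0, 3.0, 4.0])    # 0 = not rated
sim_to_target = np.array([0.5, 1.0, 0.2, 0.3])   # similarity of each item to item 1
baseline = np.array([4.0, 3.5, 3.0, 4.5])        # baseline estimate for this user, per item
target = 1

rated = user_ratings.nonzero()[0]
weights = sim_to_target[rated]
residuals = user_ratings[rated] - baseline[rated]   # 1.0, 0.0, -0.5

# Weighted mean residual (0.35) plus the target's baseline (3.5).
prediction = residuals.dot(weights) / weights.sum() + baseline[target]
print(prediction)
```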

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)

print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE: 0.8794

Fix out-of-range ratings: predictions above 5 are set to 5, and predictions below 1 are set to 1

def predict_itemCF_baseline(user, item):
    '''Item-based CF combined with the baseline: predict a rating, clipped to [1, 5]'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero]) + baseline[item]
    if prediction > 5:
        prediction = 5
    if prediction < 1:
        prediction = 1
    return prediction
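The two `if` statements can also be expressed with `np.clip`, which works on scalars and arrays alike. A minimal sketch with made-up raw predictions:

```python
import numpy as np

raw = np.array([5.7, 0.4, 3.2])   # hypothetical raw predictions
clipped = np.clip(raw, 1, 5)      # values below 1 become 1, values above 5 become 5
print(clipped)
```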

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)

print('Test RMSE after clipping ratings: %.4f' % rmse(np.array(predictions), np.array(targets)))


Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE after clipping ratings: 0.8793

Item-based collaborative filtering (top-K + baseline)

print('------ Top-k collaborative filtering (item-based + baseline) ------')
def predict_topkCF(user, item, k=10):
    '''Top-k CF: item-based collaborative filtering combined with the baseline,
    using only the k rated items most similar to the target item'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    # Among the items this user rated, keep the k most similar to the target item
    choice = nzero[item_sim[item, nzero].argsort()[::-1][:k]]
    prediction = (ratings[user, choice] - baseline[choice]).dot(item_sim[item, choice])\
                / sum(item_sim[item, choice]) + baseline[item]
    if prediction > 5: prediction = 5
    if prediction < 1: prediction = 1
    return prediction

print('Loading the test set...')
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with top-K collaborative filtering...')
k = 20
print('Chosen K: %d.' % k)
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_topkCF(user, item, k))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

------ Top-k collaborative filtering (item-based + baseline) ------
Loading the test set...
Test set size: 20000
Predicting with top-K collaborative filtering...
Chosen K: 20.
Test RMSE: 0.7799
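The value K = 20 is given without showing how it was chosen. One common way to pick it is to evaluate several candidates on held-out ratings and compare the RMSE. The sketch below does this on a small synthetic matrix so that it runs on its own; the data, the candidate list and the names `full`, `train`, `test`, `predict_topk` and `results` are all assumptions made for illustration, and the numbers it prints say nothing about MovieLens.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 30, 20

# Synthetic stand-in for the data: integer ratings 1..5, roughly 40% observed.
full = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
full[rng.random((n_users, n_items)) > 0.4] = 0
# Let user 0 rate every item and item 0 be rated by everyone,
# so that no mean, norm or similarity sum becomes 0/0.
full[0, :] = np.where(full[0, :] == 0, 3.0, full[0, :])
full[:, 0] = np.where(full[:, 0] == 0, 3.0, full[:, 0])

# Hold out every 4th observed rating outside row 0 / column 0 as the test set.
users, items = np.nonzero(full[1:, 1:])
users, items = users[::4] + 1, items[::4] + 1
test = list(zip(users, items, full[users, items]))
train = full.copy()
train[users, items] = 0          # held-out ratings are hidden from training

rated = train != 0
all_mean = train[rated].mean()
user_mean = train.sum(axis=1) / rated.sum(axis=1)
item_mean = train.sum(axis=0) / rated.sum(axis=0)
gram = train.T.dot(train)
norm = np.sqrt(np.diagonal(gram))[np.newaxis, :]
item_sim = gram / norm / norm.T

def predict_topk(user, item, k):
    """Top-k item-based CF with baseline, mirroring the article's formula."""
    nzero = train[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    # argsort is ascending, so reverse it and keep the k most similar rated items.
    choice = nzero[item_sim[item, nzero].argsort()[::-1][:k]]
    w = item_sim[item, choice]
    pred = (train[user, choice] - baseline[choice]).dot(w) / w.sum() + baseline[item]
    return float(np.clip(pred, 1, 5))

results = {}
for k in (1, 5, 10, 20):
    errors = [predict_topk(u, i, k) - r for u, i, r in test]
    results[k] = float(np.sqrt(np.mean(np.square(errors))))
    print('k = %2d  rmse = %.4f' % (k, results[k]))
```

On real data the same loop would call the article's predict_topkCF over a test file that is disjoint from the training file.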

 
