Background:
Python Programming course at China University of Mining and Technology (Beijing).
Textbook: 《Python语言程序设计》 (机械工业出版社 / China Machine Press).
The lab is Section 9.3 "Movie Recommendation Model" in Chapter 9, Comprehensive Examples, pp. 225-234.
Environment: the university's big-data teaching platform (or Jupyter Notebook).
PS: These are my personal study notes, an informal log of things I learned and discussed with friends. Please do not copy or repost them directly; I would hate to mislead anyone.
For questions, visit 太华's homepage or message me by email; I reply during working hours, roughly 8 a.m. to evening.
I. The algorithm (you can also just consult the textbook):
UserCF, recommendation based on user similarity:
① Rating matrix: the dataset holds 100,000 ratings given by 943 users to 1,682 movies, and every user has rated at least 20 movies. The rating matrix is an m×n matrix M with m = 943 (users) and n = 1682 (movies), where Mij is user i's rating of movie j; if user i has not watched movie j, Mij is 0. Row i of M is called user i's behavior vector.
PS: ItemCF, the item-similarity counterpart, works the same way except that its rating matrix is n×m, with n = 1682 rows for movies and m = 943 columns for users; every other computation is identical.
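The rating matrix described in ① (and its ItemCF transpose) can be sketched with toy numbers; the ids and ratings below are invented for illustration, not taken from the real dataset:

```python
import numpy as np

# Hypothetical (user_id, item_id, rating) triples, 1-based ids, mirroring
# the tab-separated layout of the MovieLens u.data file.
ratings = [(1, 1, 5), (1, 3, 4), (2, 2, 3), (3, 1, 2)]

USERS, ITEMS = 3, 3  # toy sizes; the real data has 943 users and 1682 movies

M = np.zeros((USERS, ITEMS))
for u, i, r in ratings:
    M[u - 1, i - 1] = r  # unwatched movies keep the default 0

# Row i of M is user (i+1)'s behavior vector; M.T is the ItemCF rating matrix.
```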
② Cosine similarity:
![](https://i-blog.csdnimg.cn/blog_migrate/c4823d0b411a1ac3b977120b3643a6b5.png)
③ Cosine similarity matrix: in the m×m matrix C (m = 943), element Cij is the cosine similarity between users i and j. From the definition of cosine similarity, if every row of the rating matrix M is L2-normalized to unit length, giving a matrix H, then C can be computed directly as:
C = H·Hᵀ
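A minimal NumPy sketch of this step (toy 3×3 matrix, invented values): each nonzero row is scaled to unit L2 norm to form H, and one matrix product then yields every pairwise cosine similarity.

```python
import numpy as np

# Toy rating matrix: users 1 and 2 have identical tastes, user 3 is disjoint.
M = np.array([[5.0, 0.0, 3.0],
              [5.0, 0.0, 3.0],
              [0.0, 4.0, 0.0]])

# L2-normalize each row; all-zero rows are left as zeros to avoid division by 0.
norms = np.linalg.norm(M, axis=1, keepdims=True)
H = np.divide(M, norms, out=np.zeros_like(M), where=norms != 0)

C = H @ H.T  # C[i][j] is the cosine similarity of users i+1 and j+1
```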
④ Top-k recommendation:
![](https://i-blog.csdnimg.cn/blog_migrate/0dc70066341acaaf465f4f0c2a3b7911.png)
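The top-k step in the figure can be sketched as follows, with invented similarity values; it mirrors the weighted-sum scoring that `choose_k_ids` and `get_topn_recom_list` implement in the source code further down (e.g. movie 10 scores 0.5·4 + 0.2·3 = 2.6).

```python
# User 1's k most similar neighbours and their cosine similarities (made up).
k_sims = {2: 0.5, 3: 0.2}
# Training ratings: neighbour id -> {movie id: rating} (made up).
train = {2: {10: 4}, 3: {10: 3, 11: 5}}

# Score each movie by the similarity-weighted sum of the neighbours' ratings.
scores = {}
for user, sim in k_sims.items():
    for movie, rate in train.get(user, {}).items():
        scores[movie] = scores.get(movie, 0.0) + sim * rate

# Recommend the n highest-scoring movies (n = 2 here).
top_n = sorted(scores, key=scores.get, reverse=True)[:2]
```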
⑤ Evaluation metrics:
![](https://i-blog.csdnimg.cn/blog_migrate/6fbbf27a19792b71b262d8559b08446c.png)
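A toy sketch of the metrics (invented numbers). Note that the source code below uses a rating-weighted variant: a recommended movie that appears in the test set contributes its rating to both the hit count and the recommended total, while a miss contributes 1; recall divides by the sum of all test-set ratings.

```python
recommended = [10, 11, 12]     # the model's top-n list for one user (made up)
test_ratings = {10: 4, 13: 5}  # movies that user actually rated in the test set

hit = sum(r for m, r in test_ratings.items() if m in recommended)
recom_total = sum(test_ratings[m] if m in test_ratings else 1 for m in recommended)
have_total = sum(test_ratings.values())

precision = hit / recom_total   # 4 / (4 + 1 + 1)
recall = hit / have_total       # 4 / (4 + 5)
f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
```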
II. The experiment on the big-data platform:
# A gripe first: CUMTB's big-data platform really does leave room for improvement in performance and reliability. The three labs took nine hours in total, seven of them on this one. The labs themselves are easy (the textbook even supplies the source code), but the platform's virtual machines are quite the experience. Still, work through it patiently and you will finish it fine.
Steps:
First a quick walkthrough of the case, then the problems that came up.
①: Set up the Python 3 environment:
Follow the textbook steps, as in the earlier library management system lab.
For the concrete operations see the video: 大数据Python实验—如何配置Python3环境
②: Then check whether the environment can open Jupyter Notebook:
![](https://i-blog.csdnimg.cn/blog_migrate/fc49ae644800649eb7cf2429322ac2af.png)
If it cannot be opened, try the following:
- Open a Linux terminal and run: jupyter notebook --allow-root
- Open another Linux terminal and run: jupyter lab --ip 0.0.0.0 --port 8888 --allow-root
- You may also need:
![](https://i-blog.csdnimg.cn/blog_migrate/8cb71073316dfb2d951a1be76c780622.jpeg)
Once it opens successfully, it looks like this:
![](https://i-blog.csdnimg.cn/blog_migrate/3a817afd06d3bce2dd279b2ec1a0dcc7.png)
Now you can open a notebook:
![](https://i-blog.csdnimg.cn/blog_migrate/5fe09f72aaeb1854045ad3fb37380b8e.png)
Then upload the files:
ml-100k: download from http://files.grouplens.org/datasets/movielens/ml-100k.zip
movie_recom.zip: contains four empty folders, item_matrices, itemcf_test, user_matrices and usercf_test, to hold the outputs.
After the upload, open another Linux terminal and run:
cd /root/tools/
unzip movie_recom.zip -d /root/tools/
unzip ml-100k.zip -d /root/tools/
As shown:
![](https://i-blog.csdnimg.cn/blog_migrate/3b1c732a4cadc6d73a9a8d63ceadd9c9.png)
Make sure ml-100k and main.py end up in the same folder.
The source code is below.
It runs on the big-data platform without modification.
See the textbook for the full commentary on each part.
from docopt import docopt
import argparse
import numpy as np
import pandas as pd
import os
USER_NUMBER = 943
ITEM_NUMBER = 1682
def get_vector_dict(filename, mode):
with open('./ml-100k/' + filename, 'r') as f:
if mode == 'u':
user_vector_dict = {}
for line in f:
split_line = line.split('\t')
userid,itemid,rate = list(map(int, split_line[:3]))
if not userid in user_vector_dict:
user_vector_dict[userid] = {itemid: rate}
else:
user_vector_dict[userid][itemid] = rate
return user_vector_dict
else:
item_vector_dict = {}
for line in f:
split_line = line.split('\t')
userid,itemid,rate = list(map(int, split_line[:3]))
if not itemid in item_vector_dict:
item_vector_dict[itemid] = {userid: rate}
else:
item_vector_dict[itemid][userid] = rate
return item_vector_dict
def get_ui_matrix(filename, mode):
print('Getting rating matrix...')
ui_matrix = np.zeros((USER_NUMBER, ITEM_NUMBER))
vector_dict = get_vector_dict(filename,'u')
for userid in vector_dict:
ui_vector = vector_dict[userid]
for itemid in ui_vector:
ui_matrix[userid - 1][itemid - 1] = ui_vector[itemid]
if mode == 'u':
return ui_matrix
else:
return ui_matrix.transpose()
# Compute the cosine similarity matrix of users or movies (depending on the ui_matrix passed in).
# Take the L2 norm of each row of the rating matrix M to make every row a unit vector; M·Mᵀ is then the cosine similarity matrix.
def get_cos_correlation(ui_matrix):
print('Getting cos correlation matrix...')
row_norm2 = list(map(np.linalg.norm, ui_matrix))
for i in range(len(row_norm2)):
if not row_norm2[i] == 0:
ui_matrix[i] /= row_norm2[i]
cos_correlation = np.dot(ui_matrix, ui_matrix.transpose())
return cos_correlation
def train(model):
    # Train the usercf model (compute the user-user cosine similarity matrix).
    # Read each of the 5 pre-split training sets, train on each, and write the results to files.
if model == 'usercf' or model =='u':
for i in range(5):
            print('Training set %s' % (i + 1) + ' of model usercf...')
ui_matrix = get_ui_matrix('u' + str(i + 1) + '.base', mode='u')
user_cos_correlation = get_cos_correlation(ui_matrix)
with open('./user_matrices/user_cor_matrix'+str(i+1),'w') as f:
for row in user_cos_correlation:
row_ = list(map(round, row, [8] * len(row)))
f.write('\t'.join(map(str,row_)) + '\n')
    # Train the itemcf model (compute the movie-movie cosine similarity matrix).
    # Read each of the 5 pre-split training sets, train on each, and write the results to files.
elif model == 'itemcf' or model == 'i':
for i in range(5):
print('Training set %s' % (i + 1) + ' of model itemcf...')
ui_matrix = get_ui_matrix('u' + str(i + 1) + '.base', mode='i')
item_cos_correlation = get_cos_correlation(ui_matrix)
with open('./item_matrices/item_cor_matrix'+str(i+1),'w') as f:
for row in item_cos_correlation:
row_ = list(map(round, row, [8]*len(row)))
f.write('\t'.join(map(str,row_)) + '\n')
else:
        print("Argument -m <model> error! Training interrupted.")
exit()
# Load the file holding the user (or movie) cosine similarity matrix produced by training.
def load_cos_correlation(filename, mode):
print('Getting cos correlation matrix...')
cos_correlation = []
if mode == 'u':
f = open('./user_matrices/' + filename, 'r')
else:
f = open('./item_matrices/' + filename, 'r')
for line in f:
cos_correlation.append(list(map(float, line.split('\t'))))
f.close()
return np.array(cos_correlation)
# Pick the k users (movies) most similar to id_ (largest, nonzero cosine similarity).
# Whether they are users or movies depends on the cos_correlation passed in.
def choose_k_ids(cos_correlation, id_, k):
id_cor = cos_correlation[id_ - 1]
    # Converting to a pd.Series makes it easy to take the top-k indices (ids) together with their cosine similarity values.
id_cor_series = pd.Series(id_cor)
k_ids_series = id_cor_series.sort_values(ascending=False)[:k]
k_ids_series = k_ids_series[k_ids_series > 0]
k_ids_series.index += 1
k_ids_dict = dict(k_ids_series)
    # Remove the user (movie) itself; popping with a default avoids a KeyError if it is absent.
    k_ids_dict.pop(id_, None)
return k_ids_dict
# Pick the top n movies (or users) to recommend; train_vector_dict comes from get_vector_dict.
# Each candidate's score is the sum of the neighbours' ratings weighted by their cosine similarities; candidates are recommended in order of this score.
def get_topn_recom_list(n,k_ids_dict,train_vector_dict):
recom_dict={}
for id1 in k_ids_dict:
if id1 in train_vector_dict:
id1_vector_dict=train_vector_dict[id1]
for id2 in id1_vector_dict:
if not id2 in recom_dict:
recom_dict[id2]=k_ids_dict[id1]*id1_vector_dict[id2]
else:
recom_dict[id2]+=k_ids_dict[id1]*id1_vector_dict[id2]
recom_list=[[itemid,recom_dict[itemid]] for itemid in recom_dict]
recom_list=sorted(recom_list,key=lambda x:x[1],reverse=True)
recom_list=recom_list[:n]
recom_list=[i[0] for i in recom_list]
return recom_list
def get_prf(i,mode):
k=100
if mode =='u':
        print('Testing Usercf. Getting train/test-set %s' % i + "'s evaluating scores... Please wait a minute.")
n=20
matrix_filename='user_cor_matrix'+str(i)
else:
        print('Testing Itemcf. Getting train/test-set %s' % i + "'s evaluating scores... Please wait a minute.")
n=10
matrix_filename='item_cor_matrix'+str(i)
train_filename='u'+str(i)+'.base'
test_filename='u'+str(i)+'.test'
train_vector_dict=get_vector_dict(train_filename,mode)
test_vector_dict=get_vector_dict(test_filename,mode)
cos_correlation=load_cos_correlation(matrix_filename,mode)
can_recom_ids=sorted(list(set(train_vector_dict)&set(test_vector_dict)))
pres=[]
recs=[]
fs=[]
for id_ in can_recom_ids:
k_ids_dict=choose_k_ids(cos_correlation,id_,k)
recom_list=get_topn_recom_list(n,k_ids_dict,train_vector_dict)
have_dict=test_vector_dict[id_]
int_num=0
recom_num=0
have_num=0
for id_ in recom_list:
if id_ in have_dict:
int_num+=have_dict[id_]
recom_num+=have_dict[id_]
else:
recom_num+=1
for id_ in have_dict:
have_num+=have_dict[id_]
pre=int_num/recom_num
rec=int_num/have_num
pres.append(pre)
recs.append(rec)
if pre==0 and rec==0:
fs.append(0)
else:
fs.append(2*pre*rec/(pre+rec))
return pres,recs,fs
def write_prf(i, mode):
prf = get_prf(i, mode)
max_prf_ = list(map(max,prf))
average_prf_ = list(map(np.mean, prf))
max_prf = list(map(str, max_prf_))
average_prf = list(map(str,average_prf_))
if mode == 'u':
f = open('./usercf_test/user_test' + str(i) + '_conclusion', 'w')
else:
f = open('./itemcf_test/item_test' + str(i) + '_conclusion', 'w')
f.write('precision_max: ' + max_prf[0] + '\n')
f.write('precision_average: ' + average_prf[0] + '\n')
f.write('recall_max: ' + max_prf[1] + '\n')
f.write('recall_average: ' + average_prf[1] + '\n')
f.write('f1_score_max: ' + max_prf[2] + '\n')
f.write('f1_score_mean: ' + average_prf[2])
f.close()
return max_prf_, average_prf_
# Cross-validation: average each metric over the 5 splits to evaluate the whole model.
def cross_valid(prfs,mode):
print('Calculating cross validation...')
max_pre = []
mean_pre = []
max_rec = []
mean_rec = []
max_f = []
mean_f = []
for i in range(5):
max_pre.append(prfs[i][0][0])
max_rec.append(prfs[i][0][1])
max_f.append(prfs[i][0][2])
mean_pre.append(prfs[i][1][0])
mean_rec.append(prfs[i][1][1])
mean_f.append(prfs[i][1][2])
if mode == 'u':
f = open('./usercf_test/user_cross_valid', 'w')
else:
f = open('./itemcf_test/item_cross_valid', 'w')
f.write('precision_max: ' + str(np.mean(max_pre)) + '\n')
f.write('precision_average: ' + str(np.mean(mean_pre)) + '\n')
f.write('recall_max: ' + str(np.mean(max_rec)) + '\n')
f.write('recall_average: ' + str(np.mean(mean_rec)) + '\n')
f.write('f1_score_max: ' + str(np.mean(max_f)) + '\n')
f.write('f1_score_mean: ' + str(np.mean(mean_f)))
f.close()
def test(model):
    # Test the usercf model.
if model == 'usercf' or model == 'u':
prfs = []
for i in range(1,6):
prfs.append(write_prf(i, 'u'))
cross_valid(prfs, 'u')
    # Test the itemcf model.
elif model == 'itemcf' or model == 'i':
prfs =[]
for i in range(1,6):
prfs.append(write_prf(i, 'i'))
cross_valid(prfs, 'i')
else:
        print("Argument -m <model> error! Testing interrupted.")
def recommend(id_, model):
print(model)
if model == 'user':
try:
id_ = int(id_)
except:
print('Please input valid user id!')
exit()
if not 1 <= id_ <= USER_NUMBER:
            print('User with id %s' % id_ + ' does not exist!')
exit()
if not os.path.exists('./user_matrices/all_user_cor_matrix'):
print("Getting all users' cos correlation matrix...please wait a minute.")
ui_matrix = get_ui_matrix('u.data',mode='u')
cos_correlation=get_cos_correlation(ui_matrix)
with open('./user_matrices/all_user_cor_matrix','w') as f:
for row in cos_correlation:
row_=list(map(round, row ,[8] * len(row)))
f.write('\t'.join(map(str,row_))+'\n')
n=20
k=100
user_cor_matrix=load_cos_correlation('all_user_cor_matrix', mode='u')
train_vector_dict=get_vector_dict('u.data', mode='u')
k_ids=choose_k_ids(user_cor_matrix, id_, k)
recom_list=get_topn_recom_list(n,k_ids,train_vector_dict)
print('Recommend movies to the user with id %s' % id_+':')
print(recom_list)
if model == 'item':
try:
id_=int(id_)
except:
print('please input valid movie id!')
exit()
        if not 1 <= id_ <= ITEM_NUMBER:
            print('Movie with id %s' % id_ + ' does not exist!')
exit()
        if not os.path.exists('./item_matrices/all_item_cor_matrix'):
print("Getting all items' cos correlation matrix...Please wait a minute.")
ui_matrix = get_ui_matrix('u.data',mode='i')
cos_correlation=get_cos_correlation(ui_matrix)
with open('./item_matrices/all_item_cor_matrix','w') as f:
for row in cos_correlation:
row_= list(map(round, row, [8]*len(row)))
f.write('\t'.join(map(str, row_)) + '\n')
n =10
k=100
user_cor_matrix = load_cos_correlation('all_item_cor_matrix', mode='i')
train_vector_dict = get_vector_dict('u.data', mode='i')
k_ids = choose_k_ids(user_cor_matrix, id_, k)
recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
print('Recommend users to the movie with id %s' % id_+ ':')
print(recom_list)
def main():
    parser = argparse.ArgumentParser(description='main.py arguments')
parser.add_argument('-t','--train', action='store_true', default=False)
parser.add_argument('-p','--test', action='store_true', default=False)
parser.add_argument('-m','--model', default=False)
parser.add_argument('-u','--usercf', default=False)
parser.add_argument('-i','--itemcf', default=False)
args = parser.parse_args()
print(args)
print(type(args))
if args.train:
if args.model:
train(args.model)
else:
train('usercf')
train('itemcf')
if args.test:
if args.model:
test(args.model)
else:
test('usercf')
test('itemcf')
if args.usercf:
recommend(args.usercf, 'user')
if args.itemcf:
recommend(args.itemcf,'item')
if __name__=='__main__':
main()
Once all files are in place, it looks like this:
![](https://i-blog.csdnimg.cn/blog_migrate/9032f1e1484ac2d876d197ab1c0635c6.png)
Now the experiment proper can begin.
1. First run:
pip install docopt==0.6.2
to install the docopt module.
PS: if the install fails, open a Linux terminal and update pip:
wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
tar -xf pip-1.5.4.tar.gz
cd pip-1.5.4
python setup.py install
![](https://i-blog.csdnimg.cn/blog_migrate/b771f24778db5cfb4f140468fb991c4f.png)
2. Then run:
%run main.py -h
to check that docopt installed correctly and the script loads.
3. Train the models:
%run main.py -t
![](https://i-blog.csdnimg.cn/blog_migrate/4bfb767b6609dabcb826d8fb454832ca.png)
4. Then test the models and request recommendations:
%run main.py -p
%run main.py -u 3
%run main.py -i 3
![](https://i-blog.csdnimg.cn/blog_migrate/773ae7c898bdd567b4e6c4d2e3454e71.png)
That concludes the experiment.
![](https://i-blog.csdnimg.cn/blog_migrate/294c5cebeb7799b6e9c7f430275897ce.png)
![](https://i-blog.csdnimg.cn/blog_migrate/d5d126a3f587623815277ff6b894dca6.png)
(I originally meant to record a complete video, but was too lazy, so make do with this. For questions, email me via 太华's homepage, or leave a comment below or a CSDN private message.)
III. Running the experiment directly in Jupyter on your own computer
The steps are exactly the same as above,
but the code needs a bit more care.
![](https://i-blog.csdnimg.cn/blog_migrate/c7c0d83f810fbbb8ea8810a9875bcfbd.png)
For the unzipped data, simply putting everything in one folder is enough.
Then edit the paths in the code accordingly.
For example:
#!/usr/bin/env python
# coding: utf-8
from docopt import docopt
import argparse
import numpy as np
import pandas as pd
import os
USER_NUMBER = 943
ITEM_NUMBER = 1682
# Read a data file and return a dict.
# If mode is 'u', the result has the form {user1: {item1: rate1, item2: rate2, ...}, user2: {...}, ...};
# otherwise it is the analogous {item1: {user1: rate1, ...}, item2: {...}, ...}.
# Unlike the matrix M below, only nonzero values are kept, which saves space and makes the nonzero entries of each row easy to pull out.
def get_vector_dict(filename, mode):
    with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/ml-100k/' + filename, 'r') as f:
if mode == 'u':
user_vector_dict ={}
for line in f:
split_line = line.split('\t')
userid, itemid, rate = list(map(int, split_line[:3]))
if not userid in user_vector_dict:
user_vector_dict[userid] = {itemid: rate}
else:
user_vector_dict[userid][itemid] = rate
return user_vector_dict
else:
item_vector_dict = {}
for line in f:
split_line = line.split('\t')
userid, itemid, rate = list(map(int, split_line[:3]))
if not itemid in item_vector_dict:
item_vector_dict[itemid] = {userid: rate}
else:
item_vector_dict[itemid][userid] = rate
return item_vector_dict
# Return the rating matrix M. If mode is 'u', entry (i, j) of M is user i's rating of
# movie j (0 if the user has not rated it); otherwise return the transpose of that matrix.
def get_ui_matrix(filename, mode):
print('Getting rating matrix...')
ui_matrix = np.zeros((USER_NUMBER, ITEM_NUMBER))
vector_dict = get_vector_dict(filename, 'u')
for userid in vector_dict:
ui_vector = vector_dict[userid]
for itemid in ui_vector:
ui_matrix[userid - 1][itemid - 1] = ui_vector[itemid]
if mode == 'u':
return ui_matrix
else:
return ui_matrix.transpose()
def get_cos_correlation(ui_matrix):
print('Getting cos correlation matrix...')
row_norm2 = list(map(np.linalg.norm, ui_matrix))
for i in range(len(row_norm2)):
if not row_norm2[i] == 0:
ui_matrix[i] /= row_norm2[i]
cos_correlation = np.dot(ui_matrix, ui_matrix.transpose())
return cos_correlation
def train(model):
    # Train the usercf model (compute the user-user cosine similarity matrix).
    # Read each of the 5 pre-split training sets, train on each, and write the results to files.
if model == 'usercf' or model == 'u':
for i in range(5):
            print('Training set %s' % (i + 1) + ' of model usercf...')
ui_matrix = get_ui_matrix('u'+ str(i + 1) + '.base', mode='u')
user_cos_correlation = get_cos_correlation(ui_matrix)
with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/user_cor_matrix'+str(i+1), 'w') as f:
for row in user_cos_correlation:
                    row_ = list(map(round, row, [8] * len(row)))
                    f.write('\t'.join(map(str, row_)) + '\n')
    # Train the itemcf model (compute the movie-movie cosine similarity matrix).
    # Read each of the 5 pre-split training sets, train on each, and write the results to files.
elif model == 'itemcf' or model == 'i':
for i in range(5):
            print('Training set %s' % (i + 1) + ' of model itemcf...')
ui_matrix = get_ui_matrix('u'+ str(i + 1) + '.base', mode='i')
item_cos_correlation = get_cos_correlation(ui_matrix)
with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/item_cor_matrix'+str(i+1), 'w') as f:
for row in item_cos_correlation:
row_ = list(map(round, row, [8]*len(row)))
f.write('\t'.join(map(str, row_)) + '\n')
else:
        print("Argument -m <model> error! Training interrupted.")
exit()
# Load the file holding the user (or movie) cosine similarity matrix produced by training.
def load_cos_correlation(filename, mode):
print('Getting cos correlation matrix...')
cos_correlation = []
if mode == 'u':
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/' + filename, 'r')
else:
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/' + filename, 'r')
for line in f:
cos_correlation.append(list(map(float, line.split('\t'))))
f.close()
return np.array(cos_correlation)
# Pick the k users (movies) most similar to id_ (largest, nonzero cosine similarity);
# whether they are users or movies depends on the cos_correlation passed in.
def choose_k_ids(cos_correlation, id_, k):
id_cor = cos_correlation[id_ - 1]
id_cor_series = pd.Series(id_cor)
k_ids_series = id_cor_series.sort_values(ascending=False)[:k]
k_ids_series = k_ids_series[k_ids_series > 0]
k_ids_series.index += 1
    k_ids_dict = dict(k_ids_series)
    # Remove the user (movie) itself; popping with a default avoids a KeyError if it is absent.
    k_ids_dict.pop(id_, None)
return k_ids_dict
# Pick the top n movies (or users) to recommend; train_vector_dict comes from get_vector_dict.
# Each candidate's score is the sum of the neighbours' ratings weighted by their cosine similarities.
# E.g. if movie 1 was rated 3 and 4 by users 1 and 2, and user 3's similarities to them are
# 0.2 and 0.5, then movie 1's score for user 3 is 3*0.2 + 4*0.5 = 2.6.
# Recommending users to a movie works the same way.
def get_topn_recom_list(n, k_ids_dict, train_vector_dict):
recom_dict = {}
for id1 in k_ids_dict:
if id1 in train_vector_dict:
id1_vector_dict = train_vector_dict[id1]
for id2 in id1_vector_dict:
if not id2 in recom_dict:
recom_dict[id2] = k_ids_dict[id1] * id1_vector_dict[id2]
else:
recom_dict[id2] += k_ids_dict[id1] * id1_vector_dict[id2]
recom_list = [[itemid, recom_dict[itemid]] for itemid in recom_dict]
recom_list = sorted(recom_list, key=lambda x: x[1], reverse=True)
recom_list = recom_list[:n]
recom_list = [i[0] for i in recom_list]
return recom_list
def get_prf(i, mode):
k=100
if mode == 'u':
        print('Testing Usercf. Getting train/test-set %s' % i + "'s evaluating scores... Please wait a minute.")
n = 20
matrix_filename = 'user_cor_matrix' + str(i)
else:
        print('Testing Itemcf. Getting train/test-set %s' % i + "'s evaluating scores... Please wait a minute.")
n = 10
matrix_filename = 'item_cor_matrix' + str(i)
train_filename = 'u' + str(i)+'.base'
test_filename = 'u' + str(i) + '.test'
train_vector_dict = get_vector_dict(train_filename, mode)
test_vector_dict = get_vector_dict(test_filename, mode)
cos_correlation = load_cos_correlation(matrix_filename, mode)
    # Because of how the data set was split, only ids present in both the training and test sets can be evaluated, hence the intersection.
can_recom_ids = sorted(list(set(train_vector_dict) & set(test_vector_dict)))
pres = []
recs = []
fs = []
    # For each id that can be evaluated, build its recommendation list, then compute precision and recall against the test set.
for id_ in can_recom_ids:
k_ids_dict = choose_k_ids(cos_correlation, id_, k)
recom_list = get_topn_recom_list(n, k_ids_dict, train_vector_dict)
have_dict = test_vector_dict[id_]
int_num = 0
recom_num = 0
have_num = 0
for id_ in recom_list:
if id_ in have_dict:
int_num += have_dict[id_]
recom_num += have_dict[id_]
else:
recom_num += 1
for id_ in have_dict:
have_num += have_dict[id_]
pre = int_num / recom_num
rec = int_num / have_num
pres.append(pre)
recs.append(rec)
if pre == 0 and rec == 0:
fs.append(0)
else:
fs.append(2 * pre * rec / (pre + rec))
return pres, recs, fs
def write_prf(i, mode):
prf = get_prf(i, mode)
max_prf_ = list(map(max, prf))
average_prf_ = list(map(np.mean, prf))
max_prf = list(map(str, max_prf_))
average_prf = list(map(str, average_prf_))
if mode == 'u':
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/usercf_test/user_test' + str(i) + '_conclusion', 'w')
else:
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/itemcf_test/item_test' + str(i) + '_conclusion', 'w')
    f.write('precision_max: ' + max_prf[0] + '\n')
    f.write('precision_average: ' + average_prf[0] + '\n')
    f.write('recall_max: ' + max_prf[1] + '\n')
    f.write('recall_average: ' + average_prf[1] + '\n')
    f.write('f1_score_max: ' + max_prf[2] + '\n')
    f.write('f1_score_mean: ' + average_prf[2])
    f.close()
    return max_prf_, average_prf_
# Cross-validation: compute each metric under the 5 data splits, then average them to evaluate the whole model.
def cross_valid(prfs, mode):
    print('Calculating cross validation...')
max_pre = []
mean_pre = []
max_rec = []
mean_rec =[]
max_f = []
mean_f = []
    # Record the metrics from the 5 splits; np.mean of each list below is then written to a file.
for i in range(5):
max_pre.append(prfs[i][0][0])
max_rec.append(prfs[i][0][1])
max_f.append(prfs[i][0][2])
mean_pre.append(prfs[i][1][0])
mean_rec.append(prfs[i][1][1])
mean_f.append(prfs[i][1][2])
if mode == 'u':
f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/usercf_test/user_cross_valid', 'w')
else :
f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/itemcf_test/item_cross_valid', 'w')
    f.write('precision_max: ' + str(np.mean(max_pre)) + '\n')
    f.write('precision_average: ' + str(np.mean(mean_pre)) + '\n')
    f.write('recall_max: ' + str(np.mean(max_rec)) + '\n')
    f.write('recall_average: ' + str(np.mean(mean_rec)) + '\n')
    f.write('f1_score_max: ' + str(np.mean(max_f)) + '\n')
    f.write('f1_score_mean: ' + str(np.mean(mean_f)))
f.close()
def test(model):
if model == 'usercf' or model == 'u':
prfs = []
for i in range(1, 6):
prfs.append(write_prf(i, 'u'))
cross_valid(prfs, 'u')
    # Test the itemcf model.
elif model == 'itemcf' or model == 'i':
prfs = []
for i in range(1, 6):
prfs.append(write_prf(i, 'i'))
cross_valid(prfs, 'i')
else:
        print("Argument -m <model> error! Testing interrupted.")
# Given a user id, print the list of recommended movie ids (or, given a movie id, the recommended user ids), using get_topn_recom_list.
def recommend(id_, model):
print(model)
if model == 'user':
try:
id_ = int(id_)
except:
            print('Please input a valid user id!')
exit()
if not 1 <= id_ <= USER_NUMBER:
            print('User with id %s' % id_ + ' does not exist!')
exit()
        # Unlike training (whose purpose is tuning parameters such as n and k), recommendation uses all the data, so the cosine similarity matrix over the full dataset is needed.
if not os.path.exists('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/all_user_cor_matrix'):
            print("Getting all users' cos correlation matrix... Please wait a minute.")
            ui_matrix = get_ui_matrix('u.data', mode='u')
cos_correlation = get_cos_correlation(ui_matrix)
with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/all_user_cor_matrix', 'w') as f:
for row in cos_correlation:
row_ = list(map(round, row, [8] * len(row)))
f.write('\t'.join(map(str, row_)) + '\n')
n = 20
k = 100
user_cor_matrix = load_cos_correlation('all_user_cor_matrix', mode='u')
        train_vector_dict = get_vector_dict('u.data', mode='u')
k_ids = choose_k_ids(user_cor_matrix, id_, k)
recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
print('Recommend movies to the user with id %s' % id_ + ':')
print(recom_list)
if model == 'item':
try:
id_ = int(id_)
except:
            print('Please input a valid movie id!')
exit()
        if not 1 <= id_ <= ITEM_NUMBER:
            print('Movie with id %s' % id_ + ' does not exist!')
exit()
        if not os.path.exists('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/all_item_cor_matrix'):
            print("Getting all items' cos correlation matrix... Please wait a minute.")
            ui_matrix = get_ui_matrix('u.data', mode='i')
cos_correlation = get_cos_correlation(ui_matrix)
with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/all_item_cor_matrix', 'w') as f:
for row in cos_correlation:
row_ = list(map(round, row, [8] * len(row)))
f.write('\t'.join(map(str, row_)) + '\n')
n = 10
k = 100
user_cor_matrix = load_cos_correlation('all_item_cor_matrix', mode='i')
train_vector_dict = get_vector_dict('u.data', mode='i')
k_ids = choose_k_ids(user_cor_matrix, id_, k)
recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
print('Recommend users to the movie with id %s' % id_ + ':')
print(recom_list)
def main():
    parser = argparse.ArgumentParser(description='main.py arguments')
parser.add_argument('-t', '--train', action='store_true', default=False)
parser.add_argument('-p', '--test', action='store_true', default=False)
parser.add_argument('-m', '--model', default=False)
parser.add_argument('-u', '--usercf', default=False)
parser.add_argument('-i', '--itemcf', default=False)
args = parser.parse_args()
print(args)
print(type(args))
if args.train:
if args.model:
train(args.model)
else:
train('usercf')
train('itemcf')
if args.test:
if args.model:
test(args.model)
else:
test('usercf')
test('itemcf')
if args.usercf:
recommend(args.usercf, 'user')
if args.itemcf:
recommend(args.itemcf, 'item')
if __name__ == '__main__':
main()
![](https://i-blog.csdnimg.cn/blog_migrate/4cff917cb3e069e4092480386b3f64af.png)
After much struggle it finally worked.
As the code above shows, every file path must be exactly right at every step, or all sorts of problems appear.
That also makes up for the missing comments.
The problems I ran into, as pictured:
1. Obviously a wrong file path
![](https://i-blog.csdnimg.cn/blog_migrate/ff1a1f08502c369df19e4ac098c71269.png)
2. File permission problems:
![](https://i-blog.csdnimg.cn/blog_migrate/403b8115cc518ddb9b7d28377fad5188.png)
3. The file system not finding the file
![](https://i-blog.csdnimg.cn/blog_migrate/f64d3c0563b5c96ce67da914e78ada45.png)
4. A crash compilation: 大数据Python实验——崩溃合集
That is about all. Treat this as light entertainment; for real hands-on trouble, go ask your instructor.