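Movie Recommendation Model in Python

Background:

Python Programming course, China University of Mining and Technology (Beijing).

The textbook is 《Python语言程序设计》 (Python Language Programming), published by China Machine Press (机械工业出版社).

The experiment is described in Chapter 9, Comprehensive Examples, Section 9.3, Movie Recommendation Model, pp. 225-234.

Environment: the big data teaching platform (or Jupyter Notebook).

PS: This is a running log of my own study notes and of discussions with friends. Please do not copy or repost it directly; I would hate to mislead anyone.

If you have any questions, visit 太华's homepage or message me by email. Working hours run from 8 a.m. into the evening.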

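I. First, an introduction to the algorithm (you can also go straight to the book):

UserCF, a recommendation algorithm based on user similarity:

① Rating matrix: the program uses the ratings given by 943 users to 1682 movies, 100,000 ratings in total; every user has rated at least 20 movies. The rating matrix is a matrix $M_{m \times n}$, where m = 943 is the number of users, n = 1682 is the number of movies, and $M_{ij}$ is the rating user i gave movie j. If user i has not watched movie j, $M_{ij}$ is 0. Row i of the rating matrix M is called the behavior vector of user i.

PS: ItemCF, the recommendation algorithm based on item similarity, is analogous to UserCF, except that its rating matrix is $M_{n \times m}$, where n = 1682 is the number of movies and m = 943 is the number of users. Everything else is computed the same way.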
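
To make the rating matrix concrete, here is a minimal sketch on a tiny made-up dataset (3 users, 4 movies) rather than ml-100k. The `ratings` list, the sizes, and the helper name `build_rating_matrix` are assumptions for illustration only; the script further down does the same job in get_ui_matrix.

```python
import numpy as np

# Hypothetical toy data: (user id, movie id, rating); ids start at 1, as in ml-100k.
ratings = [(1, 1, 5), (1, 3, 3), (2, 1, 4), (2, 2, 2), (3, 4, 1)]
n_users, n_items = 3, 4


def build_rating_matrix(ratings, n_users, n_items):
    """Return M where M[i-1, j-1] is user i's rating of movie j, or 0 if unrated."""
    m = np.zeros((n_users, n_items))
    for user_id, item_id, rate in ratings:
        m[user_id - 1, item_id - 1] = rate
    return m


M = build_rating_matrix(ratings, n_users, n_items)
user_1_vector = M[0]   # behavior vector of user 1 (UserCF view)
item_matrix = M.T      # ItemCF view: the transpose, one row per movie
print(M)
```

The transpose on the last lines is all that separates the UserCF and ItemCF views of the same data.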

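② Cosine similarity: for the behavior vectors $M_i$ and $M_j$ of users i and j, $\cos(M_i, M_j) = \dfrac{M_i \cdot M_j}{\lVert M_i \rVert_2 \, \lVert M_j \rVert_2}$.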

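③ Cosine correlation matrix: in the cosine correlation matrix $C_{m \times m}$ (m = 943), element $C_{ij}$ is the cosine similarity between user i and user j. From the way cosine similarity is computed, normalizing each row of the rating matrix M to unit L2 norm gives a matrix H, and C can be computed directly from H:

$C = HH^{T}$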
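
The relation above can be sketched in a few lines of NumPy. This is only an illustration on a made-up 3x3 matrix; the function name `cos_correlation` and the choice to leave all-zero rows untouched are my assumptions. Unlike get_cos_correlation in the script below, it does not modify its input in place.

```python
import numpy as np


def cos_correlation(m):
    """Scale each row of m to unit L2 norm (giving H), then return H @ H.T."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # all-zero rows stay zero instead of dividing by 0
    h = m / norms
    return h @ h.T


# Hypothetical rating matrix: 3 users, 3 movies.
M = np.array([[5, 0, 3],
              [4, 0, 0],
              [0, 2, 0]])
C = cos_correlation(M)
print(np.round(C, 4))
```

Each diagonal entry is 1 (every user is identical to themselves), and users who share no rated movie get 0.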

④ Top-k recommendation:
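
A minimal sketch of the top-k idea as I read it from choose_k_ids and get_topn_recom_list in the script below: keep the k most similar neighbors with positive similarity, then score each movie by the sum of similarity times rating. The data and the function names here are hypothetical. One difference from the script: this version drops the target user before taking the top k, whereas the script takes the top k first and then removes the user.

```python
def top_k_neighbors(similarities, target_id, k):
    """Return {id: similarity} for the k most similar ids, skipping target_id and values <= 0."""
    candidates = [(other, sim) for other, sim in similarities.items()
                  if other != target_id and sim > 0]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return dict(candidates[:k])


def score_items(neighbors, ratings_by_user):
    """Sum similarity * rating over the neighbors for every movie they rated."""
    scores = {}
    for user_id, sim in neighbors.items():
        for item_id, rate in ratings_by_user.get(user_id, {}).items():
            scores[item_id] = scores.get(item_id, 0.0) + sim * rate
    return scores


def top_n(scores, n):
    """Return the n ids with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical data: similarities of user 3 to every user, and a tiny training set.
similarities_of_user_3 = {1: 0.2, 2: 0.5, 3: 1.0, 4: 0.0}
ratings_by_user = {1: {1: 3}, 2: {1: 4, 2: 1}}

neighbors = top_k_neighbors(similarities_of_user_3, target_id=3, k=2)
scores = score_items(neighbors, ratings_by_user)
recommended = top_n(scores, n=1)
print(neighbors, scores, recommended)
```

Movie 1 is rated 3 by user 1 and 4 by user 2, so its score for user 3 is 3*0.2 + 4*0.5 = 2.6, which is the worked example given in the code comments later in this post.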

⑤ Evaluation metrics:
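
The sketch below mirrors how get_prf in the script computes precision, recall, and F1, as far as I can tell from the source: hits are weighted by the test-set rating, and a recommended movie that is not in the test set counts as 1. The function name, the sample data, and the guards against division by zero are my additions and are not taken from the textbook.

```python
def precision_recall_f1(recom_list, test_ratings):
    """Rating-weighted precision, recall, and F1 for one user (or movie)."""
    hit = sum(test_ratings[i] for i in recom_list if i in test_ratings)
    recommended = sum(test_ratings.get(i, 1) for i in recom_list)
    relevant = sum(test_ratings.values())
    precision = hit / recommended if recommended else 0.0
    recall = hit / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: three recommendations, two rated movies in the test set.
recom_list = [1, 2, 3]
test_ratings = {1: 4, 5: 2}
precision, recall, f1 = precision_recall_f1(recom_list, test_ratings)
print(precision, recall, f1)
```

Here the only hit is movie 1 (rating 4), so precision is 4 / (4 + 1 + 1) and recall is 4 / (4 + 2).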

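II. Running the experiment on the big data platform:

# A quick gripe first: the university's big data platform really does leave room for improvement in performance and stability. The three lab sessions took 9 hours in total, and this experiment alone took 7 of them. The experiments themselves are simple, and the textbook even gives the source code, but the platform's virtual machines are, shall we say, impressively managed (just kidding). So as long as you work through it carefully, you will finish it fine.

Steps:

I will walk through the example first and then cover the problems that came up.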

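①: First, set up the Python 3 environment:

See the Library Management System experiment and follow the steps in the book.

The detailed steps are shown in the video "Big Data Python Experiment: How to Configure the Python 3 Environment".

②: Next, check whether the lab environment can open Jupyter Notebook.

If it cannot, try the following:

  1. Open a Linux terminal and run jupyter notebook --allow-root.

  2. Open another Linux terminal and run jupyter lab --ip 0.0.0.0 --port 8888 --allow-root

  3. You may also need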

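Once it opens successfully, it looks like this:

Now you can open a notebook:

Then upload the files:

ml-100k: download from http://files.grouplens.org/datasets/movielens/ml-100k.zip

movie_recom.zip: this archive holds four empty folders that must be deployed: item_matrices, itemcf_test, user_matrices, and usercf_test.

After the upload succeeds, open another Linux terminal and run: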


cd /root/tools/

unzip movie_recom.zip -d /root/tools/

unzip ml-100k.zip -d /root/tools/

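As shown in the screenshot:

Remember: ml-100k must be in the same folder as main.py, because the script opens its data through relative paths such as ./ml-100k/.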
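
If you are unsure whether everything landed in the right place, a small check like the following can help. It is only a sketch: the function name `missing_entries` is made up, and the list of folders is taken from the relative paths used in main.py. The demo runs in a throwaway temporary directory; on the platform you would pass whichever directory actually holds main.py.

```python
import os
import tempfile

# Folders that main.py reads from or writes to, relative to its own directory.
REQUIRED_DIRS = ["ml-100k", "item_matrices", "itemcf_test",
                 "user_matrices", "usercf_test"]


def missing_entries(base_dir):
    """Return the required folders (and main.py) that are absent under base_dir."""
    missing = [d for d in REQUIRED_DIRS
               if not os.path.isdir(os.path.join(base_dir, d))]
    if not os.path.isfile(os.path.join(base_dir, "main.py")):
        missing.append("main.py")
    return missing


# Demo in a temporary directory: everything is missing, then nothing is.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_entries(tmp)
    for d in REQUIRED_DIRS:
        os.makedirs(os.path.join(tmp, d))
    open(os.path.join(tmp, "main.py"), "w").close()
    after = missing_entries(tmp)

print(before)
print(after)
```

An empty list means the layout matches what the script expects.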

The source code follows.

It can be used on the big data platform without modification.

For the comments, see the textbook.


from docopt import docopt
import argparse
import numpy as np
import pandas as pd
import os
USER_NUMBER = 943
ITEM_NUMBER = 1682




def get_vector_dict(filename, mode):
    with open('./ml-100k/' + filename, 'r') as f:
        if mode == 'u':
            user_vector_dict = {}
            for line in f:
                split_line = line.split('\t')
                userid,itemid,rate = list(map(int, split_line[:3]))
                if not userid in user_vector_dict:
                    user_vector_dict[userid] = {itemid: rate}
                else:
                    user_vector_dict[userid][itemid] = rate
            return user_vector_dict 
        else:
            item_vector_dict = {}
            for line in f:
                split_line = line.split('\t')
                userid,itemid,rate = list(map(int, split_line[:3]))
                if not itemid in item_vector_dict:
                    item_vector_dict[itemid] = {userid: rate}
                else:
                    item_vector_dict[itemid][userid] = rate
            return item_vector_dict

def get_ui_matrix(filename, mode):
    print('Getting rating matrix...')
    ui_matrix = np.zeros((USER_NUMBER, ITEM_NUMBER))
    vector_dict = get_vector_dict(filename,'u')
    for userid in vector_dict:
        ui_vector = vector_dict[userid]
        for itemid in ui_vector:
            ui_matrix[userid - 1][itemid - 1] = ui_vector[itemid]
    if mode == 'u':
        return ui_matrix
    else:
        return ui_matrix.transpose()

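# Compute the cosine similarity matrix between users or movies (depending on the ui_matrix passed in).
# Take the L2 norm of each row of the rating matrix M and scale each row to a unit vector; M(M^T) is then the cosine similarity matrix.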
def get_cos_correlation(ui_matrix):
    print('Getting cos correlation matrix...')
    row_norm2 = list(map(np.linalg.norm, ui_matrix))
    for i in range(len(row_norm2)):
        if not row_norm2[i] == 0:
            ui_matrix[i] /= row_norm2[i]
    cos_correlation = np.dot(ui_matrix, ui_matrix.transpose())
    return cos_correlation

def train(model):
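    # Train the usercf model (compute the cosine similarity matrix between users).
    # Read each of the 5 pre-split training sets, train on it, and write the result to a file.
    if model == 'usercf' or model == 'u':
        for i in range(5):
            print('Training set %s' % (i + 1) + ' of model usercf...')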
            ui_matrix = get_ui_matrix('u' + str(i + 1) + '.base', mode='u')
            user_cos_correlation = get_cos_correlation(ui_matrix)
            with open('./user_matrices/user_cor_matrix'+str(i+1),'w') as f:
                for row in user_cos_correlation:
                    row_ = list(map(round, row, [8] * len(row)))
                    f.write('\t'.join(map(str,row_)) + '\n')

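    # Train the itemcf model (compute the cosine similarity matrix between movies).
    # Read each of the 5 pre-split training sets, train on it, and write the result to a file.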
    elif model == 'itemcf' or model == 'i':
        for i in range(5):
            print('Training set %s' % (i + 1) + ' of model itemcf...')
            ui_matrix = get_ui_matrix('u' + str(i + 1) + '.base', mode='i')
            item_cos_correlation = get_cos_correlation(ui_matrix)
            with open('./item_matrices/item_cor_matrix'+str(i+1),'w') as f:
                for row in item_cos_correlation:
                    row_ = list(map(round, row, [8]*len(row)))
                    f.write('\t'.join(map(str,row_)) + '\n')
    else:
        print("Argument -m <model> error! Training interrupt.")
        exit()

# Read the file holding the user (or movie) cosine similarity matrix produced by training.
def load_cos_correlation(filename, mode):
    print('Getting cos correlation matrix...')
    cos_correlation = []
    if mode == 'u':
        f = open('./user_matrices/' + filename, 'r')
    else:
        f = open('./item_matrices/' + filename, 'r')
    for line in f:
        cos_correlation.append(list(map(float, line.split('\t'))))
    f.close()
    return np.array(cos_correlation)
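# Choose the k users (or movies) most similar to user id_ (or movie id_), i.e. those with the largest nonzero cosine similarity.
# Whether these are users or movies depends on the cos_correlation passed in.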

def choose_k_ids(cos_correlation, id_, k):
    id_cor = cos_correlation[id_ - 1]
    # Convert to a pd.Series so that the index (id) and cosine similarity of the k largest users (movies) are easy to extract.
    id_cor_series = pd.Series(id_cor)
    k_ids_series = id_cor_series.sort_values(ascending=False)[:k]
    k_ids_series = k_ids_series[k_ids_series > 0]
    k_ids_series.index += 1
    k_ids_dict = dict(k_ids_series)
    # Remove the user (movie) itself; the default avoids a KeyError if it is not among the top k.
    if k_ids_dict:
        k_ids_dict.pop(id_, None)
    return k_ids_dict
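# Choose the top n users or movies most worth recommending; train_vector_dict comes from get_vector_dict.
# Use the training-set ratings as weights in a weighted sum of the cosine similarities, and use that sum as the recommendation score (recommendations are ranked by it).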

def get_topn_recom_list(n,k_ids_dict,train_vector_dict):
    recom_dict={}
    for id1 in k_ids_dict:
        if id1 in train_vector_dict:
            id1_vector_dict=train_vector_dict[id1]
            for id2 in id1_vector_dict:
                if not id2 in recom_dict:
                    recom_dict[id2]=k_ids_dict[id1]*id1_vector_dict[id2]
                else:
                    recom_dict[id2]+=k_ids_dict[id1]*id1_vector_dict[id2]
    recom_list=[[itemid,recom_dict[itemid]] for itemid in recom_dict]
    recom_list=sorted(recom_list,key=lambda x:x[1],reverse=True)
    recom_list=recom_list[:n]
    recom_list=[i[0] for i in recom_list]
    return recom_list


def get_prf(i,mode):
    k=100
    if mode =='u':
        print('Testing Usercf.Getting train/test-set%s'%i+"'s evaluating scores...Please wait a minute.")
        n=20
        matrix_filename='user_cor_matrix'+str(i)
    else:
        print('Testing Itemcf.Getting train/test-set%s'%i+"'s evaluating scores...Please wait a minute.")
        n=10
        matrix_filename='item_cor_matrix'+str(i)
    train_filename='u'+str(i)+'.base'
    test_filename='u'+str(i)+'.test'
    train_vector_dict=get_vector_dict(train_filename,mode)
    test_vector_dict=get_vector_dict(test_filename,mode)
    cos_correlation=load_cos_correlation(matrix_filename,mode)
    
    can_recom_ids=sorted(list(set(train_vector_dict)&set(test_vector_dict)))
    pres=[]
    recs=[]
    fs=[]
    
    for id_ in can_recom_ids:
        k_ids_dict=choose_k_ids(cos_correlation,id_,k)
        recom_list=get_topn_recom_list(n,k_ids_dict,train_vector_dict)
        have_dict=test_vector_dict[id_]
        int_num=0
        recom_num=0
        have_num=0
        for id_ in recom_list:
            if id_ in have_dict:
                int_num+=have_dict[id_]
                recom_num+=have_dict[id_]
            else:
                recom_num+=1
        for id_ in have_dict:
            have_num+=have_dict[id_]
        pre=int_num/recom_num
        rec=int_num/have_num
        pres.append(pre)
        recs.append(rec)
        if pre==0 and rec==0:
            fs.append(0)
        else:
            fs.append(2*pre*rec/(pre+rec))
    return pres,recs,fs


def write_prf(i, mode):
    prf = get_prf(i, mode)
    max_prf_ = list(map(max,prf))
    average_prf_ = list(map(np.mean, prf))
    max_prf = list(map(str, max_prf_))
    average_prf = list(map(str,average_prf_))
    if mode == 'u':
        f = open('./usercf_test/user_test' + str(i) + '_conclusion', 'w')
    else:
        f = open('./itemcf_test/item_test' + str(i) + '_conclusion', 'w')
    f.write('precision_max: ' + max_prf[0] + '\n')
    f.write('precision_average: ' + average_prf[0] + '\n')
    f.write('recall_max: ' + max_prf[1] + '\n')
    f.write('recall_average: ' + average_prf[1] + '\n')
    f.write('f1_score_max: ' + max_prf[2] + '\n')
    f.write('f1_score_mean: ' + average_prf[2])
    f.close()
    return max_prf_, average_prf_

# Cross-validation
def cross_valid(prfs,mode):
    print('Calculating cross validation...')
    max_pre = []
    mean_pre = []
    max_rec = []
    mean_rec = []
    max_f = []
    mean_f = []
    
    for i in range(5):
        max_pre.append(prfs[i][0][0])
        max_rec.append(prfs[i][0][1])
        max_f.append(prfs[i][0][2])
        mean_pre.append(prfs[i][1][0])
        mean_rec.append(prfs[i][1][1])
        mean_f.append(prfs[i][1][2])
    if mode == 'u':
        f = open('./usercf_test/user_cross_valid', 'w')
    else:
        f = open('./itemcf_test/item_cross_valid', 'w')
    f.write('precision_max: ' + str(np.mean(max_pre)) + '\n')
    f.write('precision_average: ' + str(np.mean(mean_pre)) + '\n')
    f.write('recall_max: ' + str(np.mean(max_rec)) + '\n')
    f.write('recall_average: ' + str(np.mean(mean_rec)) + '\n')
    f.write('f1_score_max: ' + str(np.mean(max_f)) + '\n')
    f.write('f1_score_mean: ' + str(np.mean(mean_f)))
    f.close()

def test(model):
    # Test the usercf model
    if model == 'usercf' or model == 'u':
        prfs = []
        for i in range(1,6):
            prfs.append(write_prf(i, 'u'))
        cross_valid(prfs, 'u')

    # Test the itemcf model
    elif model == 'itemcf' or model == 'i':
        prfs =[]
        for i in range(1,6):
            prfs.append(write_prf(i, 'i'))
        cross_valid(prfs, 'i')
    else:
        print("Argument -m <model> error! Testing interrupt.")

def recommend(id_, model):
    print(model)
    if model == 'user':
        try:
            id_ = int(id_)
        except ValueError:
            print('Please input valid user id!')
            exit()
        if not 1 <= id_ <= USER_NUMBER:
            print('User with id_ %s'% id_ +' does not exist!')
            exit()

        if not os.path.exists('./user_matrices/all_user_cor_matrix'):
            print("Getting all users' cos correlation matrix...please wait a minute.")
            ui_matrix = get_ui_matrix('u.data',mode='u')
            cos_correlation=get_cos_correlation(ui_matrix)
            with open('./user_matrices/all_user_cor_matrix','w') as f:
                for row in cos_correlation:
                    row_=list(map(round, row ,[8] * len(row)))
                    f.write('\t'.join(map(str,row_))+'\n')
        n=20
        k=100
        user_cor_matrix=load_cos_correlation('all_user_cor_matrix', mode='u')
        train_vector_dict=get_vector_dict('u.data', mode='u')
        k_ids=choose_k_ids(user_cor_matrix, id_, k)
        recom_list=get_topn_recom_list(n,k_ids,train_vector_dict)
        print('Recommend movies to the user with id %s' % id_+':')
        print(recom_list)

    if model == 'item':
        try:
            id_=int(id_)
        except ValueError:
            print('Please input valid movie id!')
            exit()
        if not 1 <= id_ <= ITEM_NUMBER:
            print('Movie with id %s' % id_ + ' does not exist!')
            exit()
        if not os.path.exists('./item_matrices/all_item_cor_matrix'):
            print("Getting all items' cos correlation matrix...Please wait a minute.")
            ui_matrix = get_ui_matrix('u.data',mode='i')
            cos_correlation=get_cos_correlation(ui_matrix)
            with open('./item_matrices/all_item_cor_matrix','w') as f:
                for row in cos_correlation:
                    row_= list(map(round, row, [8]*len(row)))
                    f.write('\t'.join(map(str, row_)) + '\n')
        n =10
        k=100
        user_cor_matrix = load_cos_correlation('all_item_cor_matrix', mode='i')
        train_vector_dict = get_vector_dict('u.data', mode='i')
        k_ids = choose_k_ids(user_cor_matrix, id_, k)
        recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
        print('Recommend users to the movie with id %s' % id_+ ':')
        print(recom_list)

def main():
    parser = argparse.ArgumentParser('Arguments: main.py')
    parser.add_argument('-t','--train', action='store_true', default=False)
    parser.add_argument('-p','--test', action='store_true', default=False)
    parser.add_argument('-m','--model', default=False)
    parser.add_argument('-u','--usercf', default=False)
    parser.add_argument('-i','--itemcf', default=False)
    args = parser.parse_args()
    print(args)
    print(type(args))
    if args.train:
        if args.model:
            train(args.model)
        else:
            train('usercf')
            train('itemcf')
    if args.test:
        if args.model:
            test(args.model)
        else:
            test('usercf')
            test('itemcf')
    if args.usercf:
        recommend(args.usercf, 'user')
    if args.itemcf:
        recommend(args.itemcf,'item')

if __name__=='__main__':
    main()

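Once all the files are in place, the layout looks like this:

Now the experiment proper can begin.

1. First, run:


pip install docopt==0.6.2

to install the docopt module.

PS: If the installation fails,

open a Linux terminal and update pip: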


wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate
tar -xf pip-1.5.4.tar.gz 
cd pip-1.5.4 
python setup.py install

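2. Then run:


%run main.py -h

to check that docopt was installed: main.py imports docopt on its first line, so the help text only appears if that import succeeds.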
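
If you would rather check the imports directly from a notebook cell, something like the following should work. It is a sketch using only the standard library; the helper name `missing_modules` is made up, and the module list is simply what main.py imports.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Third-party modules that main.py imports.
print(missing_modules(["docopt", "numpy", "pandas"]))
```

An empty list means all three can be imported; any name that is printed still needs a pip install.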

3. Then train the model:


%run main.py -t

4. Then test the model:


%run main.py -p

5. Finally, request recommendations, here for the user with id 3 and then for the movie with id 3:


%run main.py -u 3

%run main.py -i 3

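That is the end of the experiment.

(I had planned to record a complete video but was too lazy, so please make do with this. If you have questions, email me, visit 太华's homepage, leave a comment under the post, or send a private message on CSDN.)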

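III. Running the experiment in Jupyter on your own computer

The steps are exactly the same as above.

The code, however, needs more care.

For the unzip step, just extract everything into one folder.

Then edit the file paths in the code directly.

For example: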


#!/usr/bin/env python
# coding: utf-8


from docopt import docopt
import argparse
import numpy as np
import pandas as pd
import os
USER_NUMBER = 943
ITEM_NUMBER = 1682

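# Read a data file and return a dictionary.
# If mode is 'u', return {user1: {item1: rate1, item2: rate2, ...}, user2: {...}, ...}
# If mode is not 'u', return the analogous dictionary {item1: {user1: rate1, ...}, item2: {...}, ...}
# Unlike the matrix M below, this keeps only the nonzero values to save space, and makes it easy to read off the nonzero values of each row.
def get_vector_dict(filename, mode):
    with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型./ml-100k/' + filename, 'r') as f: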
        if mode == 'u':
            user_vector_dict ={}
            for line in f:
                split_line = line.split('\t')
                userid, itemid, rate = list(map(int, split_line[:3]))
                if not userid in user_vector_dict:
                    user_vector_dict[userid] = {itemid: rate}
                else:
                    user_vector_dict[userid][itemid] = rate
            return user_vector_dict
        else:
            item_vector_dict = {}
            for line in f:
                split_line = line.split('\t')
                userid, itemid, rate = list(map(int, split_line[:3]))
                if not itemid in item_vector_dict:
                    item_vector_dict[itemid] = {userid: rate}
                else:
                    item_vector_dict[itemid][userid] = rate
            return item_vector_dict
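# Return the rating matrix M. If mode is 'u', row i, column j of M is the rating given by the user with userid i
# to the movie with itemid j, or 0 if no such rating exists.
# If mode is not 'u', return the transpose of that matrix.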
def get_ui_matrix(filename, mode):
    print('Getting rating matrix...')
    ui_matrix = np.zeros((USER_NUMBER, ITEM_NUMBER))
    vector_dict = get_vector_dict(filename, 'u')
    for userid in vector_dict:
        ui_vector = vector_dict[userid]
        for itemid in ui_vector:
            ui_matrix[userid - 1][itemid - 1] = ui_vector[itemid]
    if mode == 'u':
        return ui_matrix
    else:
        return ui_matrix.transpose()
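# Compute the cosine similarity matrix between users or movies (depending on the ui_matrix passed in).
# Scale each row of M to a unit vector by its L2 norm; M(M^T) is then the cosine similarity matrix.
def get_cos_correlation(ui_matrix):
    print('Getting cos correlation matrix...')
    row_norm2 = list(map(np.linalg.norm, ui_matrix))
    for i in range(len(row_norm2)):
        if not row_norm2[i] == 0:
            ui_matrix[i] /= row_norm2[i]
    # The product must be taken once, after every row has been normalized.
    cos_correlation = np.dot(ui_matrix, ui_matrix.transpose())
    return cos_correlation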
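def train(model):
    # Train the usercf model (compute the cosine similarity matrix between users).
    # Read each of the 5 pre-split training sets, train on it, and write the result to a file.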
    if model == 'usercf' or model == 'u':
        for i in range(5):
            print('Training set%s' % (i+1) + ' of model usercf...')
            ui_matrix = get_ui_matrix('u'+ str(i + 1) + '.base', mode='u')
            user_cos_correlation = get_cos_correlation(ui_matrix)
            with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/user_cor_matrix'+str(i+1), 'w') as f:
                for row in user_cos_correlation:
                    row = list(map(round, row, [8] * len(row)))
                    f.write('\t'.join(map(str,row )) + '\n')
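    # Train the itemcf model (compute the cosine similarity matrix between movies).
    # Read each of the 5 pre-split training sets, train on it, and write the result to a file.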
    elif model == 'itemcf' or model == 'i':
        for i in range(5):
            print('Training set%s' % (i + 1) + ' of model itemcf...')
            ui_matrix = get_ui_matrix('u'+ str(i + 1) + '.base', mode='i')
            item_cos_correlation = get_cos_correlation(ui_matrix)
            with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/item_cor_matrix'+str(i+1), 'w') as f:
                for row in item_cos_correlation:
                    row_ = list(map(round, row, [8]*len(row)))
                    f.write('\t'.join(map(str, row_)) + '\n')
    else:
        print("Argument -m <model> error! Training interrupt.")
        exit()
# Read the file holding the user (or movie) cosine similarity matrix produced by training.
def load_cos_correlation(filename, mode):
    print('Getting cos correlation matrix...')
    cos_correlation = []
    if mode == 'u':
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/' + filename, 'r')
    else:
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/' + filename, 'r')
    for line in f:
        cos_correlation.append(list(map(float, line.split('\t'))))
    f.close()
    return np.array(cos_correlation)
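# Choose the k users (or movies) most similar to the given user id (or movie id), i.e. those with the largest nonzero cosine similarity.
# Whether these are users or movies depends on the cos_correlation passed in.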
def choose_k_ids(cos_correlation, id_, k):
    id_cor = cos_correlation[id_ - 1]
    id_cor_series = pd.Series(id_cor)
    k_ids_series = id_cor_series.sort_values(ascending=False)[:k]
    k_ids_series = k_ids_series[k_ids_series > 0]
    k_ids_series.index += 1
    k_ids_dict = dict(k_ids_series)
    # Remove the user (movie) itself; the default avoids a KeyError if it is not among the top k.
    if k_ids_dict:
        k_ids_dict.pop(id_, None)
    return k_ids_dict
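# Choose the top n users or movies most worth recommending; train_vector_dict comes from get_vector_dict.
# Use the training-set ratings as weights in a weighted sum of the cosine similarities, and use that sum as the recommendation score (recommendations are ranked by it).
# For example, movie 1 is rated 3 and 4 by user 1 and user 2,
# and user 3's cosine similarities with user 1 and user 2 are 0.2 and 0.5, so the score for recommending movie 1 to user 3 is 3*0.2+4*0.5=2.6.
# Recommending users to a movie works the same way.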
def get_topn_recom_list(n, k_ids_dict, train_vector_dict):
    recom_dict = {}
    for id1 in k_ids_dict:
        if id1 in train_vector_dict:
            id1_vector_dict = train_vector_dict[id1]
            for id2 in id1_vector_dict:
                if not id2 in recom_dict:
                    recom_dict[id2] = k_ids_dict[id1] * id1_vector_dict[id2]
                else:
                    recom_dict[id2] += k_ids_dict[id1] * id1_vector_dict[id2]
    recom_list = [[itemid, recom_dict[itemid]] for itemid in recom_dict]
    recom_list = sorted(recom_list, key=lambda x: x[1], reverse=True)
    recom_list = recom_list[:n]
    recom_list = [i[0] for i in recom_list]
    return recom_list
def get_prf(i, mode): 
    k=100
    if mode == 'u':
        print('Testing Usercf.Getting train/test-set%s' % i + "'s evaluating scores..Please wait a minute.")
        n = 20
        matrix_filename = 'user_cor_matrix' + str(i)
    else:
        print('Testing Itemcf.Getting train/test set%s' % i +"'s evaluating scores...Please wait a minute.")
        n = 10
        matrix_filename = 'item_cor_matrix' + str(i)
    train_filename = 'u' + str(i)+'.base'
    test_filename = 'u' + str(i) + '.test'
    train_vector_dict = get_vector_dict(train_filename, mode)
    test_vector_dict = get_vector_dict(test_filename, mode)
    cos_correlation = load_cos_correlation(matrix_filename, mode)
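    # Because of how the dataset is split, only ids present in both the training set and the test set can be used for recommendation, hence the intersection.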
    can_recom_ids = sorted(list(set(train_vector_dict) & set(test_vector_dict)))
    pres = []
    recs = []
    fs = []
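    # For each user (movie) that can receive recommendations, compute the recommendation list, then compute precision and recall from the recommendation list and the test list.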
    for id_ in can_recom_ids:
        k_ids_dict = choose_k_ids(cos_correlation, id_, k)
        recom_list = get_topn_recom_list(n, k_ids_dict, train_vector_dict)
        have_dict = test_vector_dict[id_]
        int_num = 0
        recom_num = 0
        have_num = 0
        for id_ in recom_list:
            if id_ in have_dict:
                int_num += have_dict[id_]
                recom_num += have_dict[id_]
            else:
                recom_num += 1
        for id_ in have_dict:
            have_num += have_dict[id_]
        pre = int_num / recom_num
        rec = int_num / have_num
        pres.append(pre)
        recs.append(rec)
        if pre == 0 and rec == 0:
            fs.append(0)
        else:
            fs.append(2 * pre * rec / (pre + rec))
    return pres, recs, fs
def write_prf(i, mode):
    prf = get_prf(i, mode)
    max_prf_ = list(map(max, prf))
    average_prf_ = list(map(np.mean, prf))
    max_prf = list(map(str, max_prf_))
    average_prf = list(map(str, average_prf_))
    if mode == 'u':
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/usercf_test/user_test' + str(i) + '_conclusion', 'w')
    else:
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/itemcf_test/item_test' + str(i) + '_conclusion', 'w')
    f.write('precision_max: ' + max_prf[0] + '\n')
    f.write('precision_average: ' + average_prf[0] + '\n')
    f.write('recall_max: ' + max_prf[1] + '\n')
    f.write('recall_average: ' + average_prf[1] + '\n')
    f.write('f1_score_max: ' + max_prf[2] + '\n')
    f.write('f1_score_mean: ' + average_prf[2])
    f.close()
    # Return the numeric values (not the strings) so that cross_valid can average them.
    return max_prf_, average_prf_

# Compute every metric on each of the m data splits and average them to evaluate the model as a whole; this is called cross-validation.
def cross_valid(prfs, mode):
    print('calculating cross validation...')
    max_pre = []
    mean_pre = []
    max_rec = []
    mean_rec =[]
    max_f = []
    mean_f = []
    # Record each metric from the 5 splits; np.mean then averages each list below and the results are written to a file.
    for i in range(5):
        max_pre.append(prfs[i][0][0])
        max_rec.append(prfs[i][0][1])
        max_f.append(prfs[i][0][2])
        mean_pre.append(prfs[i][1][0])
        mean_rec.append(prfs[i][1][1])
        mean_f.append(prfs[i][1][2])
    if mode == 'u':
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/usercf_test/user_cross_valid', 'w')
    else :
        f = open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/itemcf_test/item_cross_valid', 'w')
    f.write('precision_max: ' + str(np.mean(max_pre)) + '\n')
    f.write('precision_average: ' + str(np.mean(mean_pre)) + '\n')
    f.write('recall_max: ' + str(np.mean(max_rec)) + '\n')
    f.write('recall_average: ' + str(np.mean(mean_rec)) + '\n')
    f.write('f1_score_max: ' + str(np.mean(max_f)) + '\n')
    f.write('f1_score_mean: ' + str(np.mean(mean_f)))
    f.close()

def test(model): 
    if model == 'usercf' or model == 'u':
        prfs = []
        for i in range(1, 6):
            prfs.append(write_prf(i, 'u'))
        cross_valid(prfs, 'u')
    # Test the itemcf model
    elif model == 'itemcf' or model == 'i':
        prfs = []
        for i in range(1, 6):
            prfs.append(write_prf(i, 'i'))
        cross_valid(prfs, 'i')
    else:
        print("Argument -m <model> error! Testing interrupt.")

# Given a user id, print the list of recommended movie ids, or given a movie id, print the list of recommended user ids; both use get_topn_recom_list.
def recommend(id_, model):
    print(model)
    if model == 'user':
        try:
            id_ = int(id_)
        except ValueError:
            print('Please input valid user id!')
            exit()
        if not 1 <= id_ <= USER_NUMBER:
            print('User with id %s' % id_ + ' does not exist!')
            exit()
        # Unlike training (whose purpose is to tune parameters such as n and k), recommendation uses all the data, so the cosine similarity matrix for the full dataset is needed.
        if not os.path.exists('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/all_user_cor_matrix'):
            print("Getting all users' cos correlation matrix... Please wait a minute.")
            ui_matrix = get_ui_matrix('u.data', mode='u')
            cos_correlation = get_cos_correlation(ui_matrix)
            with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/user_matrices/all_user_cor_matrix', 'w') as f:
                for row in cos_correlation:
                    row_ = list(map(round, row, [8] * len(row)))
                    f.write('\t'.join(map(str, row_)) + '\n')
        n = 20
        k = 100
        user_cor_matrix = load_cos_correlation('all_user_cor_matrix', mode='u')
        train_vector_dict = get_vector_dict('u.data', mode='u')
        k_ids = choose_k_ids(user_cor_matrix, id_, k)
        recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
        print('Recommend movies to the user with id %s' % id_ + ':')
        print(recom_list)

    if model == 'item':
        try:
            id_ = int(id_)
        except ValueError:
            print('Please input valid movie id!')
            exit()
        if not 1 <= id_ <= ITEM_NUMBER:
            print('Movie with id %s' % id_ + ' does not exist!')
            exit()
        if not os.path.exists('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/all_item_cor_matrix'):
            print("Getting all items' cos correlation matrix... Please wait a minute.")
            ui_matrix = get_ui_matrix('u.data', mode='i')
            cos_correlation = get_cos_correlation(ui_matrix)
            with open('C:/Users/安全工程4班王鑫/Desktop/ACM练习题/王鑫--电影推荐模型/movie_recom/item_matrices/all_item_cor_matrix', 'w') as f:
                for row in cos_correlation:
                    row_ = list(map(round, row, [8] * len(row)))
                    f.write('\t'.join(map(str, row_)) + '\n')
        n = 10
        k = 100
        user_cor_matrix = load_cos_correlation('all_item_cor_matrix', mode='i')
        train_vector_dict = get_vector_dict('u.data', mode='i')
        k_ids = choose_k_ids(user_cor_matrix, id_, k)
        recom_list = get_topn_recom_list(n, k_ids, train_vector_dict)
        print('Recommend users to the movie with id %s' % id_ + ':')
        print(recom_list)


def main():
    parser = argparse.ArgumentParser('Arguments: main.py')
    parser.add_argument('-t', '--train', action='store_true', default=False)
    parser.add_argument('-p', '--test', action='store_true', default=False)
    parser.add_argument('-m', '--model', default=False)
    parser.add_argument('-u', '--usercf', default=False)
    parser.add_argument('-i', '--itemcf', default=False)
    args = parser.parse_args()
    print(args)
    print(type(args))
    if args.train:
        if args.model:
            train(args.model)
        else:
            train('usercf')
            train('itemcf')
    if args.test:
        if args.model:
            test(args.model)
        else:
            test('usercf')
            test('itemcf')
    if args.usercf:
        recommend(args.usercf, 'user')
    if args.itemcf:
        recommend(args.itemcf, 'item')


if __name__ == '__main__':
    main()

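In the end, after a great deal of trouble, it worked.

As the code above shows, the file location (path) used at every step must be exactly right; otherwise all kinds of problems appear.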
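
One way to make the paths less fragile is to build them from a single base directory instead of retyping the full path in every open call. The sketch below is my own suggestion, not what the code above does: `BASE_DIR`, `data_path`, `matrix_path`, and `describe` are hypothetical names, and it assumes the notebook runs from the folder that holds ml-100k and the matrix folders.

```python
import os

# Assumption: the current working directory is the project folder.
BASE_DIR = os.getcwd()


def data_path(filename, base_dir=BASE_DIR):
    """Path of a file inside the ml-100k folder."""
    return os.path.join(base_dir, "ml-100k", filename)


def matrix_path(filename, mode, base_dir=BASE_DIR):
    """Path of a similarity matrix file: user_matrices for mode 'u', else item_matrices."""
    folder = "user_matrices" if mode == "u" else "item_matrices"
    return os.path.join(base_dir, folder, filename)


def describe(path):
    """Return a short diagnosis of a path: 'missing', 'not readable', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "not readable"
    return "ok"


print(data_path("u1.base"), describe(data_path("u1.base")))
```

Calling describe on a path before opening it separates a wrong location from a permission problem, which were two of the errors I kept hitting.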

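This version also fills in the comments.

The problems I ran into (screenshots):

1. The file location is obviously wrong.

2. File permission problems:

3. The file system cannot find the file.

4. Crash compilation: the video "Big Data Python Experiment: Crash Compilation".

That is about all. Treat it as a bit of fun; for real hands-on work you should still ask your teacher.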
