Introduction to Recommender Systems Notes (5): Assignment 2

Latest recommended article published on 2020-10-03 23:59:11

dyc941126

Category column: Recommender System Course Notes

Article link: https://blog.csdn.net/dyc941126/article/details/50121977

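Data Description

Data 1: a $20\times 10$ table in which each row is a document (Doc); a cell is 1 if the document contains the corresponding Topic and 0 otherwise.
Data 2: two users' ratings of the documents; a cell is 1 if the user likes the document and -1 if not, and some ratings are missing.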

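Problem Description

Building a basic user profile

Use the data above to build the most basic profile, without worrying about issues such as keyword weighting. Then answer the following two questions:

Predict the document user 1 likes most, and give its score
Predict the documents user 2 dislikes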

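Accounting for topic weights

Intuitively, if one document covers a single topic and another covers that same topic along with nine others, a user who cares about the topic will prefer the first document. This is the idea behind topic weighting.
Weight each document by the reciprocal of the square root of its number of topics, and redo the exercise above.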
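
As a worked form of this rule (the symbols are notation introduced here for illustration, not taken from the course material), let $x_{d,t}\in\{0,1\}$ indicate whether document $d$ contains topic $t$, and let $n_d$ be the number of topics in $d$. The weighted entry would then be:

$$
w_{d,t} = \frac{x_{d,t}}{\sqrt{n_d}}, \qquad n_d = \sum_{t} x_{d,t}
$$

A document with a single topic keeps weight $1$ on that topic, while a document with ten topics carries only $1/\sqrt{10}\approx 0.316$ on each, so the focused document scores higher for a user who cares about that one topic.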

Accounting for IDF weights

Very common topics should be penalized, and the TF-IDF algorithm offers a convenient way to do so.
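
One way to write this down, assuming the simplified form used by the code later in this post (IDF as the plain reciprocal of the document frequency, rather than the logarithmic form $\log(N/\mathrm{DF}_t)$ found in many textbooks). The symbols are notation introduced here for illustration:

$$
\mathrm{IDF}_t = \frac{1}{\mathrm{DF}_t}, \qquad \mathrm{score}(u, d) = \sum_{t} w_{d,t}\; p_{u,t}\; \mathrm{IDF}_t
$$

Here $\mathrm{DF}_t$ is the number of documents containing topic $t$, $w_{d,t}$ is document $d$'s entry for topic $t$, and $p_{u,t}$ is user $u$'s profile value for that topic. With 20 documents, a topic found in every document gets $1/20 = 0.05$, while a topic found in only two gets $0.5$.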

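Problem Analysis

Clearly, the items here are the individual documents, and each document's vector space representation is the vector of indicators for whether each Topic appears in it. For problem 1, weight these vectors by the user's ratings: 1 if the user likes the document, -1 if not, and 0 if there is no rating. Summing the weighted vectors gives the user's preference profile; the previous post explains this in more detail. To predict, take the cosine of the angle between a document's Topic vector and the user's preference vector (the course assignment asks for the dot product instead).
Problems 2 and 3 build on the vector space model of problem 1 by multiplying in the corresponding weights: a per-document factor for problem 2 and a per-Topic IDF factor for problem 3.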
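
The analysis above can be sketched on a tiny example. The 4-document, 3-topic matrix and the ratings below are made up for illustration, and `build_profile`, `dot`, and `cosine` are hypothetical helper names, not the functions used later in this post. The sketch computes both the dot product (what the assignment asks for) and the cosine, for comparison.

```python
import math

# Toy data (made up for illustration): 4 documents x 3 topics, one row per document.
docs = [
    [1, 0, 1],  # doc0
    [0, 1, 0],  # doc1
    [1, 1, 0],  # doc2
    [0, 0, 1],  # doc3
]
# Ratings: 1 = liked, -1 = disliked, 0 = not rated.
ratings = [1, -1, 0, 1]


def build_profile(docs, ratings):
    """Sum the documents' topic vectors, each weighted by the user's rating."""
    n_topics = len(docs[0])
    return [sum(r * doc[t] for doc, r in zip(docs, ratings)) for t in range(n_topics)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine of the angle between a and b (0.0 if either vector is all zeros)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


profile = build_profile(docs, ratings)
scores = [dot(doc, profile) for doc in docs]
cosines = [round(cosine(doc, profile), 3) for doc in docs]
print(profile)  # [1, -1, 2]
print(scores)   # [3, -1, 0, 2]
print(cosines)  # [0.866, -0.408, 0.0, 0.816]
```

In this toy case both measures rank doc0 first and doc1 last; in general they can disagree, because the cosine also divides by the length of the document vector.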

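Implementation

Data definition and loading

    topics = []       # topic matrix: the source table stored column-wise (one list per Topic)
    topicNames = []   # list of topic names
    userRatings = []  # user rating matrix (one list per user)
    IDF = []          # computed IDF values

    # Load topic matrix data; line is one row of the table
    def loadTopic(self, line):
        loadHeadText = len(self.topics) == 0
        for index in range(1, 11):
            if loadHeadText:  # this is the first row, i.e. the table header
                self.topics.append([])  # create the column for this topic
                self.topicNames.append(line[index])  # save the topic name
            else:  # otherwise this row holds document data
                # append the indicator for topic index-1 to the topic matrix
                self.topics[index - 1].append(int(line[index]))

    # Load user rating data; line is one row of the table
    def loadUserRating(self, line):
        for index in range(14, 16):
            cell = line[index]
            if cell.startswith('User'):  # header cell: create the list for this user
                self.userRatings.append([])
            elif cell == '':  # a missing rating is stored as 0
                self.userRatings[index - 14].append(0)
            else:
                self.userRatings[index - 14].append(int(cell))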
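
For readers without the course spreadsheet, here is a self-contained sketch of the same loading idea on a miniature table. The CSV text, its column positions, and the `load_table` helper are all hypothetical; the real sheet has 10 topic columns and, judging by the code above, keeps the user ratings in columns 14 and 15.

```python
import csv
import io

# Hypothetical miniature of the spreadsheet: a doc name, three topic columns,
# one empty spacer column, then two user-rating columns.
CSV_TEXT = """\
Doc,T1,T2,T3,,User 1,User 2
doc1,1,0,1,,1,-1
doc2,0,1,0,,,1
doc3,1,1,0,,-1,
"""


def load_table(text, topic_cols, user_cols):
    """Return (topic_names, topics, ratings); both matrices are stored column-wise."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    topic_names = [header[c] for c in topic_cols]
    topics = [[int(row[c]) for row in body] for c in topic_cols]
    # An empty cell means "not rated" and is stored as 0.
    ratings = [[int(row[c]) if row[c] else 0 for row in body] for c in user_cols]
    return topic_names, topics, ratings


names, topics, ratings = load_table(CSV_TEXT, range(1, 4), range(5, 7))
print(names)    # ['T1', 'T2', 'T3']
print(topics)   # [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
print(ratings)  # [[1, 0, -1], [-1, 1, 0]]
```

Storing one list per topic, as the post does, makes per-topic sums (profiles, document frequencies) a single pass over one list.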

Problem 1

    # Build the user profile
    def generateUserProfiles(self, userIndex):
        ratings = self.userRatings[userIndex]
        # indices of the documents this user actually rated
        validRatingsIndex = [i for i in range(len(ratings)) if ratings[i] != 0]
        # start with an empty profile vector
        userProfile = []
        # iterate over all topics
        for topic in self.topics:
            total = 0
            # iterate over the rated documents
            for index in validRatingsIndex:
                # skip documents that do not cover this topic
                if topic[index] == 0:
                    continue
                # otherwise add the rating, weighted by the topic entry
                total += ratings[index] * topic[index]
            # the user's affinity for this topic goes into the preference vector
            userProfile.append(total)
        return userProfile

    # Score how much a user likes a document
    def evaluateDocument(self, documentIndex, userProfile):
        result = 0
        # dot product of the document's topic entries and the user preference vector
        for topicKey, profileKey in zip(self.topics, userProfile):
            result += topicKey[documentIndex] * profileKey
        return result

    # Find the document a user likes most
    def cloestMatches(self, userProfile):
        cloestScore = float('-inf')  # start below any real score so the maximum is always found
        cloestIndex = -1
        # iterate over all documents
        for index in range(len(self.topics[0])):
            # score the document with the function above
            evaluateScore = self.evaluateDocument(index, userProfile)
            # keep the largest score
            if evaluateScore > cloestScore:
                cloestScore = evaluateScore
                cloestIndex = index
        return [cloestIndex, cloestScore]

    # Find the documents a user dislikes (those with a negative score)
    def farthestMatches(self, userProfile):
        farthestScore = []
        farthestIndex = []
        for index in range(len(self.topics[0])):
            evaluateScore = self.evaluateDocument(index, userProfile)
            if evaluateScore < 0:
                farthestScore.append(evaluateScore)
                farthestIndex.append(index)
        return [farthestIndex, farthestScore]
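
An alternative to tracking a running maximum is to sort the document indices by score, which gives the full ranking and does not depend on a starting value. The sketch below uses made-up scores, and `rank_documents` is a hypothetical helper, not part of the original class.

```python
def rank_documents(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up scores for five documents; every value is below -1 on purpose.
scores = [-3, -2, -5, -2, -4]

ranking = rank_documents(scores)
best = ranking[0]
disliked = [i for i, s in enumerate(scores) if s < 0]
print(ranking)   # [1, 3, 0, 4, 2]
print(best)      # 1
print(disliked)  # [0, 1, 2, 3, 4]
```

Python's sort is stable even with `reverse=True`, so tied documents keep their original order.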

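Problem 2

    # Rescale the Topic matrix by 1/sqrt(number of topics in each document).
    # Requires `import math` at the top of the module.
    def refineTopicMatrix(self):
        # iterate over all documents
        for docIndex in range(len(self.topics[0])):
            topicOfDoc = 0
            # count the topics this document covers
            for topic in self.topics:
                if topic[docIndex] == 1:
                    topicOfDoc += 1
            if topicOfDoc == 0:
                continue  # no topics: nothing to scale, and this avoids dividing by zero
            # compute the scaling factor
            factor = 1 / math.sqrt(topicOfDoc)
            # rescale this document's entries
            for topicIndex in range(len(self.topics)):
                self.topics[topicIndex][docIndex] *= factor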
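
A quick check of what this scaling does to a single document row: after dividing by the square root of the topic count, every non-empty row has Euclidean length 1. The `weight_row` helper is hypothetical and works on one row at a time rather than on the column-wise matrix used above.

```python
import math


def weight_row(row):
    """Scale a 0/1 topic row by 1/sqrt(number of topics it contains)."""
    n = sum(1 for x in row if x != 0)
    if n == 0:
        return [0.0] * len(row)  # assumption: documents with no topics stay at zero
    factor = 1 / math.sqrt(n)
    return [x * factor for x in row]


focused = weight_row([1, 0, 0, 0])  # one topic
broad = weight_row([1, 1, 1, 1])    # four topics

print(focused)  # [1.0, 0.0, 0.0, 0.0]
print(broad)    # [0.5, 0.5, 0.5, 0.5]
# Both rows now have Euclidean length 1.
print(math.sqrt(sum(x * x for x in broad)))  # 1.0
```

Because the rows have unit length after scaling, a dot product against the user profile differs from the cosine only by the length of the profile vector, which is the same for every document.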

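Problem 3

    # Compute each Topic's IDF as 1 / (number of documents containing it)
    def __calcuIDF__(self):
        for topic in self.topics:
            validTopicIndex = [index for index in range(len(topic)) if topic[index] != 0]
            # a topic that appears in no document gets weight 0 instead of dividing by zero
            self.IDF.append(1.0 / len(validTopicIndex) if validTopicIndex else 0.0)

    # Score a document with IDF weighting
    def evaluateDocumentByIDF(self, documentIndex, userProfile):
        if len(self.IDF) == 0:
            self.__calcuIDF__()
        result = 0
        # dot product with each term additionally multiplied by the topic's IDF
        for topicKey, profileKey, idf in zip(self.topics, userProfile, self.IDF):
            result += topicKey[documentIndex] * profileKey * idf
        return result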
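
To see the effect of the IDF factor, here is a small standalone sketch. The matrix and the profile are made up, the data is stored row-wise for readability, and `score` is a hypothetical helper; IDF is taken as 1/DF, as in the code above.

```python
# Toy data (made up): 4 documents x 3 topics, one row per document.
docs = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
profile = [2, 1, 1]  # hypothetical user profile, one entry per topic

# Document frequency: in how many documents each topic appears.
df = [sum(1 for doc in docs if doc[t] != 0) for t in range(len(docs[0]))]
# Inverse document frequency as 1/DF; a topic that never occurs gets weight 0.
idf = [1.0 / n if n else 0.0 for n in df]


def score(doc, profile, idf):
    """IDF-weighted dot product of a document row and a user profile."""
    return sum(d * p * w for d, p, w in zip(doc, profile, idf))


print(df)   # [4, 1, 2]
print(idf)  # [0.25, 1.0, 0.5]
print([score(doc, profile, idf) for doc in docs])  # [1.0, 1.5, 0.5, 1.0]
```

Without the IDF factor the second document would score 3, the same as the first and fourth; with it, the rare second topic lifts that document to the top even though the user's strongest preference is for the topic every document shares.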