Implementing Movie Recommendations in Python with Collaborative Filtering

Latest recommended article published on 2024-08-15 07:30:00

aptxaaaa

Category: Python and Machine Learning. Tags: python, collaborative filtering, Euclidean distance, Pearson correlation, recommendation algorithm

Article link: https://blog.csdn.net/weixin_37325825/article/details/72952744


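Collaborative filtering searches a large group of people for a small set whose tastes are close to ours, then combines that set's preferences into a list of recommendations.
This article implements both user-based and item-based collaborative filtering for movie recommendation in Python 3.5. It first builds a dictionary of people, items, and ratings, then uses two similarity measures (Euclidean distance and the Pearson correlation) to recommend films and critics from both the user-based and the item-based angle, and finally offers advice on choosing between the two approaches.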

Collecting preferences in a dictionary

Create a file named recommendations.py and add the following code to build a dataset:

# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 
 'You, Me and Dupree': 3.5}, 
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0, 
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0}, 
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

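The dictionary above records each critic's ratings for a number of films, on a scale from 1 to 5.
This layout makes the data easy to query and modify, for example to look up one person's rating of one film: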

>>> from recommendations import critics
>>> critics['Lisa Rose']['Snakes on a Plane']
3.5

Finding similar users

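Finding similar users means determining how similar people's tastes are. Each person is compared with every other person, and a similarity score is computed for each pair. Two measures are used here: Euclidean distance and the Pearson correlation.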

Euclidean distance score

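Euclidean distance is the distance between two points in a multidimensional space, and it serves here as a measure of how alike two people are: the smaller the distance, the higher the similarity.
Euclidean distance formula: $dist(X,Y) = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}$
Implementation: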

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Collect the items that both people have rated
  shared_items=[item for item in prefs[person1] if item in prefs[person2]]

  # If they have no ratings in common, return 0
  if len(shared_items)==0: return 0

  # Add up the squares of all the differences.
  # Note: no square root is taken, so the score uses the squared distance.
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in shared_items])

  # Invert so that a smaller distance gives a higher score
  return 1/(1+sum_of_squares)

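The function returns a value between 0 and 1. It inverts the distance as 1/(1+d), so identical ratings score 1 and the score falls toward 0 as the distance grows; 0 is also returned when the two people have no rated items in common. Note that the code uses the sum of squared differences directly and does not take the square root shown in the formula. Call the function with two names to compute their similarity score: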

>>> import recommendations
>>> recommendations.sim_distance(recommendations.critics,'Lisa Rose','Gene Seymour')
0.14814814814814814
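
As a quick check on the score above, the following sketch recomputes it by hand for Lisa Rose and Gene Seymour and compares it with a variant that takes the square root first. The helper name squared_distance and the variable names are illustrative assumptions and are not part of recommendations.py.

```python
from math import sqrt

# Ratings copied from the critics dictionary (two critics only).
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

def squared_distance(a, b):
    """Sum of squared rating differences over the films both people rated."""
    shared = [film for film in a if film in b]
    return sum((a[film] - b[film]) ** 2 for film in shared)

d2 = squared_distance(lisa, gene)
sim_squared = 1 / (1 + d2)       # what sim_distance computes
sim_rooted = 1 / (1 + sqrt(d2))  # variant based on the true Euclidean distance

print(d2)                     # 5.75
print(round(sim_squared, 4))  # 0.1481
print(round(sim_rooted, 4))   # 0.2943
```

Both variants rank pairs of people in the same order, since each decreases as the distance grows; only the absolute scores differ.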

Pearson correlation score

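The Pearson correlation coefficient measures how well two sets of data fit a straight line. It corrects for "grade inflation", so it tends to give better results when the data is not well normalized (for example, when a critic's ratings consistently sit far from the average). The larger the coefficient, the higher the similarity.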

Pearson correlation formula: $r(X,Y) = \dfrac{\sum XY - \dfrac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{N}\right)}}$
Implementation:

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]: 
    if item in prefs[p2]: si[item]=1

  # if there are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])   

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

The function returns a value between -1 and 1. Call it with two names to compute their similarity score:

>>> import recommendations
>>> recommendations.sim_pearson(recommendations.critics,'Lisa Rose','Gene Seymour')
0.39605901719066977
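
With a similarity function in hand, ranking the other critics against one person is a small step. The sketch below assumes a helper named top_matches, which is hypothetical and not defined in the article. To run on its own it rebuilds the critics data in a compact form (FILMS and ROWS are illustrative names) and restates sim_pearson in condensed form.

```python
from math import sqrt

# Compact rebuild of the critics dictionary; None marks an unrated film.
FILMS = ['Lady in the Water', 'Snakes on a Plane', 'Just My Luck',
         'Superman Returns', 'You, Me and Dupree', 'The Night Listener']
ROWS = {
    'Lisa Rose':        [2.5, 3.5, 3.0, 3.5, 2.5, 3.0],
    'Gene Seymour':     [3.0, 3.5, 1.5, 5.0, 3.5, 3.0],
    'Michael Phillips': [2.5, 3.0, None, 3.5, None, 4.0],
    'Claudia Puig':     [None, 3.5, 3.0, 4.0, 2.5, 4.5],
    'Mick LaSalle':     [3.0, 4.0, 2.0, 3.0, 2.0, 3.0],
    'Jack Matthews':    [3.0, 4.0, None, 5.0, 3.5, 3.0],
    'Toby':             [None, 4.5, None, 4.0, 1.0, None],
}
critics = {name: {f: r for f, r in zip(FILMS, row) if r is not None}
           for name, row in ROWS.items()}

def sim_pearson(prefs, p1, p2):
    """Condensed restatement of the article's sim_pearson."""
    shared = [film for film in prefs[p1] if film in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    x = [prefs[p1][film] for film in shared]
    y = [prefs[p2][film] for film in shared]
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
               (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den if den else 0

def top_matches(prefs, person, n=3, similarity=sim_pearson):
    """Hypothetical helper: the n people most similar to `person`."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)  # highest similarity first
    return scores[:n]

for score, name in top_matches(critics, 'Toby'):
    print('{}: {:.2f}'.format(name, score))
# Lisa Rose: 0.99
# Mick LaSalle: 0.92
# Claudia Puig: 0.89
```

Passing the similarity function as a parameter makes it easy to swap in sim_distance and compare the two rankings.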

Making recommendations based on users

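The table below works through a recommendation for Toby. Each critic's rating of a film Toby has not rated is multiplied by that critic's Pearson similarity to Toby (the S.x columns), and each weighted total is divided by the summed similarity of the critics who rated that film.

Critic          Similarity  Night  S.xNight  Lady  S.xLady  Luck  S.xLuck
Rose            0.99        3.0    2.97      2.5   2.48     3.0   2.97
Seymour         0.38        3.0    1.14      3.0   1.14     1.5   0.57
Puig            0.89        4.5    4.02      -     -        3.0   2.68
LaSalle         0.92        3.0    2.77      3.0   2.77     2.0   1.85
Matthews        0.66        3.0    1.99      3.0   1.99     -     -
Total                              12.89           8.38           8.07
Sim. Sum                           3.84            2.95           3.18
Total/Sim. Sum                     3.35            2.83           2.53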
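
The weighted average in the table can be sketched as follows. The function name get_recommendations is an assumption, as are the compact FILMS and ROWS data layout and the condensed sim_pearson, which are included only so that the block runs on its own. The sketch skips critics whose similarity is zero or negative, which appears to be why Michael Phillips is missing from the table: his Pearson score against Toby works out to -1.

```python
from math import sqrt

# Compact rebuild of the critics dictionary; None marks an unrated film.
FILMS = ['Lady in the Water', 'Snakes on a Plane', 'Just My Luck',
         'Superman Returns', 'You, Me and Dupree', 'The Night Listener']
ROWS = {
    'Lisa Rose':        [2.5, 3.5, 3.0, 3.5, 2.5, 3.0],
    'Gene Seymour':     [3.0, 3.5, 1.5, 5.0, 3.5, 3.0],
    'Michael Phillips': [2.5, 3.0, None, 3.5, None, 4.0],
    'Claudia Puig':     [None, 3.5, 3.0, 4.0, 2.5, 4.5],
    'Mick LaSalle':     [3.0, 4.0, 2.0, 3.0, 2.0, 3.0],
    'Jack Matthews':    [3.0, 4.0, None, 5.0, 3.5, 3.0],
    'Toby':             [None, 4.5, None, 4.0, 1.0, None],
}
critics = {name: {f: r for f, r in zip(FILMS, row) if r is not None}
           for name, row in ROWS.items()}

def sim_pearson(prefs, p1, p2):
    """Condensed restatement of the article's sim_pearson."""
    shared = [film for film in prefs[p1] if film in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    x = [prefs[p1][film] for film in shared]
    y = [prefs[p2][film] for film in shared]
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
               (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den if den else 0

def get_recommendations(prefs, person, similarity=sim_pearson):
    """Similarity-weighted average rating for each film `person` has not rated."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:  # ignore critics with no positive similarity
            continue
        for film, rating in prefs[other].items():
            if film not in prefs[person]:
                totals[film] = totals.get(film, 0) + sim * rating  # the S.x columns
                sim_sums[film] = sim_sums.get(film, 0) + sim       # the Sim. Sum row
    rankings = [(total / sim_sums[film], film) for film, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings

for score, film in get_recommendations(critics, 'Toby'):
    print('{}: {:.2f}'.format(film, score))
# The Night Listener: 3.35
# Lady in the Water: 2.83
# Just My Luck: 2.53
```

Dividing by the similarity sum rather than by the number of critics keeps a film from ranking higher simply because more people rated it.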

Making recommendations based on items
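
The article gives no listing for this section. One common way to obtain item-to-item similarities, shown here only as a sketch under that assumption, is to invert the dictionary so that films become the outer keys and then reuse the same similarity function. The names transform_prefs and similar_items are hypothetical, and the compact FILMS and ROWS layout is only there to make the block self-contained.

```python
# Compact rebuild of the critics dictionary; None marks an unrated film.
FILMS = ['Lady in the Water', 'Snakes on a Plane', 'Just My Luck',
         'Superman Returns', 'You, Me and Dupree', 'The Night Listener']
ROWS = {
    'Lisa Rose':        [2.5, 3.5, 3.0, 3.5, 2.5, 3.0],
    'Gene Seymour':     [3.0, 3.5, 1.5, 5.0, 3.5, 3.0],
    'Michael Phillips': [2.5, 3.0, None, 3.5, None, 4.0],
    'Claudia Puig':     [None, 3.5, 3.0, 4.0, 2.5, 4.5],
    'Mick LaSalle':     [3.0, 4.0, 2.0, 3.0, 2.0, 3.0],
    'Jack Matthews':    [3.0, 4.0, None, 5.0, 3.5, 3.0],
    'Toby':             [None, 4.5, None, 4.0, 1.0, None],
}
critics = {name: {f: r for f, r in zip(FILMS, row) if r is not None}
           for name, row in ROWS.items()}

def transform_prefs(prefs):
    """Swap the key levels: {person: {film: r}} becomes {film: {person: r}}."""
    result = {}
    for person, ratings in prefs.items():
        for film, rating in ratings.items():
            result.setdefault(film, {})[person] = rating
    return result

def sim_distance(prefs, a, b):
    """Condensed restatement of the article's sim_distance."""
    shared = [key for key in prefs[a] if key in prefs[b]]
    if not shared:
        return 0
    return 1 / (1 + sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared))

def similar_items(item_prefs, item, n=3):
    """Hypothetical helper: the n items rated most like `item`."""
    scores = [(sim_distance(item_prefs, item, other), other)
              for other in item_prefs if other != item]
    scores.sort(reverse=True)
    return scores[:n]

movies = transform_prefs(critics)
print(movies['Superman Returns']['Toby'])  # 4.0, now looked up by film first
for score, film in similar_items(movies, 'Superman Returns'):
    print('{}: {:.3f}'.format(film, score))
# Snakes on a Plane: 0.167
# The Night Listener: 0.103
# Lady in the Water: 0.091
```

In a larger system these item similarities would typically be computed ahead of time and stored, which is the maintenance overhead discussed in the next section.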

Choosing between the two collaborative filtering approaches

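Item-based filtering produces more personalized recommendations that reflect and carry forward the user's own history of interests. It is more accurate on sparse datasets and noticeably faster at generating recommendation lists for large datasets, at the cost of maintaining the item similarity data.
User-based filtering, on the other hand, is easier to implement. Its recommendations emphasize what is popular among the small group of users whose interests resemble the target user's. It is better suited to smaller in-memory datasets that change very frequently, or to cases where users with similar preferences need to be recommended to a given user.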