Collaborative Filtering: Finding Similar Movies

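0. Theory

Compute the correlation between the ratings of different movies; movies whose ratings are highly correlated are treated as similar.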

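Measures of correlation:
Pearson correlation coefficient: measures linear correlation. If r = 0, we can only say that x and y have no linear correlation, not that they are unrelated. The larger the absolute value of the coefficient, the stronger the correlation: values close to 1 or -1 indicate a strong correlation, and values close to 0 indicate a weak one.
Spearman correlation coefficient: a nonparametric measure of the dependence between two variables. It assesses how well the relationship between the two variables can be described by a monotonic function. If the data contain no repeated values and the two variables are perfectly monotonically related, the Spearman coefficient is +1 or −1.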
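
The difference between the two coefficients can be illustrated with a small pure-Python sketch. The helper names (`pearson`, `ranks`, `spearman`) and the sample data are made up for this illustration; the rank helper assumes no tied values, matching the "no repeated values" condition above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1 (illustrative helper; assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = list(range(1, 11))
cubic = [v ** 3 for v in x]          # monotonic in x, but not linear
sym = list(range(-3, 4))
parabola = [v ** 2 for v in sym]     # fully determined by x, but not linearly

pearson_cubic = pearson(x, cubic)            # below 1: the relation is not linear
spearman_cubic = spearman(x, cubic)          # 1: the relation is perfectly monotonic
pearson_parabola = pearson(sym, parabola)    # 0: no linear correlation despite dependence
print(pearson_cubic, spearman_cubic, pearson_parabola)
```

The parabola case is the situation described above: r = 0 rules out a linear relationship only, not a relationship in general.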

1. Finding similar movies

import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
# Join the two tables
ratings = pd.merge(movies, ratings)
ratings.head()
   movie_id             title  user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3
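
`pd.merge(movies, ratings)` relies on pandas picking the shared column (`movie_id`) as the join key. A minimal sketch of the same join with the key spelled out, on made-up toy rows rather than the real ml-100k files; the `validate` argument is an optional extra check, not something the original code uses.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only).
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "GoldenEye (1995)"],
})
ratings = pd.DataFrame({
    "user_id": [308, 287, 308],
    "movie_id": [1, 1, 2],
    "rating": [4, 5, 3],
})

# Same inner join, but with the key named explicitly; validate raises
# an error if a movie_id appears more than once in the movies table.
merged = pd.merge(movies, ratings, on="movie_id", how="inner", validate="one_to_many")
print(merged)
```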
# Build the pivot table (one row per user, one column per movie title)
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatings.head()
title    'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  12 Angry Men (1957)  187 (1997)  2 Days in the Valley (1996)  20,000 Leagues Under the Sea (1954)  2001: A Space Odyssey (1968)  3 Ninjas: High Noon At Mega Mountain (1998)  39 Steps, The (1935)  ...  Yankee Zulu (1994)  Year of the Horse (1997)  You So Crazy (1994)  Young Frankenstein (1974)  Young Guns (1988)  Young Guns II (1990)  Young Poisoner's Handbook, The (1995)  Zeus and Roxanne (1997)  unknown  Á köldum klaka (Cold Fever) (1994)
user_id
0        NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1        NaN  NaN  2.0  5.0  NaN  NaN  3.0  4.0  NaN  NaN  ...  NaN  NaN  NaN  5.0  3.0  NaN  NaN  NaN  4.0  NaN
2        NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3        NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4        NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

5 rows × 1664 columns
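
What `pivot_table` does here is easier to see on a tiny example. The sketch below uses made-up ratings with placeholder titles "A", "B" and "C"; it is not part of the original notebook.

```python
import pandas as pd

# Long format: one row per (user, movie) rating.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title": ["A", "B", "A", "B", "C"],
    "rating": [5, 3, 4, 2, 1],
})

# Wide format: users as rows, titles as columns.
# Movies a user did not rate become NaN; if a user rated the same title
# more than once, pivot_table would average the ratings (its default aggfunc).
matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(matrix)

# Number of ratings available per title.
ratings_per_title = matrix.notna().sum()
print(ratings_per_title)
```

Most cells of the real matrix are NaN for the same reason: each user rates only a small fraction of the 1664 titles.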

# Look at the ratings for Star Wars
starWarsRatings = movieRatings['Star Wars (1977)']
starWarsRatings.head()
user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64
# Find the movies whose ratings correlate with those of Star Wars
similarMovies = movieRatings.corrwith(starWarsRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)
C:\Users\Administrator\anaconda3\lib\site-packages\numpy\lib\function_base.py:2845: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
C:\Users\Administrator\anaconda3\lib\site-packages\numpy\lib\function_base.py:2704: RuntimeWarning: divide by zero encountered in divide
  c *= np.true_divide(1, fact)
                                             0
title
'Til There Was You (1997)             0.872872
1-900 (1994)                         -0.645497
101 Dalmatians (1996)                 0.211132
12 Angry Men (1957)                   0.184289
187 (1997)                            0.027398
2 Days in the Valley (1996)           0.066654
20,000 Leagues Under the Sea (1954)   0.289768
2001: A Space Odyssey (1968)          0.230884
39 Steps, The (1935)                  0.106453
8 1/2 (1963)                         -0.142977
# Sort in descending order
similarMovies.sort_values(ascending=False)
title
Commandments (1997)           1.0
Cosi (1996)                   1.0
No Escape (1994)              1.0
Stripes (1981)                1.0
Man of the Year (1995)        1.0
                             ... 
For Ever Mozart (1996)       -1.0
Frankie Starlight (1995)     -1.0
I Like It Like That (1994)   -1.0
American Dream (1990)        -1.0
Theodore Rex (1995)          -1.0
Length: 1410, dtype: float64
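
Correlations of exactly 1.0 or -1.0 are what one would expect when a title shares only two raters with the target movie, because any two points lie on a straight line. A minimal sketch on a made-up rating matrix (placeholder titles "A", "B", "C"; "B" plays the role of the target):

```python
import numpy as np
import pandas as pd

# Toy user x title matrix; NaN means "not rated".
toy = pd.DataFrame(
    {
        "A": [5, 4, 1, 2, np.nan],
        "B": [5, 5, 2, 1, 3],
        "C": [np.nan, np.nan, np.nan, 4, 5],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
target = toy["B"]

# Pairwise Pearson correlation of every column with the target,
# computed only over users who rated both titles.
sims = toy.corrwith(target)

# Number of users each title shares with the target.
shared = toy[target.notna()].count()

print(pd.DataFrame({"similarity": sims, "shared_raters": shared}))
```

Title "C" gets a similarity of 1.0 from just two shared raters, while "A" gets a lower but better supported value from four.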
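# Our results may be distorted by movies that were rated by only a few people who also happened to like Star Wars. We therefore need to drop the movies that only a few people rated, since they produce spurious results.
# Let's build a new DataFrame with the number of ratings and the average rating of each movie; it will also come in handy later.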

import numpy as np
movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})
movieStats.head()
                          rating
                            size      mean
title
'Til There Was You (1997)      9  2.333333
1-900 (1994)                   5  2.600000
101 Dalmatians (1996)        109  2.908257
12 Angry Men (1957)          125  4.344000
187 (1997)                    41  3.024390
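
Newer pandas releases may warn when NumPy functions such as `np.mean` are passed to `agg`. A possible alternative, sketched on made-up data, is to pass the aggregation names as strings on the selected column; this also yields flat `size` and `mean` columns instead of the two-level header shown above. The variable names `stats` and `popular` are illustrative, and the threshold of 3 stands in for the 100 used with the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "rating": [5, 4, 3, 2, 4],
})

# Count and average of the ratings per title, with flat column names.
stats = toy.groupby("title")["rating"].agg(["size", "mean"])
print(stats)

# Keep only titles with at least 3 ratings.
popular = stats[stats["size"] >= 3]
print(popular)
```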
# Drop movies with fewer than 100 ratings
popularMovies = movieStats.loc[movieStats['rating']['size'] >= 100]
popularMovies.sort_values([('rating', 'mean')], ascending=False)[:15]
                                       rating
                                         size      mean
title
Close Shave, A (1995)                     112  4.491071
Schindler's List (1993)                   298  4.466443
Wrong Trousers, The (1993)                118  4.466102
Casablanca (1942)                         243  4.456790
Shawshank Redemption, The (1994)          283  4.445230
Rear Window (1954)                        209  4.387560
Usual Suspects, The (1995)                267  4.385768
Star Wars (1977)                          584  4.359589
12 Angry Men (1957)                       125  4.344000
Citizen Kane (1941)                       198  4.292929
To Kill a Mockingbird (1962)              219  4.292237
One Flew Over the Cuckoo's Nest (1975)    264  4.291667
Silence of the Lambs, The (1991)          390  4.289744
North by Northwest (1959)                 179  4.284916
Godfather, The (1972)                     413  4.283293
# Movie similarity as a DataFrame with a named column
similarMovies = pd.DataFrame(similarMovies, columns=['similarity'])
similarMovies.head()
                           similarity
title
'Til There Was You (1997)    0.872872
1-900 (1994)                -0.645497
101 Dalmatians (1996)        0.211132
12 Angry Men (1957)          0.184289
187 (1997)                   0.027398
# Join the two tables to attach the similarity to each popular movie
df = popularMovies.join(similarMovies).reset_index()
df.head()
C:\Users\Administrator\AppData\Local\Temp\ipykernel_888\955805574.py:2: FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left, 1 on the right)
  df = popularMovies.join(similarMovies).reset_index()
   title                         (rating, size)  (rating, mean)  similarity
0  101 Dalmatians (1996)                    109        2.908257    0.211132
1  12 Angry Men (1957)                      125        4.344000    0.184289
2  2001: A Space Odyssey (1968)             259        3.969112    0.230884
3  Absolute Power (1997)                    127        3.370079    0.085440
4  Abyss, The (1989)                        151        3.589404    0.203709
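
The FutureWarning above comes from joining a frame with two-level columns to one with single-level columns. One way to avoid it, sketched here on made-up stand-ins for `popularMovies` and `similarMovies`, is to drop the outer `rating` level before joining. The names `stats`, `sims`, `flat_stats` and `joined` are illustrative only.

```python
import pandas as pd

# Toy stand-in for popularMovies: two-level columns ('rating', 'size') and ('rating', 'mean').
stats = pd.DataFrame(
    {("rating", "size"): [109, 125], ("rating", "mean"): [2.9, 4.3]},
    index=pd.Index(["A", "B"], name="title"),
)
# Toy stand-in for the similarity Series indexed by title.
sims = pd.Series([0.21, 0.18, 0.50], index=pd.Index(["A", "B", "C"], name="title"))

# Drop the outer 'rating' level so both sides have single-level columns.
flat_stats = stats.droplevel(0, axis=1)

# Left join on the title index; titles missing from flat_stats (here "C") are not kept.
joined = flat_stats.join(sims.to_frame("similarity")).reset_index()
print(joined)
```

With flat columns, the result has plain `size` and `mean` headers rather than the `(rating, size)` and `(rating, mean)` tuples shown above.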
# Sort by similarity in descending order to get the recommendations
df.sort_values('similarity', ascending=False)[:15]
     title                                               (rating, size)  (rating, mean)  similarity
295  Star Wars (1977)                                               584        4.359589    1.000000
99   Empire Strikes Back, The (1980)                                368        4.206522    0.748353
255  Return of the Jedi (1983)                                      507        4.007890    0.672556
247  Raiders of the Lost Ark (1981)                                 420        4.252381    0.536117
24   Austin Powers: International Man of Mystery (1...              130        3.246154    0.377433
298  Sting, The (1973)                                              241        4.058091    0.367538
162  Indiana Jones and the Last Crusade (1989)                      331        3.930514    0.350107
235  Pinocchio (1940)                                               101        3.673267    0.347868
119  Frighteners, The (1996)                                        115        3.234783    0.332729
176  L.A. Confidential (1997)                                       297        4.161616    0.319065
326  Wag the Dog (1997)                                             137        3.510949    0.318645
94   Dumbo (1941)                                                   123        3.495935    0.317656
49   Bridge on the River Kwai, The (1957)                           165        4.175758    0.316580
232  Philadelphia Story, The (1940)                                 104        4.115385    0.314272
200  Miracle on 34th Street (1994)                                  101        3.722772    0.310921

2. References

Python数据科学与机器学习:从入门到实践 (Python Data Science and Machine Learning: From Beginner to Practice)
Author: Frank Kane (USA)

Source code download:
https://www.ituring.com.cn/book/2426
