Recommender Systems: Item-Based Collaborative Filtering

0. Theory

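Building a recommender system on item-based collaborative filtering has two advantages:
1) People are fickle, but items do not change over time.
2) There are usually far fewer items to recommend than there are people, which saves a great deal of computation and leaves room for more sophisticated algorithms.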

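Approach: build an item-based collaborative filtering system that implements features such as "People who watched this movie also watched…" and "People who rated this movie highly also rated … highly". In other words, establish relationships between movies.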
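
As a rough illustration of the "people who watched this also watched" idea, the sketch below counts how many users two movies have in common. The viewing data and the helper name `also_watched` are made up for this example and are not part of the MovieLens code that follows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: which movies each user has watched
watched = {
    "alice": {"Star Wars", "Empire Strikes Back", "Return of the Jedi"},
    "bob": {"Star Wars", "Empire Strikes Back"},
    "carol": {"Star Wars", "Gone with the Wind"},
}

# Count how many users watched each unordered pair of movies
pair_counts = Counter()
for movies_seen in watched.values():
    for a, b in combinations(sorted(movies_seen), 2):
        pair_counts[(a, b)] += 1


def also_watched(title):
    """Return (movie, shared viewer count) pairs, most shared first."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == title:
            related[b] += n
        elif b == title:
            related[a] += n
    return related.most_common()


print(also_watched("Star Wars"))
```

Raw co-occurrence counts favor popular movies, which is one reason the implementation below uses rating correlations instead.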

1. Implementing movie recommendation in Python 3

import pandas as pd 
import warnings
warnings.filterwarnings ("ignore")

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.data', sep = '\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")
ratings.head()
   user_id  movie_id  rating
0        0        50       5
1        0       172       5
2        0       133       1
3      196       242       3
4      186       302       3
m_cols = ['movie_id', 'title']
movies = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习:从入门到实践》源代码/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
movies.head()
   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)
# Join the two tables
ratings = pd.merge(movies, ratings, how = 'inner', on = 'movie_id')
ratings.head()
   movie_id             title  user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3
# Pivot into a user-by-movie rating matrix (one row per user, one column per title)
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()
title | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994)
user_id
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 | NaN | NaN | 2.0 | 5.0 | NaN | NaN | 3.0 | 4.0 | NaN | NaN | ... | NaN | NaN | NaN | 5.0 | 3.0 | NaN | NaN | NaN | 4.0 | NaN
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
3 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN

5 rows × 1664 columns
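
To make the pivot step easier to follow, here is a minimal sketch on a made-up five-row ratings table. The frame `toy` and its values are hypothetical; only `pivot_table` itself is the call used above.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as the merged table
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "title": ["A", "B", "A", "C", "B"],
    "rating": [5, 3, 4, 2, 1],
})

# One row per user, one column per title; movies a user has not rated become NaN
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_matrix)
```

By default `pivot_table` averages duplicate (user, title) pairs, so the cells come back as floats even though the input ratings are integers.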

# Pearson correlation between movies; keep only pairs with at least 100 observations (at least 100 users rated both movies)
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()
title | 'Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | ... | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner's Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1-900 (1994) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
101 Dalmatians (1996) | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
12 Angry Men (1957) | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
187 (1997) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN

5 rows × 1664 columns
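
The many NaN entries above come from `min_periods`: a pair of movies gets a correlation only if enough users rated both. The following sketch shows the effect on a small made-up matrix; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix: A and B share 4 raters, A and C share only 2
toy_matrix = pd.DataFrame({
    "A": [5, 4, 1, 2, np.nan],
    "B": [4, 5, 2, 1, np.nan],
    "C": [1, np.nan, np.nan, 5, 3],
})

# With a low threshold every overlapping pair gets a value
loose = toy_matrix.corr(method="pearson", min_periods=2)
# With a higher threshold, pairs with too few common raters become NaN
strict = toy_matrix.corr(method="pearson", min_periods=3)

print(loose)
print(strict)
```

In `loose`, A and C are perfectly anti-correlated, but that figure rests on just two users; `strict` replaces it with NaN while keeping the A/B correlation of 0.8. This is the trade-off behind choosing a threshold such as 100 on the real data.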

# Take the row for user_id 0 as the test data
myRatings = userRatings.loc[0].dropna()
myRatings
title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64
myRatings.iloc[0]
5.0
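# Recommend movies for the test user
simList = []
for i in range(myRatings.shape[0]):
    print('Finding movies similar to "{}"...'.format(myRatings.index[i]))
    # Correlations between this movie and the other movies
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Multiply each similarity by the user's rating of this movie
    sims = sims.map(lambda x: x * myRatings.iloc[i])
    # Add to the candidates
    simList.append(sims)
# Series.append was removed in pandas 2.0, so combine the pieces with pd.concat
simCandidates = pd.concat(simList)

print("Results:")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))
Finding movies similar to "Empire Strikes Back, The (1980)"...
Finding movies similar to "Gone with the Wind (1939)"...
Finding movies similar to "Star Wars (1977)"...
Results: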
Empire Strikes Back, The (1980)                       5.000000
Star Wars (1977)                                      5.000000
Empire Strikes Back, The (1980)                       3.741763
Star Wars (1977)                                      3.741763
Return of the Jedi (1983)                             3.606146
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.693297
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Sting, The (1973)                                     1.837692
dtype: float64
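# Sum the scores by movie title
simCandidates = simCandidates.groupby(simCandidates.index).sum()
# Sort the scores in descending order
simCandidates.sort_values(inplace = True, ascending = False)
# Drop the movies the user has already rated, since movies they have seen need not be recommended
filteredSims = simCandidates.drop(myRatings.index)
print("Recommendations:")
filteredSims.head(10)
Recommendations: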
Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
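
The aggregation step can be checked by hand. The sketch below takes the four duplicated entries from the top-10 candidate list printed earlier and sums them by title, just as `groupby(...).sum()` does. Which rated movie produced which entry is not shown in that output, so the values are used only as they were printed.

```python
import pandas as pd

# Duplicated titles copied from the top-10 candidate list printed earlier
candidates = pd.Series(
    [3.606146, 3.362779, 2.693297, 2.680586],
    index=[
        "Return of the Jedi (1983)",
        "Return of the Jedi (1983)",
        "Raiders of the Lost Ark (1981)",
        "Raiders of the Lost Ark (1981)",
    ],
)

# Sum the weighted similarities that share a title
totals = candidates.groupby(candidates.index).sum().sort_values(ascending=False)
print(totals)
```

These partial sums (about 6.968925 and 5.373883) are slightly lower than the final scores of 7.178172 and 5.519700. The difference presumably comes from the smaller contributions of Gone with the Wind (1939), which are part of the full candidate list but fall outside its top 10.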

2. Improving the recommendations

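Try the following to improve the recommendations:
1) Correlation coefficient: use the Spearman correlation coefficient.
2) Change the value of min_periods.
3) When the user hates a movie, movies similar to it should not be recommended.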
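
For point 1, it may help to see how the two coefficients differ. Spearman works on ranks, so it rewards any monotonic relationship, while Pearson measures linear association only. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly (y = x squared)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("pearson: ", round(pearson, 4))
print("spearman:", round(spearman, 4))
```

Here Spearman is 1 because the ordering of x and y agrees exactly, while Pearson stays a little below 1. On a 1-to-5 rating scale the two usually give similar but not identical rankings.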

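def recommand_movie(method='pearson', min_period=100, weight=0):
    # Choose the correlation method and the minimum number of common raters
    corrMatrix = userRatings.corr(method=method, min_periods=min_period)
    # Take the row for user_id 0 as the test data
    myRatings = userRatings.loc[0].dropna()
    # Recommend movies for the test user
    simList = []
    for i in range(myRatings.shape[0]):
        # Correlations between this movie and the other movies
        sims = corrMatrix[myRatings.index[i]].dropna()
        if myRatings.iloc[i] == 1:
            # The user hated this movie (rating of 1), so the scores of movies
            # similar to it are lowered: weight=0 ignores them, weight>0 subtracts
            sims = sims.map(lambda x: x * myRatings.iloc[i] * weight * (-1))
        else:
            # Multiply each similarity by the user's rating of this movie
            sims = sims.map(lambda x: x * myRatings.iloc[i])
        # Add to the candidates
        simList.append(sims)
    simCandidates = pd.concat(simList, axis = 0)

    # Sum the scores by movie title
    simCandidates = simCandidates.groupby(simCandidates.index).sum()
    # Sort the scores in descending order
    simCandidates.sort_values(inplace = True, ascending = False)
    # Drop the movies the user has already rated, since movies they have seen need not be recommended
    simCandidates = simCandidates.drop(myRatings.index)
    print("Recommended movies for user 0:")
    print(simCandidates.head(10))
recommand_movie(method='spearman', min_period = 100, weight = 0)
Recommended movies for user 0: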
Return of the Jedi (1983)                    6.407001
Raiders of the Lost Ark (1981)               4.528739
Indiana Jones and the Last Crusade (1989)    3.299785
Sting, The (1973)                            3.273064
Batman (1989)                                3.012647
Singin' in the Rain (1952)                   2.952571
Field of Dreams (1989)                       2.945751
Dumbo (1941)                                 2.894872
Jaws (1975)                                  2.867804
Star Trek: The Wrath of Khan (1982)          2.859166
dtype: float64
recommand_movie(method='pearson', min_period = 80, weight= 0)
Recommended movies for user 0:
Return of the Jedi (1983)                    6.968925
Raiders of the Lost Ark (1981)               5.373883
Bridge on the River Kwai, The (1957)         3.366616
Indiana Jones and the Last Crusade (1989)    3.316717
Cinderella (1950)                            3.245412
Sting, The (1973)                            3.209627
Con Air (1997)                               3.204525
Back to the Future (1985)                    3.100622
Day the Earth Stood Still, The (1951)        3.087913
Field of Dreams (1989)                       3.068508
dtype: float64
recommand_movie(method='pearson', min_period = 100, weight= 0.5)
Recommended movies for user 0:
Return of the Jedi (1983)                    6.864302
Raiders of the Lost Ark (1981)               5.300974
Bridge on the River Kwai, The (1957)         3.366616
Cinderella (1950)                            3.245412
Indiana Jones and the Last Crusade (1989)    3.231061
Sting, The (1973)                            3.149519
Field of Dreams (1989)                       2.991607
Dumbo (1941)                                 2.981645
Back to the Future (1985)                    2.971963
Star Trek: The Wrath of Khan (1982)          2.968080
dtype: float64
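
The `weight` parameter only treats a rating of exactly 1 as a dislike. One possible generalization, which is not from the book and is offered only as a sketch, is to center each rating on the midpoint of the 1-to-5 scale, so that low ratings push similar movies down and high ratings push them up. The function name `centered_scores`, the `midpoint` parameter and the toy data are all assumptions made for this illustration.

```python
import pandas as pd


def centered_scores(my_ratings, corr_matrix, midpoint=3.0):
    """Score candidates with ratings centered on the scale midpoint (assumed variant)."""
    parts = []
    for title, rating in my_ratings.items():
        sims = corr_matrix[title].dropna()
        # Ratings below the midpoint give negative weights, ratings above give positive ones
        parts.append(sims * (rating - midpoint))
    scores = pd.concat(parts)
    scores = scores.groupby(scores.index).sum()
    # Remove movies the user has already rated, then sort best first
    scores = scores.drop(my_ratings.index, errors="ignore")
    return scores.sort_values(ascending=False)


# Hypothetical similarity matrix for four movies
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.8, 0.1],
     [0.2, 1.0, 0.1, 0.7],
     [0.8, 0.1, 1.0, 0.0],
     [0.1, 0.7, 0.0, 1.0]],
    index=list("ABCD"), columns=list("ABCD"),
)
# The user loved A (5 stars) and hated B (1 star)
toy_ratings = pd.Series({"A": 5.0, "B": 1.0})

result = centered_scores(toy_ratings, toy_corr)
print(result)
```

In this toy case C, which resembles the loved movie A, ends up with a positive score, and D, which resembles the hated movie B, ends up negative. Whether this improves results on the real data would need to be checked.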

3. References

Python数据科学与机器学习:从入门到实践 (Python Data Science and Machine Learning: From Beginner to Practice)
Author: Frank Kane (US)

Source code download:
https://www.ituring.com.cn/book/2426
