Recommender Systems: Item-Based Collaborative Filtering

闲庭信步的空间

Tags: python, machine learning, data mining

Article link: https://blog.csdn.net/danspace1/article/details/130340354

0. Theory

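Using item-based collaborative filtering in a recommender system has two advantages:
1) People are fickle, but items do not change over time.
2) There are far fewer items to recommend than there are people, which saves a great deal of computation and leaves room for more sophisticated algorithms.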

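Approach: build an item-based collaborative filtering system, that is, implement features such as "People who watched this movie also watched ..." and "People who rated this movie highly also rated ... highly". In other words, establish relationships between movies.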
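As a rough illustration of the "people who watched this movie also watched ..." idea, the sketch below counts how often two movies show up in the same user's history. The `watched` dictionary and the `also_watched` helper are made up for this example; they are not part of the MovieLens data or of the code in the rest of this post.

```python
from collections import Counter
from itertools import permutations

# Hypothetical viewing histories: user -> set of movies watched
watched = {
    "alice": {"Star Wars", "Empire Strikes Back", "Return of the Jedi"},
    "bob": {"Star Wars", "Empire Strikes Back"},
    "carol": {"Star Wars", "Gone with the Wind"},
}

# Count, for every ordered pair (a, b), how many users watched both
co_counts = Counter()
for movies in watched.values():
    for a, b in permutations(sorted(movies), 2):
        co_counts[(a, b)] += 1


def also_watched(movie, top_n=3):
    """Return the movies most often watched together with `movie`."""
    pairs = [(b, n) for (a, b), n in co_counts.items() if a == movie]
    # Sort by count (descending), then by title for a stable order
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_n]


print(also_watched("Star Wars"))
```

Raw co-occurrence counts favor popular movies, which is one reason the rest of the post works with rating correlations instead.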

1. Movie recommendation in Python 3

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习：从入门到实践》源代码/ml-100k/u.data', sep = '\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

ratings.head()

	user_id	movie_id	rating
0	0	50	5
1	0	172	5
2	0	133	1
3	196	242	3
4	186	302	3

m_cols = ['movie_id', 'title']
movies = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习：从入门到实践》源代码/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")
movies.head()

	movie_id	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)
3	4	Get Shorty (1995)
4	5	Copycat (1995)

# Join the two tables
ratings = pd.merge(movies, ratings, how = 'inner', on = 'movie_id')
ratings.head()

	movie_id	title	user_id	rating
0	1	Toy Story (1995)	308	4
1	1	Toy Story (1995)	287	5
2	1	Toy Story (1995)	148	4
3	1	Toy Story (1995)	280	4
4	1	Toy Story (1995)	66	3

# Pivot the table: one row per user, one column per movie title
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
userRatings.head()

title	'Til There Was You (1997)	1-900 (1994)	101 Dalmatians (1996)	12 Angry Men (1957)	187 (1997)	2 Days in the Valley (1996)	20,000 Leagues Under the Sea (1954)	2001: A Space Odyssey (1968)	3 Ninjas: High Noon At Mega Mountain (1998)	39 Steps, The (1935)	...	Yankee Zulu (1994)	Year of the Horse (1997)	You So Crazy (1994)	Young Frankenstein (1974)	Young Guns (1988)	Young Guns II (1990)	Young Poisoner's Handbook, The (1995)	Zeus and Roxanne (1997)	unknown	Á köldum klaka (Cold Fever) (1994)
user_id
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	2.0	5.0	NaN	NaN	3.0	4.0	NaN	NaN	...	NaN	NaN	NaN	5.0	3.0	NaN	NaN	NaN	4.0	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	2.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1664 columns
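To make the reshaping step easier to follow, here is a small sketch of `pivot_table` on a toy frame. The `toy` data and the name `toy_matrix` are invented for illustration; only the `pivot_table` call mirrors the code above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title": ["Star Wars", "Dumbo", "Star Wars", "Dumbo", "Jaws"],
    "rating": [5, 3, 4, 2, 5],
})

# One row per user, one column per title; movies a user did not rate become NaN
toy_matrix = toy.pivot_table(index="user_id", columns="title", values="rating")
print(toy_matrix)
```

Because `pivot_table` aggregates with the mean by default, a user who rated the same title twice would end up with the average of the two ratings in a single cell.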

# Pearson correlation; min_periods=100 keeps only movie pairs that at least 100 users rated in common
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
corrMatrix.head()

title	'Til There Was You (1997)	1-900 (1994)	101 Dalmatians (1996)	12 Angry Men (1957)	187 (1997)	2 Days in the Valley (1996)	20,000 Leagues Under the Sea (1954)	2001: A Space Odyssey (1968)	3 Ninjas: High Noon At Mega Mountain (1998)	39 Steps, The (1935)	...	Yankee Zulu (1994)	Year of the Horse (1997)	You So Crazy (1994)	Young Frankenstein (1974)	Young Guns (1988)	Young Guns II (1990)	Young Poisoner's Handbook, The (1995)	Zeus and Roxanne (1997)	unknown	Á köldum klaka (Cold Fever) (1994)
title
'Til There Was You (1997)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1-900 (1994)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
101 Dalmatians (1996)	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12 Angry Men (1957)	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
187 (1997)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 1664 columns
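The effect of `min_periods` can be sketched in plain Python: the correlation between two movies is computed only over users who rated both, and the result is discarded when there are too few such users. The function name `pearson_min_overlap` and the sample ratings are assumptions made for this example. Where pandas reports NaN, this sketch returns `None`.

```python
from math import sqrt


def pearson_min_overlap(x, y, min_periods=1):
    """Pearson correlation over positions where both values are present.

    Returns None when fewer than `min_periods` pairs are available or
    when either side has zero variance.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_periods:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of two movies by five users (None = not rated)
movie_a = [5, 4, None, 2, 1]
movie_b = [4, 3, 5, 1, None]

# Three users rated both movies, so a threshold of 3 passes and 4 does not
print(pearson_min_overlap(movie_a, movie_b, min_periods=3))
print(pearson_min_overlap(movie_a, movie_b, min_periods=4))
```

This is probably also why the diagonal of `corrMatrix` is not always 1.0 in the output above: a movie with fewer than 100 ratings does not meet the threshold even when compared with itself.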

# Select the row for user_id 0 as the test user
myRatings = userRatings.loc[0].dropna()
myRatings

title
Empire Strikes Back, The (1980)    5.0
Gone with the Wind (1939)          1.0
Star Wars (1977)                   5.0
Name: 0, dtype: float64

myRatings.iloc[0]

5.0

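# Recommend movies for the test user
simCandidates = pd.Series(dtype='float64')
for i in range(myRatings.shape[0]):
    print('Finding movies similar to "{}"...'.format(myRatings.index[i]))
    # Correlations between this movie and every other movie
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Multiply each similarity by the user's rating of this movie
    sims = sims.map(lambda x: x * myRatings.iloc[i])
    # Add to the candidates (Series.append was removed in pandas 2.0)
    simCandidates = pd.concat([simCandidates, sims])

print("Results:")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(10))

Finding movies similar to "Empire Strikes Back, The (1980)"...
Finding movies similar to "Gone with the Wind (1939)"...
Finding movies similar to "Star Wars (1977)"...
Results: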
Empire Strikes Back, The (1980)                       5.000000
Star Wars (1977)                                      5.000000
Empire Strikes Back, The (1980)                       3.741763
Star Wars (1977)                                      3.741763
Return of the Jedi (1983)                             3.606146
Return of the Jedi (1983)                             3.362779
Raiders of the Lost Ark (1981)                        2.693297
Raiders of the Lost Ark (1981)                        2.680586
Austin Powers: International Man of Mystery (1997)    1.887164
Sting, The (1973)                                     1.837692
dtype: float64

# Sum the scores by movie title
simCandidates = simCandidates.groupby(simCandidates.index).sum()
# Sort the scores in descending order
simCandidates.sort_values(inplace=True, ascending=False)
# Drop the movies the user has already rated, since there is no need to recommend them
filteredSims = simCandidates.drop(myRatings.index)
print("Recommendations:")
filteredSims.head(10)

Recommendations:

Return of the Jedi (1983)                    7.178172
Raiders of the Lost Ark (1981)               5.519700
Indiana Jones and the Last Crusade (1989)    3.488028
Bridge on the River Kwai, The (1957)         3.366616
Back to the Future (1985)                    3.357941
Sting, The (1973)                            3.329843
Cinderella (1950)                            3.245412
Field of Dreams (1989)                       3.222311
Wizard of Oz, The (1939)                     3.200268
Dumbo (1941)                                 2.981645
dtype: float64
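The grouped score can be checked by hand for Return of the Jedi. The ungrouped list printed earlier shows two terms for it (3.606146 and 3.362779), which must come from the two movies rated 5, since a term from a movie rated 1 cannot exceed 1. The short sketch below, using only numbers copied from the outputs above, works out what the third rated movie contributed. The variable names are made up for this example.

```python
# Values copied from the printed outputs above
jedi_from_rated_5 = [3.606146, 3.362779]  # terms from the two movies rated 5
jedi_total = 7.178172                     # grouped sum for Return of the Jedi

# Whatever remains must come from the third rated movie,
# Gone with the Wind, which user 0 rated 1
jedi_remainder = round(jedi_total - sum(jedi_from_rated_5), 6)
print(jedi_remainder)
```

The remainder is small but positive, so with this scoring scheme a movie the user rated 1 still pushes its correlated movies up slightly rather than down.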

2. Improving the recommendations

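Try the following to improve the recommendations:
1) Correlation coefficient: use the Spearman correlation coefficient.
2) Change the value of min_period.
3) When the user dislikes a movie, movies similar to it should not be recommended.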

def recommand_movie(method='pearson', min_period=100, weight=0):
    # Choose the correlation method and the minimum number of common ratings
    corrMatrix = userRatings.corr(method=method, min_periods=min_period)
    # Select the row for user_id 0 as the test user
    myRatings = userRatings.loc[0].dropna()
    # Recommend movies for the test user
    simCandidates = pd.Series(dtype='float64')
    for i in range(myRatings.shape[0]):
        # Correlations between this movie and every other movie
        sims = corrMatrix[myRatings.index[i]].dropna()
        if myRatings.iloc[i] == 1:
            # The user dislikes this movie (rating of 1), so movies similar to it are penalized
            sims = sims.map(lambda x: x * myRatings.iloc[i] * weight * (-1))
        else:
            # Multiply each similarity by the user's rating of this movie
            sims = sims.map(lambda x: x * myRatings.iloc[i])
        # Add to the candidates
        simCandidates = pd.concat([simCandidates, sims], axis=0)

    # Sum the scores by movie title
    simCandidates = simCandidates.groupby(simCandidates.index).sum()
    # Sort the scores in descending order
    simCandidates.sort_values(inplace=True, ascending=False)
    # Drop the movies the user has already rated, since there is no need to recommend them
    simCandidates = simCandidates.drop(myRatings.index)
    print("Recommended movies for user 0:")
    print(simCandidates.head(10))

recommand_movie(method='spearman', min_period=100, weight=0)

Recommended movies for user 0:
Return of the Jedi (1983)                    6.407001
Raiders of the Lost Ark (1981)               4.528739
Indiana Jones and the Last Crusade (1989)    3.299785
Sting, The (1973)                            3.273064
Batman (1989)                                3.012647
Singin' in the Rain (1952)                   2.952571
Field of Dreams (1989)                       2.945751
Dumbo (1941)                                 2.894872
Jaws (1975)                                  2.867804
Star Trek: The Wrath of Khan (1982)          2.859166
dtype: float64
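For readers unfamiliar with the Spearman coefficient: it is the Pearson coefficient computed on ranks rather than on the raw values, so it measures whether two movies are ordered the same way by users, not whether the ratings are linearly related. The following is a minimal plain-Python sketch of that idea. The helper names and the sample data are assumptions for illustration, and giving tied values the average of their ranks is a choice made in this sketch.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y always grows with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 50]
print(pearson(x, y), spearman(x, y))
```

On this toy data the Spearman value is 1 because the ordering is identical, while the Pearson value is pulled below 1 by the outlier.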

recommand_movie(method='pearson', min_period = 80, weight= 0)

Recommended movies for user 0:
Return of the Jedi (1983)                    6.968925
Raiders of the Lost Ark (1981)               5.373883
Bridge on the River Kwai, The (1957)         3.366616
Indiana Jones and the Last Crusade (1989)    3.316717
Cinderella (1950)                            3.245412
Sting, The (1973)                            3.209627
Con Air (1997)                               3.204525
Back to the Future (1985)                    3.100622
Day the Earth Stood Still, The (1951)        3.087913
Field of Dreams (1989)                       3.068508
dtype: float64

recommand_movie(method='pearson', min_period = 100, weight= 0.5)

Recommended movies for user 0:
Return of the Jedi (1983)                    6.864302
Raiders of the Lost Ark (1981)               5.300974
Bridge on the River Kwai, The (1957)         3.366616
Cinderella (1950)                            3.245412
Indiana Jones and the Last Crusade (1989)    3.231061
Sting, The (1973)                            3.149519
Field of Dreams (1989)                       2.991607
Dumbo (1941)                                 2.981645
Back to the Future (1985)                    2.971963
Star Trek: The Wrath of Khan (1982)          2.968080
dtype: float64
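The effect of `weight` can be traced for a single movie using numbers already printed in this post. For Return of the Jedi, the two terms from the movies rated 5 in section 1 add up to 3.606146 + 3.362779 = 6.968925, and the rest of the section 1 total (7.178172 - 6.968925 = 0.209247) is the term from Gone with the Wind, which user 0 rated 1. The sketch below applies the weighting rule from `recommand_movie` to those two values. The function name `jedi_score` is made up, and both defaults are derived from the printed outputs rather than recomputed from the data.

```python
def jedi_score(weight, base=6.968925, disliked_term=0.209247):
    """Score of one candidate under the weighting rule in recommand_movie.

    base          -- sum of the terms from the two movies rated 5
    disliked_term -- rating * correlation for the movie rated 1
    """
    # A movie rated 1 contributes -weight * rating * correlation
    return base - weight * disliked_term


print(jedi_score(0))
print(jedi_score(0.5))
```

With `weight=0.5` this gives roughly 6.864302, which agrees with the last run above. With `weight=0` the disliked movie's term is dropped entirely, giving 6.968925, the same score printed for Return of the Jedi in the `min_period = 80` run. So `weight=0` does not reproduce the section 1 total of 7.178172.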