Predicting Movie Ratings with KNN (Python 3)

闲庭信步的空间

Category: Data Analysis. Tags: python, data analysis, data mining

Article link: https://blog.csdn.net/danspace1/article/details/130276205

Contents

0. Theory
1. Predicting movie ratings with KNN
2. References

0. Theory

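KNN (k-nearest neighbors) finds the k observations closest to a new point on a scatter plot and lets them vote: the point is assigned to the class that accounts for the largest share of those k neighbors. For a numeric target such as a movie rating, the neighbors' values are averaged instead of voted on, which is what the code in section 1 does.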
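The voting rule can be illustrated with a minimal sketch in plain Python. The function name `knn_classify`, the toy points, and the choice of Euclidean distance are all assumptions made for illustration; they do not come from the book's code.

```python
from collections import Counter
import math

def knn_classify(points, labels, query, k):
    """Return the majority label among the k points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    by_distance = sorted(
        (math.dist(p, query), label) for p, label in zip(points, labels)
    )
    # Keep the labels of the k nearest points and take a majority vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters on a 2-D scatter plot.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(points, labels, query=(1.1, 1.0), k=3))
print(knn_classify(points, labels, query=(4.9, 5.0), k=3))
```

An odd k is a common choice for two-class problems because it rules out tied votes.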

1. Predicting movie ratings with KNN

import pandas as pd
import numpy as np
# Read the movie rating data (tab-separated: user ID, movie ID, rating)
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('E:/python/python数据科学与机器学习/《Python数据科学与机器学习：从入门到实践》源代码/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()

	user_id	movie_id	rating
0	0	50	5
1	0	172	5
2	0	133	1
3	196	242	3
4	186	302	3

# Group by movie_id and compute each movie's rating count (size) and mean rating

movieProperties = ratings.groupby('movie_id')['rating'].agg(['size', 'mean']).reset_index()
movieProperties.head()

	movie_id	size	mean
0	1	452	3.878319
1	2	131	3.206107
2	3	90	3.033333
3	4	209	3.550239
4	5	86	3.302326

# Min-max normalize the rating count to the range [0, 1]
movieProperties['size'] = (movieProperties['size'] - movieProperties['size'].min())/(movieProperties['size'].max() - movieProperties['size'].min())
movieProperties.head()

	movie_id	size	mean
0	1	0.773585	3.878319
1	2	0.222985	3.206107
2	3	0.152659	3.033333
3	4	0.356775	3.550239
4	5	0.145798	3.302326
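The normalization step matters because the rating count is later added to a cosine distance that itself lies between 0 and 1 for non-negative vectors; without rescaling, the raw counts would dominate the sum. Below is a small plain-Python sketch of the same min-max formula. The helper name `min_max_scale` and the count list are hypothetical; in particular, 1 and 584 are assumed extremes, chosen because they are consistent with the normalized values printed above.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Assumed rating counts; 1 and 584 are guesses for the dataset's extremes.
counts = [1, 131, 452, 584]
scaled = min_max_scale(counts)
for count, value in zip(counts, scaled):
    print(count, round(value, 6))
```

Under that assumption, the counts 452 and 131 scale to 0.773585 and 0.222985, matching the first two rows of the table above.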

movieProperties.tail()

	movie_id	mean
1677	1678	1.0
1678	1679	3.0
1679	1680	2.0
1680	1681	3.0
1681	1682	3.0

# Read the movie genre data
movieDict = {}
with open(r'E:/python/python数据科学与机器学习/《Python数据科学与机器学习：从入门到实践》源代码/ml-100k/u.item', encoding='ISO-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        # Genre flags (0/1), converted from strings to a list of ints
        genres = list(map(int, fields[5:25]))
        # Normalized rating count, used as a popularity measure
        size = movieProperties.loc[movieProperties.movie_id == movieID, 'size'].values[0]
        # Mean rating
        mean = movieProperties.loc[movieProperties.movie_id == movieID, 'mean'].values[0]
        # Store (name, genre vector, popularity, mean rating) under the movie ID
        movieDict[movieID] = (name, np.array(genres), size, mean)

print(movieDict[1])

('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)
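To make the parsing step easier to follow, here is a self-contained sketch that applies the same slicing to a single made-up record. The sample line, its URL, and the helper name `parse_item_line` are hypothetical; the layout (five metadata fields followed by 0/1 genre flags) is inferred from the `fields[5:25]` slice and the 19-element array printed above, not from the dataset's documentation.

```python
def parse_item_line(line):
    """Split one pipe-delimited record into (movie_id, name, genre flags)."""
    fields = line.rstrip("\n").split("|")
    movie_id = int(fields[0])
    name = fields[1]
    # Slicing past the end of a list is safe in Python: it just stops early.
    genres = [int(flag) for flag in fields[5:25]]
    return movie_id, name, genres

# Hypothetical record: ID, title, two dates, URL, then 19 genre flags.
flags = ["0", "1", "0", "0", "0", "1"] + ["0"] * 13
sample = "99|Example Movie (1999)|01-Jan-1999||http://example.com/|" + "|".join(flags) + "\n"

movie_id, name, genres = parse_item_line(sample)
print(movie_id, name, len(genres), genres)
```

Each position in the genre list stands for one genre, so two movies can be compared by comparing their flag vectors.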

from scipy import spatial

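# Compute the distance between two movies (smaller means more similar)
def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    # Genre distance: cosine distance between the two genre vectors
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    # Popularity distance: absolute difference of the normalized rating counts
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance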
    
ComputeDistance(movieDict[2], movieDict[4])

0.8004574042309892
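Cosine distance is one minus the cosine of the angle between two vectors, so movies with identical genre flags are at distance 0 and movies with no genre in common are at distance 1. The sketch below computes it by hand in plain Python as an illustration of the formula; it is not scipy's implementation, and the genre vectors and the name `cosine_distance` are made up. It does not handle all-zero vectors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical genre flags: each movie has three genres, one of them shared.
movie_a = [1, 1, 0, 0, 1]
movie_b = [1, 0, 1, 1, 0]

genre_distance = cosine_distance(movie_a, movie_b)
# Popularity values taken from the normalized 'size' column for movies 2 and 4.
popularity_distance = abs(0.222985 - 0.356775)
print(genre_distance, popularity_distance, genre_distance + popularity_distance)
```

With one shared genre out of three on each side, the genre distance is 1 - 1/3, about 0.667. Adding the popularity gap of about 0.134 gives roughly 0.80, which is consistent with the value printed above.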

def getNeighbors(movieID, k):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    # Sort by distance in ascending order: the more similar a movie, the smaller its distance
    distances.sort(key=lambda x: x[1])
    # The k nearest neighbors
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    # Look up each neighbor's mean rating in the dictionary and average them
    print('Names and mean ratings of the {} nearest neighbors:\n'.format(k))
    avgRating = 0
    for neighbor in neighbors:
        avgRating += movieDict[neighbor][3]
        print('{:<45s}{}'.format(movieDict[neighbor][0], movieDict[neighbor][3]))
    avgRating /= k
    print('\nMean rating across the {} neighbors: {}'.format(k, avgRating))
    print('Actual mean rating of movie ID {}: {}'.format(movieID, movieDict[movieID][3]))

getNeighbors(1, 10)

Names and mean ratings of the 10 nearest neighbors:

Liar Liar (1997)                             3.156701030927835
Aladdin (1992)                               3.8127853881278537
Willy Wonka and the Chocolate Factory (1971) 3.6319018404907975
Monty Python and the Holy Grail (1974)       4.0664556962025316
Full Monty, The (1997)                       3.926984126984127
George of the Jungle (1997)                  2.685185185185185
Beavis and Butt-head Do America (1996)       2.7884615384615383
Birdcage, The (1996)                         3.4436860068259385
Home Alone (1990)                            3.0875912408759123
Aladdin and the King of Thieves (1996)       2.8461538461538463

Mean rating across the 10 neighbors: 3.3445905900235564
Actual mean rating of movie ID 1: 3.8783185840707963
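The function above prints its result, which makes it awkward to reuse, for example to compare predictions for several values of k. One possible variant, sketched below under stated assumptions, returns the neighbor IDs and the predicted rating instead. The name `predict_rating`, the toy catalogue, and the simplified `popularity_gap` distance are all hypothetical; only the tuple layout (name, genres, popularity, mean rating) follows the article.

```python
import heapq

def predict_rating(movies, target_id, k, distance):
    """Return (neighbor IDs, mean rating of the k nearest neighbors)."""
    # Lazily pair every other movie's distance to the target with its ID.
    candidates = (
        (distance(movies[target_id], movies[other]), other)
        for other in movies
        if other != target_id
    )
    # nsmallest picks the k closest movies without sorting the whole list.
    nearest = heapq.nsmallest(k, candidates)
    neighbor_ids = [movie_id for _, movie_id in nearest]
    predicted = sum(movies[m][3] for m in neighbor_ids) / len(neighbor_ids)
    return neighbor_ids, predicted

def popularity_gap(a, b):
    """Simplified stand-in distance: difference in popularity only."""
    return abs(a[2] - b[2])

# Hypothetical catalogue: ID -> (name, genre flags, popularity, mean rating).
toy_movies = {
    1: ("Movie A", [1, 0], 0.9, 4.0),
    2: ("Movie B", [1, 0], 0.8, 3.5),
    3: ("Movie C", [1, 0], 0.6, 4.5),
    4: ("Movie D", [0, 1], 0.1, 2.0),
}

ids, predicted = predict_rating(toy_movies, target_id=1, k=2, distance=popularity_gap)
print(ids, predicted)
```

Passing the distance function as a parameter keeps the neighbor search independent of how similarity is defined, so the genre-plus-popularity distance from the article could be swapped in unchanged.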

2. References

《Python数据科学与机器学习：从入门到实践》 (Python Data Science and Machine Learning: From Beginner to Practice)
Author: Frank Kane (US)

Source code download:
https://www.ituring.com.cn/book/2426
