Python Movie Recommendation System

Latest recommended article published on 2022-11-01 18:20:22

Harbour_zhang

Tags: python, data mining, recommender systems

Article link: https://blog.csdn.net/Harbour_zhang/article/details/106037974

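This article shows how to build a collaborative-filtering movie recommender in Python, using the Pearson correlation coefficient as the similarity measure. A crawler collects user rating data, and recommendations are then generated for a target user (by default, yourself). Running three files is enough to produce the recommendations.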

Summary generated automatically by CSDN.

A Python implementation of collaborative-filtering movie recommendation based on the Pearson correlation coefficient.
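To make the idea concrete before the crawler code, here is a minimal sketch of Pearson-based user-to-user collaborative filtering. It is not the author's implementation: the function names `pearson` and `recommend`, the nested-dict rating layout, and the sample ratings are all assumptions made for illustration. Similarity is computed only over movies both users have rated, and unseen movies are scored by a similarity-weighted average over positively correlated users.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies that both users have rated."""
    common = [m for m in ratings_a if m in ratings_b]
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to measure correlation
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rating vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def recommend(target, all_ratings, top_n=3):
    """Rank movies the target has not rated by similarity-weighted average score."""
    totals = {}
    sim_sums = {}
    seen = all_ratings[target]
    for other, ratings in all_ratings.items():
        if other == target:
            continue
        sim = pearson(seen, ratings)
        if sim <= 0:
            continue  # ignore uncorrelated or negatively correlated users
        for movie, score in ratings.items():
            if movie not in seen:
                totals[movie] = totals.get(movie, 0.0) + sim * score
                sim_sums[movie] = sim_sums.get(movie, 0.0) + sim
    ranked = sorted(((totals[m] / sim_sums[m], m) for m in totals), reverse=True)
    return [(movie, round(score, 2)) for score, movie in ranked[:top_n]]


# Hypothetical sample data: user -> {movie: rating}
ratings = {
    "me": {"A": 5, "B": 3, "C": 4},
    "u1": {"A": 5, "B": 3, "C": 4, "D": 5},
    "u2": {"A": 1, "B": 5, "C": 2, "D": 1, "E": 4},
}

print(recommend("me", ratings))
```

In this sample, `u1` agrees with `me` perfectly while `u2` is negatively correlated and therefore ignored, so only `u1`'s unseen movie `D` is recommended. Dropping non-positive similarities is one common design choice; other schemes keep negative weights or normalize by each user's mean rating.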

Collecting user data with a crawler

# -*- coding: utf-8 -*-
"""
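Crawl the first 100 users who rated a given film on Douban and save their review data as movie.json.
For reliable data, use The Shawshank Redemption, No. 1 on the Douban Movie Top 250, and take the
first 100 users from its popular reviews.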
"""

import json
import re
import urllib.request
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

people_names = []
people_urls = []
# Compiled regex used to pull the user name out of a profile URL
r = re.compile(r'e/(.+)/')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/74.0.3724.8 Safari/537.36',
    'Referer': 'https://movie.douban.com/subject/26100958/comments',
    'Connection': 'keep-alive',
}

print("Crawling users ...")

# 5 pages * 20 comments per page = 100 users; change the range to crawl more or fewer.
# The subject ID here (27010768) differs from the one in the Referer header (26100958);
# set both to the film you want to crawl.
for i in range(0, 5):
    url = ("https://movie.douban.com/subject/27010768/comments?"
           "start=" + str(i * 20) + "&limit=20&sort=new_score&status=P&percent_type=")
    req = urllib.request.Request(url=url, headers=headers)
    data = urllib.request.urlopen(req).read().decode('utf-8')
    # data = requests.get(url,headers=headers)
    bs = BeautifulSoup(data, 'html.parser')
    comments = bs.find_all("div", {"class": "comment"})
    # Store each user's profile URL in people_urls
    for comment in comments:
        people_url = comment.find_all("a")[1].attrs["href"].replace("www", "movie")
        name = re.findall(r, people_url)[0]
        people_names.append(name)
        people_urls.append(people_url)

print("Finished crawling users")

# Map each user name to a dict that starts with the user's profile URL
final_data = {}
for name, url in zip(people_names, people_urls):
    final_data.setdefault(name, {})["people_url"] = url

print("Crawling user reviews ...")

user_count = 1
for people_name in final_data:
    print("Crawling reviews of user #" + str(user_count) + ": " + people_name)
    user_count += 1
    # Crawl the user's first 90 reviews (6 pages * 15 per page)
    for i in range(0, 6):
        # Build the URL suffix for this page of reviews
        comment_url_suffix = ("collect?start=" + str(i * 15) +