Movie Information Scraping and Cluster Analysis

Most recent recommended article published on 2024-06-22 20:09:44

Davy_Zhuang

Category: Python. Tags: clustering, python, data analysis, recommender systems

Article link: https://blog.csdn.net/u013300280/article/details/107760341

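Requirements: scrape movie-related data, at least 1,000 records, with a self-defined structure that must include sentiment information, genre, review keywords, and similar fields; then use this information to perform relevance clustering according to user preferences.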

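I. Overall Design

(1) Scrape data for 50 films from Douban Movies, including title, country, runtime, cast, director, genre, rating, and number of ratings.
(2) Scrape the short reviews of each film, including username, rating, review text, and number of upvotes.
(3) Process the scraped data and write it to the corresponding CSV files.
(4) Read the CSV files and preprocess the data: drop the features that do not take part in clustering and convert non-numeric features to numeric ones.
(5) Reduce the dimensionality of the data and cluster it with K-means.
(6) Visualize the clustering results, then analyze and summarize them.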
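
Steps (4) and (5) can be sketched without any third-party library. The example below is an illustration only, not the implementation used in this article: the four sample films, the helper names (`one_hot_genres`, `farthest_first_init`, `kmeans`), and the deterministic farthest-first seeding are all assumptions made for the sketch, and the dimensionality-reduction step is left out for brevity.

```python
def one_hot_genres(genre_lists):
    """Encode each film's genre list as a 0/1 vector over a sorted vocabulary."""
    vocab = sorted({g for genres in genre_lists for g in genres})
    vectors = [[1.0 if g in genres else 0.0 for g in vocab] for genres in genre_lists]
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update
    until the labels stop changing."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical sample data: (title, genres, rating on a 0-10 scale).
films = [
    ("Film A", ["crime", "drama"], 9.7),
    ("Film B", ["crime", "drama"], 9.3),
    ("Film C", ["animation", "comedy"], 9.0),
    ("Film D", ["animation", "family"], 8.9),
]

vocab, genre_vectors = one_hot_genres([genres for _, genres, _ in films])
# Append the rating, scaled to [0, 1] so it is comparable to the 0/1 genre flags.
points = [vec + [rating / 10.0] for vec, (_, _, rating) in zip(genre_vectors, films)]

labels, centroids = kmeans(points, k=2)
for (title, _, _), label in zip(films, labels):
    print(title, "-> cluster", label)
```

With this toy data, Film A and Film B end up in one cluster and Film C and Film D in the other. On real data the features would normally be standardized first, and a library implementation with several random restarts would be a more robust choice than this single deterministic run.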

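II. Detailed Design

(1) Scrape data for 50 films from Douban Movies, including title, country, runtime, cast, director, genre, rating, and number of ratings.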
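
One fragile spot when extracting these fields is the combined "year / country / genre" line on each list entry. A minimal sketch of parsing it is shown here, assuming the line uses "/" as the separator and non-breaking spaces as padding; the helper name `parse_info_line` and the sample strings are hypothetical and are not part of the original script. Reading the country and genres from the end of the line keeps the indices stable when an entry lists more than one release date.

```python
def parse_info_line(raw):
    """Split a 'year / country / genres' line into (year, country, genres)."""
    # Normalize non-breaking spaces and newlines, then split on the slash.
    cleaned = raw.replace("\xa0", " ").replace("\n", " ")
    parts = [p.strip() for p in cleaned.split("/")]
    parts = [p for p in parts if p]
    if len(parts) < 3:
        raise ValueError(f"unexpected info line: {raw!r}")
    year = parts[0]             # first release date listed
    country = parts[-2]         # second-to-last field
    genres = parts[-1].split()  # last field, genres separated by spaces
    return year, country, genres


# Hypothetical sample in the assumed page format.
sample = "\n 1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n "
print(parse_info_line(sample))
```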

# Import libraries
import json
import re
import requests
from lxml import etree
import numpy as np
import csv

# Send a browser-like User-Agent with each request
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
# Each Top 250 page lists 25 films, so two pages give the 50 films needed
res = requests.get(url="https://movie.douban.com/top250?start=0&filter=", headers=header)  # page for ranks 1-25
res1 = requests.get(url="https://movie.douban.com/top250?start=225&filter=", headers=header)  # page for ranks 226-250
res.encoding = 'utf8'
res1.encoding = 'utf8'
text = res.text
text1 = res1.text
tree = etree.HTML(text)
tree1 = etree.HTML(text1)
# One "info" <div> per film on each page
items = tree.xpath('//ol/li/div/div[@class="info"]')
items1 = tree1.xpath('//ol/li/div/div[@class="info"]')
director = []  # director
film = []  # film title
film_date = []  # release date
film_country = []  # country of production
film_type = []  # genre
star = []  # rating
assess_num = []  # number of ratings
quote = []  # tagline
url = []  # film page URL
# Collect the film data from the first page (ranks 1-25)
for item in items:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    # Text nodes of the first <p>: [0] is the director/cast line, [1] is "date / country / genre"
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country = info_1[1]
    film_country.append(country)
    # Remove the run of indentation spaces that trails the genre text
    film_type.append(info_1[2].replace("                         ",""))
    # The director/cast line is split on runs of indentation spaces; index 4 is the text itself
    f_info = f_info[0].replace("\n","").split('       ')
    # Keep the text after the first colon and strip the "主演" (starring) label and ellipsis
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)
    film_star = item.xpath("./div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()")
    star.append(film_star[0])
    # Strip the "人评价" (people rated) suffix to keep only the count
    film_assess = item.xpath("./div[@class='bd']/div[@class='star']/span/text()")[1].replace("人评价","")
    assess_num.append(film_assess)
    film_quote = item.xpath("./div[@class='bd']/p/span[@class='inq']/text()")
    if len(film_quote)==0:
        film_quote = "无"  # "无" (none) marks a film without a tagline
    else:
        film_quote = film_quote[0]
    quote.append(film_quote)

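# Collect the film data from the last page (ranks 226-250)
for item in items1:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    # Text nodes of the first <p>: [0] is the director/cast line, [1] is "date / country / genre"
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country = info_1[1]
    film_country.append(country)
    # Remove the run of indentation spaces that trails the genre text
    film_type.append(info_1[2].replace("                         ",""))
    # The director/cast line is split on runs of indentation spaces; index 4 is the text itself
    f_info = f_info[0].replace("\n","").split('       ')
    # Keep the text after the first colon and strip the "主演" (starring) label and ellipsis
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)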
    film_star = item