Movie Data Scraping and Cluster Analysis

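Requirements: scrape movie-related data, at least 1,000 records, with a self-defined structure that includes sentiment information, genre, review keywords, and similar fields; then use this information to perform relevance clustering according to user preferences.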

I. Overall Design

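(1) Scrape data for 50 films from Douban Movie, including title, country, runtime, lead actors, director, genre, rating, and number of ratings.
(2) Scrape the short reviews of each film, including username, rating, review text, and number of upvotes.
(3) Process the scraped data and write it to the corresponding CSV files.
(4) Read the CSV files, analyze and preprocess the data, drop the features that do not take part in clustering, and convert non-numeric features to numeric ones.
(5) Reduce the dimensionality of the data and cluster it with K-means.
(6) Visualize the clustering results, then analyze and summarize them.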
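
Steps (4) and (5) can be sketched in miniature. The following is a minimal, standard-library-only illustration and not the post's own implementation: it multi-hot encodes genres, min-max scales the rating, and runs a small K-means. The dimensionality-reduction step is omitted, and the four toy films, their field names, and the helper names (`multi_hot`, `min_max`, `squared_distance`, `kmeans`) are assumptions made for the example.

```python
import random


def multi_hot(genres_per_film):
    """Encode each film's genre list as a 0/1 vector over the sorted genre vocabulary."""
    vocab = sorted({g for genres in genres_per_film for g in genres})
    vectors = [[1.0 if g in genres else 0.0 for g in vocab]
               for genres in genres_per_film]
    return vocab, vectors


def min_max(values):
    """Scale numbers to [0, 1] so they are comparable with the 0/1 genre columns."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(v - lo) / span for v in values]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50, seed=0):
    """Plain K-means: assign each point to its nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical toy rows standing in for the scraped CSV data
films = [
    {"title": "Film A", "genres": ["Crime", "Drama"], "rating": 9.7},
    {"title": "Film B", "genres": ["Crime", "Drama"], "rating": 9.5},
    {"title": "Film C", "genres": ["Animation", "Comedy"], "rating": 8.6},
    {"title": "Film D", "genres": ["Animation", "Comedy"], "rating": 8.8},
]

vocab, genre_vectors = multi_hot([f["genres"] for f in films])
ratings = min_max([f["rating"] for f in films])
features = [vec + [r] for vec, r in zip(genre_vectors, ratings)]
labels, centers = kmeans(features, k=2)

for f, label in zip(films, labels):
    print(f["title"], "-> cluster", label)
```

With real data, a library implementation of K-means would usually replace the hand-written loop; the sketch only shows the shape of the computation.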

II. Detailed Design

(1) Scrape data for 50 films from Douban Movie, including title, country, runtime, lead actors, director, genre, rating, and number of ratings.

# Import libraries
import json
import re
import requests
from lxml import etree
import numpy as np
import csv

# Send a browser-like User-Agent with each request
header = {
   "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
# Each Top 250 list page holds 25 films; `start` is the offset of the first film on the page
res = requests.get(url="https://movie.douban.com/top250?start=0&filter=",headers=header)  # page with ranks 1-25
res1 = requests.get(url="https://movie.douban.com/top250?start=225&filter=",headers=header)  # page with ranks 226-250
res.encoding = 'utf8'
res1.encoding = 'utf8'
text = res.text
text1 = res1.text
tree = etree.HTML(text)
tree1 = etree.HTML(text1)
# One "info" block per film on each page
items = tree.xpath('//ol/li/div/div[@class="info"]')
items1 = tree1.xpath('//ol/li/div/div[@class="info"]')
director = []  # directors
film = []  # film titles
film_date = []  # release year
film_country = []  # country of production
film_type = []  # genres
star = []  # rating
assess_num = []  # number of ratings
quote = []  # one-line recommendation quote
url = []  # film detail-page URL
# Extract film info from the first page (ranks 1-25)
for item in items:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    # p[1] yields two text nodes: "director / cast" and "year / country / genres"
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country = info_1[1]
    film_country.append(country)
    film_type.append(info_1[2].replace("                         ",""))
    f_info = f_info[0].replace("\n","").split('       ')
    # Keep the text after the director label and strip the trailing "主演" (starring) label
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)
    film_star = item.xpath("./div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()")
    star.append(film_star[0])
    # "人评价" means "people rated"; removing it leaves only the count
    film_assess = item.xpath("./div[@class='bd']/div[@class='star']/span/text()")[1].replace("人评价","")
    assess_num.append(film_assess)
    film_quote = item.xpath("./div[@class='bd']/p/span[@class='inq']/text()")
    if len(film_quote)==0:
        film_quote = "无"  # "无" ("none") marks a film without a quote
    else:
        film_quote = film_quote[0]
    quote.append(film_quote)
        
# Extract film info from the second page (ranks 226-250, since start=225)
for item in items1:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country =info_1[1]
    film_country.append(country)
    film_type.append(info_1[2].replace("                         ",""))
    f_info = f_info[0].replace("\n","").split('       ')
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)
    film_star = item
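
The two loops above repeat the same parsing for each page, and step (3) of the design writes the results to CSV. As a rough sketch only, the shared logic could be pulled into helpers like the ones below. The names `page_urls`, `parse_info_line`, and `write_rows` are hypothetical, the sample line is an assumed example of the "year / country / genres" text that the listing splits on "/", and no request is sent. Note that the listing itself fetches `start=0` and `start=225`, whereas `page_urls(2)` would produce `start=0` and `start=25`.

```python
import csv
import io


def page_urls(n_pages, page_size=25):
    """Build list-page URLs; `start` is the offset of the first film on each page."""
    base = "https://movie.douban.com/top250?start={}&filter="
    return [base.format(i * page_size) for i in range(n_pages)]


def parse_info_line(raw):
    """Split a 'year / country / genres' line into cleaned fields.

    The last two fields are taken as country and genres, so any extra
    slashes inside the date part stay in the year field.
    """
    parts = [p.strip() for p in raw.replace("\xa0", " ").split("/")]
    if len(parts) < 3:
        raise ValueError("expected at least three '/'-separated fields: %r" % raw)
    year = "/".join(parts[:-2])
    country = parts[-2]
    genres = parts[-1].split()
    return year, country, genres


def write_rows(rows, fileobj):
    """Write parsed rows as CSV with a header line."""
    writer = csv.writer(fileobj)
    writer.writerow(["year", "country", "genres"])
    for year, country, genres in rows:
        writer.writerow([year, country, " ".join(genres)])


# Assumed sample of the raw text node (not fetched from the site)
sample = "\n        1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n    "
row = parse_info_line(sample)

buffer = io.StringIO()  # stands in for a real file in this sketch
write_rows([row], buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `open(path, "w", newline="", encoding="utf-8")` matches what the `csv` module expects and keeps the Chinese text intact.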