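Movie Data Scraping and Cluster Analysis
Requirements: scrape at least 1,000 records of movie-related data. The structure is up to you, but it must include sentiment information, genre, review keywords, and similar fields. Then cluster the records by relevance according to user preferences.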
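I. Overall Design
(1) Scrape data for 50 films from Douban Movies, including title, country, runtime, lead actors, director, genre, rating, and number of ratings.
(2) Scrape the short reviews of each film, including username, rating, comment text, and number of upvotes.
(3) Process the scraped data and write it to the corresponding CSV files.
(4) Read the CSV files, analyze and preprocess the data, drop the features that do not take part in clustering, and convert non-numeric features to numeric ones.
(5) Reduce the dimensionality of the data and cluster it with K-means.
(6) Visualize the clustering result, then analyze and summarize it.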
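Steps (4) and (5) can be illustrated with a small, standard-library-only sketch. It is not the implementation used in this write-up: the helper names `multi_hot` and `kmeans`, the sample genre lists, and the choice to seed the centroids with the first k points are all assumptions made for illustration, and the dimensionality-reduction step is left out.

```python
import math

def multi_hot(genre_lists):
    """Encode each film's genre list as a 0/1 vector over the sorted vocabulary."""
    vocab = sorted({g for genres in genre_lists for g in genres})
    vectors = [[1.0 if g in genres else 0.0 for g in vocab] for genres in genre_lists]
    return vocab, vectors

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical sample data: genre lists for four films.
films = [
    ["crime", "drama"],
    ["animation", "fantasy"],
    ["drama", "romance"],
    ["animation", "adventure"],
]
vocab, vectors = multi_hot(films)
labels, centroids = kmeans(vectors, k=2)
print(vocab)
print(labels)  # [0, 1, 0, 1]: the two drama films and the two animation films
```

Seeding with the first k points keeps the example deterministic; a real run would more likely use random or k-means++ initialization, which a library implementation normally provides.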
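II. Detailed Design
(1) Scrape data for 50 films from Douban Movies, including title, country, runtime, lead actors, director, genre, rating, and number of ratings.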
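The 50 films come from two list pages of the Top 250 chart. As a minimal sketch, assuming each list page holds 25 entries and the `start` query parameter is the 0-based offset of the first entry, the page URL and the ranks it covers can be derived as follows. The names `page_url`, `rank_range`, and `PAGE_SIZE` are hypothetical helpers, not part of the original script.

```python
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # assumption: number of entries shown on one list page

def page_url(start):
    """Return the list-page URL whose first entry has 0-based offset `start`."""
    return f"{BASE_URL}?start={start}&filter="

def rank_range(start, page_size=PAGE_SIZE):
    """Return the (first, last) 1-based ranks shown on the page at offset `start`."""
    return start + 1, start + page_size

for start in (0, 225):
    first, last = rank_range(start)
    print(page_url(start), f"ranks {first}-{last}")
```

Under that assumption, offsets 0 and 225 select the first and the last page of the chart, 25 films each.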
# Import libraries
import json
import re
import requests
from lxml import etree
import numpy as np
import csv
header = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
res = requests.get(url="https://movie.douban.com/top250?start=0&filter=", headers=header)  # list page for ranks 1-25
res1 = requests.get(url="https://movie.douban.com/top250?start=225&filter=", headers=header)  # list page for ranks 226-250
res.encoding = 'utf8'
res1.encoding = 'utf8'
text = res.text
text1=res1.text
tree = etree.HTML(text)
tree1 = etree.HTML(text1)
items = tree.xpath('//ol/li/div/div[@class="info"]')
items1 = tree1.xpath('//ol/li/div/div[@class="info"]')
director = []      # director
film = []          # film title
film_date = []     # release year
film_country = []  # country of production
film_type = []     # genre
star = []          # rating
assess_num = []    # number of ratings
quote = []         # one-line recommendation
url = []           # film page URL
# Extract the films listed on the first page (start=0)
for item in items:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    # Second text line has the form "year / country / genres"
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country = info_1[1]
    film_country.append(country)
    film_type.append(info_1[2].replace(" ",""))
    # First text line holds the director and lead actors; this index depends on the page's leading whitespace
    f_info = f_info[0].replace("\n","").split(' ')
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)
    film_star = item.xpath("./div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()")
    star.append(film_star[0])
    # Strip the "人评价" ("people rated") suffix to keep only the count
    film_assess = item.xpath("./div[@class='bd']/div[@class='star']/span/text()")[1].replace("人评价","")
    assess_num.append(film_assess)
    film_quote = item.xpath("./div[@class='bd']/p/span[@class='inq']/text()")
    if len(film_quote)==0:
        film_quote = "无"  # placeholder ("none") for films without a recommendation line
    else:
        film_quote = film_quote[0]
    quote.append(film_quote)
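The loop above splits the "year / country / genres" line by position. The same step can be sketched as a small standalone function, which also makes it reusable for the second page. This is an illustrative variant, not the original logic: the name `parse_info_line` is hypothetical, it splits from the right so that any extra "/" stays inside the year field, and it returns countries and genres as lists instead of concatenated strings.

```python
def parse_info_line(raw):
    """Split a 'year / country / genres' line into its three fields."""
    # Non-breaking spaces surround the "/" separators on the page.
    cleaned = raw.replace("\xa0", " ").strip()
    # Split from the right: the last two fields are country and genres.
    parts = [p.strip() for p in cleaned.rsplit("/", 2)]
    if len(parts) != 3:
        raise ValueError(f"unexpected info line: {raw!r}")
    year, country, genres = parts
    return {
        "year": year,
        "countries": country.split(),  # several countries are space-separated
        "genres": genres.split(),      # several genres are space-separated
    }

# Hypothetical sample in the shape produced by the p[1]/text() query.
sample = "\n        1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n    "
print(parse_info_line(sample))
```

Raising `ValueError` on an unexpected layout makes a page-format change visible immediately instead of silently shifting fields.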
# Extract the films listed on the second page (start=225)
for item in items1:
    film_url = item.xpath("./div[@class='hd']/a/@href")
    url.append(film_url[0])
    film_name = item.xpath("./div[@class='hd']/a/span[1]/text()")[0]
    film.append(film_name)
    f_info = item.xpath("./div[@class='bd']/p[1]/text()")
    # Second text line has the form "year / country / genres"
    info_1 = f_info[1].replace("\xa0","").replace("\n"," ").split("/")
    film_date.append(info_1[0].replace(" ",""))
    country = info_1[1]
    film_country.append(country)
    film_type.append(info_1[2].replace(" ",""))
    # First text line holds the director and lead actors
    f_info = f_info[0].replace("\n","").split(' ')
    director_deal = f_info[4].split(":")[1].replace("主演","").replace("...","").replace("主","").replace("\xa0","").replace("\n"," ")
    director.append(director_deal)
    film_star = item