Scraping Douban's most popular movie reviews, pages 1–5: movie title, author, review time, recommendation level, and the full review text. (Everything is written in Python.)
1. The URL to scrape:
https://movie.douban.com/review/best/?start=0
Notes:
(1) Because we want pages 1–5, we need to turn pages automatically by changing each page's URL.
(2) The page is updated in real time, so scrapes run at different times will return different content.
(Note: there are five pages in total. Each page's URL adds 20 to the previous page's start parameter; page 1 has start=0, so page 5, the last page, has start=80. With this rule we can build all five URLs in a loop, as shown later in the article.)
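The pagination rule above can be sketched on its own before touching the network; the step of 20 and the stop value of 100 follow directly from the observation that start runs 0, 20, ..., 80:

```python
# Each page advances the `start` query parameter by 20; page 1 is start=0,
# so page 5 (the last) is start=80.
urls = ['https://movie.douban.com/review/best/?start={}'.format(i)
        for i in range(0, 100, 20)]
print(urls[0])   # https://movie.douban.com/review/best/?start=0
print(urls[-1])  # https://movie.douban.com/review/best/?start=80
```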
2. Required libraries:
import xlwt
import requests
import re
from lxml import etree
(Look up each library's documentation for the details of its methods.)
3. Page analysis:
(1) Hover over the element you want to examine, right-click, and choose "Inspect" to see where it sits in the page.
(2) The movie title is hidden in an <img> tag (its alt attribute).
(3) The full review text is loaded via JavaScript, so we first scrape each review's own hidden URL, and then scrape the review content from that page.
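The two extraction patterns from this analysis can be tried on a simplified, hypothetical snippet first. The markup below is illustrative only, not Douban's exact HTML; note that on the real page an author link precedes the review link, which is why the full code later indexes the second match (`[1][0]`), whereas this stripped-down snippet has only one:

```python
import re

# Hypothetical, simplified review block for illustration; the real Douban
# markup carries more attributes but follows the same shape.
snippet = '''
<div data-cid="123">
  <a href="https://movie.douban.com/subject/1/"><img alt="肖申克的救赎"></a>
  <h2><a href="https://movie.douban.com/review/123/">一篇影评</a></h2>
</div>
'''
# The movie title hides in the <img> tag's alt attribute.
title = re.findall(r'<a.*?<img alt="(.*?)"', snippet)[0]
# Anchors with plain-text content; the review link is among them.
links = re.findall(r'<a href="(.*?)">([^<]+)</a>', snippet)
review_url = links[0][0]
print(title)       # 肖申克的救赎
print(review_url)  # https://movie.douban.com/review/123/
```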
4. Writing the code:
import xlwt
import requests
import re
from lxml import etree

# Scrape pages 1-5
urls = []  # collect the five page URLs
for i in range(0, 100, 20):
    url = 'https://movie.douban.com/review/best/?start={}'.format(i)
    urls.append(url)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
'Cookie': 'bid=IjHHfUlQ7uU; _pk_id.100001.4cf6=ab2e39f5b94bf789.1697852638.; __yadk_uid=YOjYTdHZ1YVwr0c5Fvmr03tu5cPb3MjN; ll="118254"; _vwo_uuid_v2=D2D51327B1966A3B37670EFE56C7AAB55|b674032be46c968f5e7dbd14795ab301; Hm_lvt_16a14f3002af32bf3a75dfe352478639=1699694384; __utmz=30149280.1701867725.9.7.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmz=223695111.1701867725.9.7.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1702028493%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Duy5uI_dBa2ZoHrZZUQGJZ7BnLX3kQkqVGF33luSgiB9YsEWskyDnFjW6RM7EwlUDyOeNGnkkGNuwnPSiEHjJG7X7WggQtNSPPu7OFCoEC9dISRL16o0-fxWE6B8RfL34lyS1T3SJjtUMF2NexGk57_%26wd%3D%26eqid%3Dcc0879e8002c63ea00000005657070c9%22%5D; _pk_ses.100001.4cf6=1; __utma=30149280.849823623.1697852638.1702025221.1702028493.15; __utma=223695111.2080645572.1697852638.1702025221.1702028493.15; __utmb=223695111.0.10.1702028493; push_noty_num=0; push_doumail_num=0; dbcl2="156916174:AUGaxoSBmoY"; ck=Pq3B; __utmc=30149280; __utmt=1; __utmv=30149280.15691; __utmb=30149280.5.10.1702028493; __utmc=223695111; ct=y'
}
douban = []
for url in urls:
    response = requests.get(url, headers=headers)
    contents = response.content.decode('utf-8')
    # print(contents)
    # Each <div data-cid=...> block holds one review's full metadata
    divs = re.findall('<div data-cid=.*?>(.*?)</div>', contents, re.DOTALL)
    # print(divs)
    movies, authors, times, ranks, comments = [], [], [], [], []
    for j in divs:
        movie = re.findall('<a.*?<img alt="(.*?)"', j)  # movie title
        movies.append(movie[0])
        author = re.findall('<a.*?class="name">(.*?)</a>', j)  # review author
        authors.append(author[0])
        time = re.findall('<span.*?class="main-meta">(.*?)</span>', j)  # review time
        times.append(time[0])
        rank = re.findall('<span.*?title="(.*?)">', j)  # recommendation level
        if not rank:  # some reviews carry no recommendation level
            ranks.append('暂无推荐级别')
        else:
            ranks.append(rank[0])
        # URL of the full review page (the text itself is loaded via JavaScript)
        href_comments = re.findall('<a href="(.*?)">([^<]+)</a>', j)[1][0]
        # print(href_comments)
        response_comments = requests.get(href_comments, headers=headers)
        contents_comments = response_comments.content.decode('utf-8')
        html = etree.HTML(contents_comments)
        div_comments = html.xpath('//div[@data-author]//text()')  # review text nodes
        new_comment = []
        for comment in div_comments:
            # strip leftover tags and whitespace with re.sub()
            comment01 = re.sub(r'(<.*?>)|(\s+)|(\n)', "", comment)
            new_comment.append(comment01)
        # Drop empty strings with a comprehension (removing items from a list
        # while iterating over it skips elements), then join the fragments
        # into one string so it can be written to a single Excel cell.
        new_comment = [c for c in new_comment if c != '']
        comments.append(''.join(new_comment))
    # Pair the fields one-to-one with zip() and store each review as a dict
    for movie_2, author_2, time_2, rank_2, comment_2 in zip(movies, authors, times, ranks, comments):
        eg = {
            '电影片名': movie_2,
            '作者': author_2,
            '评论时间': time_2,
            '推荐级别': rank_2,
            '影评内容': comment_2
        }
        douban.append(eg)
# print(douban)
# print(len(douban))
'''Save the results to Excel'''
workbook = xlwt.Workbook(encoding="utf-8")
sheet = workbook.add_sheet('豆瓣影评')
title = list(douban[0].keys())
# print(title)
for i in range(len(title)):  # write the header row first
    sheet.write(0, i, title[i])
for row in range(1, len(douban) + 1):  # then fill in one row per review
    for col, key in enumerate(title):
        sheet.write(row, col, douban[row - 1][key])
workbook.save(r"D:\1,python学习\pythonProject1\pytest4--网络爬虫\爬取内容\Excel\豆瓣影评.xls")
5. The results as saved in Excel:
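If xlwt is not available, the same list of dicts can be saved with the standard library's csv module instead; this is an alternative to the article's Excel step, not part of it. The `douban` records below are stand-ins for the scraped data, and `douban_reviews.csv` is an assumed output name:

```python
import csv

# Stand-in records in the same shape the crawler builds; a real run would
# use the `douban` list populated by the scraping loop.
douban = [
    {'电影片名': 'A', '作者': 'x', '评论时间': '2023-01-01',
     '推荐级别': '力荐', '影评内容': '...'},
]
with open('douban_reviews.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=list(douban[0].keys()))
    writer.writeheader()   # header row, like the xlwt version's first row
    writer.writerows(douban)
```

The utf-8-sig encoding adds a byte-order mark so that Excel opens the Chinese headers correctly.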