Short reviews of 《霸王别姬》 (Farewell My Concubine)
Target URL: https://movie.douban.com/subject/1291546/comments?status=P
First, install two modules:
pip install jieba -i https://pypi.douban.com/simple
pip install wordcloud -i https://pypi.douban.com/simple
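Scraping the data
Fields to collect for the Excel sheet: username, like count, star rating, time, and comment text
1) Fetch the first page of data
(username, like count, star rating, time, comment text)
Code:
Output: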
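As a rough illustration of this step, here is a minimal sketch of pulling the five fields out of a page with lxml. It is not the original code: the sample markup, the class names used in the XPath expressions, and the helper names `first` and `parse_comments` are all assumptions made for the example, and the real page may be structured differently.

```python
from lxml import etree

# Hypothetical, simplified imitation of one comment block. The real page
# markup may differ, so the XPath expressions below may need adjusting.
SAMPLE_HTML = '''
<div class="comment-item">
  <span class="votes">120</span>
  <span class="comment-info">
    <a href="#">some_user</a>
    <span class="allstar50 rating"></span>
    <span class="comment-time">2021-05-23</span>
  </span>
  <span class="short">A masterpiece.</span>
</div>
'''


def first(nodes, default=''):
    """Return the first XPath result stripped of whitespace, or a default."""
    return nodes[0].strip() if nodes else default


def parse_comments(html_text):
    """Extract (username, votes, star, time, comment) tuples from page HTML."""
    html = etree.HTML(html_text)
    rows = []
    for item in html.xpath('//div[contains(@class, "comment-item")]'):
        username = first(item.xpath('.//span[@class="comment-info"]/a/text()'))
        votes = first(item.xpath('.//span[contains(@class, "votes")]/text()'), '0')
        # Assumption: the rating is encoded in the class name ("allstar50" = 5 stars).
        star_class = first(item.xpath('.//span[contains(@class, "rating")]/@class'))
        star = star_class[7] if star_class.startswith('allstar') else '0'
        when = first(item.xpath('.//span[contains(@class, "comment-time")]/text()'))
        comment = first(item.xpath('.//span[@class="short"]/text()'))
        rows.append((username, votes, star, when, comment))
    return rows


print(parse_comments(SAMPLE_HTML))
```

For a live page, the text of a response obtained with `requests.get(url, headers=headers)` could be passed to `parse_comments` instead of the sample string.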
2) Write to an Excel file with the xlwt and openpyxl modules
Code:
Output:
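The writing step might look roughly like the sketch below. It uses only openpyxl (which produces .xlsx files) and leaves out xlwt, which targets the older .xls format. The column titles, the sample row, the output filename, and the helper name `save_rows` are made up for illustration.

```python
import os
import tempfile

import openpyxl

# Assumed column titles matching the five fields listed earlier.
HEADER = ['Username', 'Likes', 'Stars', 'Time', 'Comment']


def save_rows(rows, path):
    """Write a header row plus one row per comment to an .xlsx file."""
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = 'comments'
    sheet.append(HEADER)
    for row in rows:
        sheet.append(list(row))
    workbook.save(path)


# Made-up sample row and a hypothetical output location.
rows = [('some_user', '120', '5', '2021-05-23', 'A masterpiece.')]
out_path = os.path.join(tempfile.gettempdir(), 'comments_demo.xlsx')
save_rows(rows, out_path)
```

Appending whole rows keeps the column order in one place (`HEADER`), which makes it harder for the header and the data to drift apart.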
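3) Try to fetch multiple pages
- Add a cookie
- Try to fetch multiple pages, and add an exception-handling mechanism.
- For the response: guard against a single page being intercepted during the crawl
- Element when there is a next page:
- Elements when there is no next page: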
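The multi-page loop with exception handling can be sketched as follows. This is an assumption-laden outline rather than the original code: the `start` and `limit` query parameter names are guesses about how the site paginates, and `fetch` and `parse` are hypothetical callables injected so the loop can be tried without network access. The sketch catches a broad `Exception` to stay dependency-free; with requests one would more likely catch `requests.RequestException`.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/1291546/comments?'


def build_page_url(page, page_size=20):
    """Build the URL for a zero-based page index (parameter names are assumed)."""
    params = {'start': page * page_size, 'limit': page_size, 'status': 'P'}
    return BASE_URL + urlencode(params)


def crawl(fetch, parse, max_pages=5, delay=0.0):
    """Collect rows page by page until there is no next page or a fetch fails.

    fetch(url) returns the page text or raises on failure.
    parse(text) returns (rows, has_next).
    """
    rows = []
    for page in range(max_pages):
        url = build_page_url(page)
        try:
            text = fetch(url)
        except Exception as exc:  # broad on purpose in this sketch
            print(f'page {page} failed: {exc}')
            break
        page_rows, has_next = parse(text)
        rows.extend(page_rows)
        if not has_next:
            break
        time.sleep(delay)  # pause between requests
    return rows
```

Stopping at the first failed page and returning what was collected so far is one possible design choice. Skipping the page and continuing would be another, though a blocked request often means later ones will be blocked too.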
Check whether there is a next page:
def parse_data(resp):
    '''Parse the response'''
    ...
    res = html.xpath('//*[contains(@id, "paginator")]/a[contains(@class, "next")]')
    if not res:
        print('There is no more content on the next page...')
        return None
    else:
        # return the data
        return True
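To see the XPath check in isolation, here is a small self-contained sketch that runs it against two hand-written snippets. The paginator markup in `WITH_NEXT` and `WITHOUT_NEXT` is an assumed simplification (a link when a next page exists, a plain span when it does not), and `has_next_page` is a hypothetical helper name.

```python
from lxml import etree

# Same XPath expression as in parse_data above.
NEXT_XPATH = '//*[contains(@id, "paginator")]/a[contains(@class, "next")]'

# Assumed, simplified paginator markup for illustration only.
WITH_NEXT = '<div id="paginator"><a href="?start=20" class="next">next</a></div>'
WITHOUT_NEXT = '<div id="paginator"><span class="next">next</span></div>'


def has_next_page(html_text):
    """Return True when the paginator contains a clickable "next" link."""
    html = etree.HTML(html_text)
    return bool(html.xpath(NEXT_XPATH))


print(has_next_page(WITH_NEXT), has_next_page(WITHOUT_NEXT))
```

Because the expression only matches `a` elements, a "next" label rendered as a non-link element yields an empty result list, which is what the `if not res` branch relies on.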
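Note: the star rating also needs handling, otherwise stray characters get mixed in, for example: m
if star.isalpha():  # some users did not give a star rating
    star = '0'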
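A slightly stricter variant of this check, shown here as a sketch, treats anything that is not a digit as "no rating", which also covers empty strings and punctuation. The helper name `normalize_star` is hypothetical, and swapping `isalpha()` for `isdigit()` is a change from the snippet above rather than the original approach.

```python
def normalize_star(star):
    """Return the rating as a digit string, or '0' when no valid rating was found."""
    if not star.isdigit():  # covers letters such as 'm', empty strings, punctuation
        return '0'
    return star


print([normalize_star(s) for s in ['5', 'm', '3', '']])
```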
Full code:
import requests
from lxml import etree
import openpyxl
import time
url = 'https://movie.douban.com/subject/1291546/comments?'
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/90.0.4430.212 Safari/537.36 ',
'Cookie': '_vwo_uuid_v2=D21A90EEC60A84B63D467FE996BB1DF1D|12adfdc440b6565a76940f86fc6a8472; douban-fav-remind=1; '
'll="118283"; bid=1MCaZCO3JIA; __yadk_uid=TkyAIhUGdwemFLKFoOVBKys5VHsdSjsr; '
'__gads=ID=4018de559e3adebb-22be7e1fccc60064:T=1616471893:RT=1616471893:S=ALNI_Maoo2'
'-1ph1qkwk3DfdPpnN1xqoiQA; '
'_vwo_uuid_v2=D21A90EEC60A84B63D467FE996BB1DF1D|12adfdc440b6565a76940f86fc6a8472; __utmc=30149280; '
'__utmc=223695111; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1621742461%2C%22https%3A%2F%2Fcn.bing.com'
'%2F%22%5D; _pk_ses.100001.4cf6=*; dbcl2="218934882:7Y/YTUP8/mk"; ck=yFqV; '
'__utma=30149280.848034361.1580634472.1621734964.1621742494.28; __utmb=30149280.0.10.1621742494; '
'__utmz=30149280.1621742494.28.22.utmcsr=accounts.douban.com|utmccn=('
'referral)|utmcmd=referral|utmcct=/; __utma=223695111.597269736.1580634472.1621734964.1621742494.27; '
'__utmb=223695111.0.10.1621742494; '