4. Parsing and extraction with regular expressions and BeautifulSoup [60 comments each from Douban's positive, neutral, and negative short-comment pages]
4.1 General crawler workflow
- Analyze the target page and determine the URL pattern and headers parameters to use [check whether the page is static or dynamic]
- Send the request -- use requests to mimic a browser and fetch the response
- Parse the data -- build a BeautifulSoup object to navigate the HTML, then extract fields with re
- Save the data
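The four steps above can be sketched end-to-end without touching the network. In this offline sketch, `download` is a stand-in for `requests.get` that returns a canned page, the regex is a simplified stand-in for Douban's real markup, and `save` appends to an in-memory list instead of a file:

```python
import re

def download(url):
    # Stand-in for step 2 (requests.get): return a canned page
    return '<div class="comment"><span class="short">Great film</span></div>'

def parse(html_str):
    # Step 3: extract the fields we care about with re
    return re.findall('<span class="short">(.*?)</span>', html_str)

def save(rows, store):
    # Step 4: append to an in-memory store standing in for a file
    store.extend(rows)

store = []
save(parse(download('https://example.com/page1')), store)
print(store)  # ['Great film']
```

The real crawler in 4.3 follows the same shape, with `requests`, `BeautifulSoup`, and `pandas` filling in each step.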
4.2 Analysis of the Douban data to download
- How many comments each page contains
- How to download the detail fields of each comment
- How to download comments of the different types (positive / neutral / negative)
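The listing pages are parameterized by a `start` offset (20 comments per page) and a `percent_type` code, so enumerating pages is just string formatting. A sketch, assuming the URL template used in the source code below:

```python
base = ("https://movie.douban.com/subject/33404425/comments"
        "?start={}&limit=20&sort=new_score&status=P&percent_type={}")

urls = [base.format(start, ptype)
        for ptype in ('h', 'm', 'l')     # positive / neutral / negative
        for start in range(0, 60, 20)]   # three pages of 20 comments each

print(len(urls))  # 9 URLs: 3 types x 3 pages
```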
4.3 Source code
```python
import requests
import re
import time
import random
from bs4 import BeautifulSoup
```
4.3.1 Forging the request headers
```python
# Shared base_url for the three comment pages; start is the paging offset,
# percent_type selects positive (h), neutral (m) or negative (l) comments
BASE_URL = "https://movie.douban.com/subject/33404425/comments?start={}&limit=20&sort=new_score&status=P&percent_type={}"

# Replace the Cookie value with your own logged-in session cookie
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
    "Cookie": 'bid=6Y_umIrRUHk; __gads=ID=f3fa196be74c49f5:T=1589907087:S=ALNI_MbVwFaOcaNVABqsayjnOCawaNo-3A; gr_user_id=fe3032d1-40a6-4aef-93f4-054a36710beb; _vwo_uuid_v2=DE361BA9F9B9BACBDEB73CC87199709AE|bf1c5209c48152fea364a3ac6e60548f; ll="108296"; __yadk_uid=BNpZEeOtOgDz2raZXEavltn1VuJB005I; viewed="24715620_30231494"; __utma=30149280.669920134.1589907069.1593061398.1593764577.6; __utmc=30149280; __utmz=30149280.1593764577.6.6.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1593764586%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fq%3D%25E9%259A%2590%25E8%2597%258F%25E7%259A%2584%25E8%25A7%2592%25E8%2590%25BD%22%5D; _pk_ses.100001.4cf6=*; __utma=223695111.1716723746.1590498467.1590498467.1593764586.2; __utmb=223695111.0.10.1593764586; __utmc=223695111; __utmz=223695111.1593764586.2.2.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; ct=y; _pk_id.100001.4cf6=76ecf6aae620740b.1590498467.2.1593764786.1590498508.; __utmb=30149280.11.10.1593764577'
}
```
4.3.2 Fetching all comment blocks on a single page
```python
def get_html_comments_divs(url, headers):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html_str = response.text
    # Pass a parser explicitly; find_all is the modern name for findAll
    bsObj = BeautifulSoup(html_str, 'html.parser')
    soup_list = bsObj.find_all('div', {'class': 'comment'})
    return soup_list
```
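To see what `get_html_comments_divs` returns, here is an offline check against a hand-written snippet; the markup is a simplified stand-in for Douban's real page, but the `div.comment` selection works the same way:

```python
from bs4 import BeautifulSoup

snippet = '''
<div class="comment"><span class="short">First</span></div>
<div class="comment"><span class="short">Second</span></div>
<div class="other">ignored</div>
'''

soup = BeautifulSoup(snippet, 'html.parser')
divs = soup.find_all('div', {'class': 'comment'})
print(len(divs))  # 2: the div with class "other" is not matched
```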
4.3.3 Extracting the detail fields of each comment
```python
def get_comment(comments_divs, i, percent_type):
    comments_list = []
    for div in comments_divs:
        div = str(div)
        try:
            comments = {}
            comments['type'] = percent_type
            comments['votes'] = re.findall('<span class="votes">(.*?)</span>', div)[0]
            comments['rating'] = re.findall('<span class="allstar(.*?) rating"', div)[0]
            comments['date'] = re.findall('<span class="comment-time" title="(.*?)">', div)[0]
            comments['comment'] = re.findall('<span class="short">(.*?)</span>', div)[0]
            comments_list.append(comments)
        except IndexError:
            # Skip comment blocks that are missing one of the fields
            continue
    # print(f'Crawled {i + len(comments_list)} comments so far')
    return comments_list
```
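The four regexes can be checked against a hand-written `str(div)` string. The class names below mirror Douban's markup, but the values are invented for illustration:

```python
import re

# A stand-in for one str(div) produced by the crawler
div = ('<div class="comment">'
       '<span class="votes">123</span>'
       '<span class="comment-time" title="2020-07-03 10:00:00"></span>'
       '<span class="allstar50 rating"></span>'
       '<span class="short">Worth watching</span>'
       '</div>')

print(re.findall('<span class="votes">(.*?)</span>', div)[0])           # 123
print(re.findall('<span class="allstar(.*?) rating"', div)[0])          # 50
print(re.findall('<span class="comment-time" title="(.*?)">', div)[0])  # 2020-07-03 10:00:00
print(re.findall('<span class="short">(.*?)</span>', div)[0])           # Worth watching
```

Note that `[0]` raises `IndexError` when a field is absent (e.g. an unrated comment has no `allstar` span), which is exactly the case the `try`/`except` skips.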
4.3.4 Extracting comments from the three page types
```python
# Comment types: positive (h), neutral (m), negative (l)
PERCENT_TYPE = ['h', 'm', 'l']

if __name__ == '__main__':
    # Collect all comment records here
    COMMENTS = []
    # Outer loop over the comment types
    for percent_type in PERCENT_TYPE:
        # Douban only exposes about 220 comments per type; here we fetch 60
        for i in range(0, 60, 20):
            URL = BASE_URL.format(i, percent_type)
            comments_divs = get_html_comments_divs(URL, headers)
            comments_list = get_comment(comments_divs, i, percent_type)
            COMMENTS.extend(comments_list)
            # Sleep a random 1-5 seconds between requests; optional,
            # but it lowers the risk of being blocked
            sleep_time = random.uniform(1, 5)
            print(f"Sleeping for {sleep_time:.1f} seconds")
            time.sleep(sleep_time)
```
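The random delay in the loop above can be factored into a small helper; `polite_sleep` is a name introduced here for illustration, not part of the original script:

```python
import random
import time

def polite_sleep(low=1.0, high=5.0):
    """Sleep a random interval to space out requests; returns the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Randomizing the interval makes the request timing look less machine-like than a fixed `time.sleep(3)`.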
4.3.5 Saving the data
```python
import pandas as pd

data = pd.DataFrame(COMMENTS)
data.to_excel(r"C:\Users\23691\AnacondaProjects\《Python网络数据采集\comments_douban.xlsx")
data.head()
```
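Note that `to_excel` requires an Excel writer engine such as openpyxl to be installed; `to_csv` is a dependency-free alternative. A quick offline check of the DataFrame construction, using invented sample records with the same shape as the crawler's output:

```python
import pandas as pd

records = [
    {'type': 'h', 'votes': '123', 'rating': '50',
     'date': '2020-07-03 10:00:00', 'comment': 'Worth watching'},
    {'type': 'l', 'votes': '7', 'rating': '10',
     'date': '2020-07-04 09:30:00', 'comment': 'Disappointing'},
]
df = pd.DataFrame(records)
# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel
df.to_csv('comments_douban.csv', index=False, encoding='utf-8-sig')
print(df.shape)  # (2, 5)
```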