Scraping Ctrip scenic-spot user comments with Python (how to handle pagination when the URL does not change)
A few days ago I wanted to scrape user comments for scenic spots on Ctrip, but I found that the page URL never changes as you page through the comments, so we cannot simply fetch the other pages with requests.get(). After some research I learned that the content is loaded via Ajax. I won't explain Ajax page loading in detail here; this post only covers how to scrape such a site, using the Yellow Crane Tower (黄鹤楼) scenic area as the example. Ctrip Yellow Crane Tower page link
How an Ajax-loaded page differs from an ordinary page
Below are screenshots of the first and second pages of user comments.
As you can see, no matter which page of comments you are on, the address stays https://piao.ctrip.com/ticket/dest/t8979.html. This is the difference between an Ajax-loaded page and an ordinary one, and it is why the simple approach no longer works.
Solution
- Open the browser's developer tools
First open the page's developer tools (inspect element) and switch to the Network tab to watch what is transferred. Click the red clear button in the top-left corner and refresh the page to remove noise, then look at the circled area in the figure below:
Once we have found these parameters, we only need to include them when our crawler simulates the HTTP request.
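To see what actually varies between pages, here is a minimal sketch that rebuilds the captured request body as a function of the page number (field values are taken from the request captured above; everything except pagenum stays fixed):

```python
import json

def build_payload(pagenum):
    """Request body mirroring the captured Ctrip comment request.

    Every field is constant except 'pagenum', which selects the page.
    """
    return {
        "pageid": "10650000804",
        "viewid": "8979",        # scenic-spot id seen in the captured request
        "tagid": "0",
        "pagenum": str(pagenum),
        "pagesize": "50",
        "contentType": "json",
        "SortType": "1",
    }

# Compare two pages: the only differing key is 'pagenum'
p1, p2 = build_payload(1), build_payload(2)
changed = {k for k in p1 if p1[k] != p2[k]}

# The body actually sent over the wire is the JSON-serialized dict
body = json.dumps(build_payload(2))
```

This is why the crawler below can loop over pages by changing a single field in the POST body while the browser URL never moves.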
- Writing the code
Besides scraping 500 user comments, the code below also collects each commenter's username, comment time, and score, and writes everything to a CSV file.
```python
import re
import requests
import json
import time
import csv

# Open the output CSV and write the header row
c = open(r'D:\xiecheng.csv', 'a+', newline='', encoding='utf-8')
fieldnames = ['user', 'time', 'score', 'content']
writer = csv.DictWriter(c, fieldnames=fieldnames)
writer.writeheader()

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
postUrl = "https://sec-m.ctrip.com/restapi/soa2/12530/json/viewCommentList"
jingqu = '黄鹤楼'  # scenic spot being scraped (Yellow Crane Tower)

def make_payload(pagenum):
    """Build the request body for one page of comments.

    All values were captured in the browser's Network tab; only
    'pagenum' changes from page to page.
    """
    return {
        "pageid": "10650000804",
        "viewid": "8979",
        "tagid": "0",
        "pagenum": str(pagenum),
        "pagesize": "50",
        "contentType": "json",
        "SortType": "1",
        "head": {
            "appid": "100013776",
            "cid": "09031037211035410190",
            "ctok": "",
            "cver": "1.0",
            "lang": "01",
            "sid": "8888",
            "syscode": "09",
            "auth": "",
            "extension": [
                {
                    "name": "protocal",
                    "value": "https"
                }
            ]
        },
        "ver": "7.10.3.0319180000"
    }

# First request: find out how many pages of comments there are
html = requests.post(postUrl, data=json.dumps(make_payload(1)), headers=head).text
html = json.loads(html)
pages = html['data']['totalpage']

# Fetch only the first 10 pages (10 pages x 50 comments = 500 comments)
for j in range(min(pages, 10)):
    payload = make_payload(j + 1)
    print('Fetching page ' + payload['pagenum'])
    time.sleep(3)  # be polite: pause between requests
    html1 = requests.post(postUrl, data=json.dumps(payload), headers=head).text
    html1 = json.loads(html1)
    comments = html1['data']['comments']
    for i in comments:
        user = i['uid']
        time1 = i['date']
        score = i['score']
        content = i['content']
        content = re.sub(" ", "", content)   # strip spaces
        content = re.sub("\n", "", content)  # strip line breaks
        writer.writerow({'user': user, 'time': time1, 'score': score, 'content': content})

c.close()
```
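The per-comment parsing and cleaning logic can be checked offline against a hand-built response with the same shape as the fields the scraper reads (data → comments → uid/date/score/content). The sample below is illustrative, not a real API response:

```python
import re

# A fabricated response in the shape the crawler expects
sample = {
    "data": {
        "comments": [
            {"uid": "traveler01", "date": "2018-05-01", "score": 5,
             "content": "风景 很美\n值得一去"},
        ]
    }
}

rows = []
for i in sample["data"]["comments"]:
    content = re.sub(" ", "", i["content"])   # strip spaces
    content = re.sub("\n", "", content)       # strip line breaks
    rows.append({"user": i["uid"], "time": i["date"],
                 "score": i["score"], "content": content})
# rows now holds one cleaned record per comment, ready for csv.DictWriter
```

Testing against a fixed dict like this lets you verify the extraction code without hitting Ctrip's servers.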
Result
The final result: