Scraping Suning product reviews with Python
Examples of scraping product reviews from other e-commerce sites:
https://blog.csdn.net/coffeetogether/article/details/114296159
https://blog.csdn.net/coffeetogether/article/details/114274960?spm=1001.2014.3001.5501
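This post uses a Suning home appliance as the example.
1. Find the target URL:
2. Inspect the response:
3. Parse the data:
Note: the JSON in the response is wrapped in extra text (the reviewList( callback prefix and the closing parenthesis at the end), which must be removed before the data can be parsed. The code removes it with a regular expression.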
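A minimal sketch of that regex step, run on a made-up response so it works without a network connection. The helper name strip_jsonp and the sample payload are assumptions for illustration; only the reviewList callback name and the commodityReviews / content keys come from this post. The match is greedy and anchored at the end of the text, so a ")" inside a comment does not cut the JSON short.

```python
import json
import re


def strip_jsonp(text, callback="reviewList"):
    """Return the JSON text inside callback(...), without the wrapper.

    strip_jsonp is a hypothetical helper name, not part of any library.
    """
    # Greedy (.*) plus the end-of-text anchor keeps everything up to the
    # LAST ")" in the response, even if comments contain parentheses.
    pattern = r"^\s*" + re.escape(callback) + r"\((.*)\)\s*;?\s*$"
    match = re.search(pattern, text, re.S)
    if match is None:
        raise ValueError("response is not wrapped in " + callback + "(...)")
    return match.group(1)


# Made-up response body; the comment text deliberately contains ")".
sample = 'reviewList({"commodityReviews": [{"content": "Great (recommended)"}]})'
data = json.loads(strip_jsonp(sample))
print(data["commodityReviews"][0]["content"])
```

A non-greedy pattern such as (.*?) would stop at the first ")" it meets, which is why the sketch uses the greedy form.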
4. Find the pagination pattern:
http://review.suning.com/ajax/cluster_review_lists/cluster-37502374-000000012031487720-0000000000-total-1-default-10-----reviewList.htm?callback=reviewList
http://review.suning.com/ajax/cluster_review_lists/cluster-37502374-000000012031487720-0000000000-total-2-default-10-----reviewList.htm?callback=reviewList
http://review.suning.com/ajax/cluster_review_lists/cluster-37502374-000000012031487720-0000000000-total-3-default-10-----reviewList.htm?callback=reviewList
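Comparing the URLs shows that the pages differ only in the number that follows total: 1 for the first page, 2 for the second, and so on.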
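The pattern can be sketched as a small URL builder. The template is copied from the three URLs above; the names URL_TEMPLATE and page_urls are made up for this illustration.

```python
# Template copied from the URLs above; only the page number after
# "total-" changes from page to page.
URL_TEMPLATE = (
    "http://review.suning.com/ajax/cluster_review_lists/"
    "cluster-37502374-000000012031487720-0000000000-total-{page}"
    "-default-10-----reviewList.htm?callback=reviewList"
)


def page_urls(pages):
    """Return the review URLs for pages 1..pages (hypothetical helper)."""
    return [URL_TEMPLATE.format(page=n) for n in range(1, pages + 1)]


for url in page_urls(3):
    print(url)
```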
With the analysis done, here is the code:
import json
import re

import jsonpath
import requests

if __name__ == '__main__':
    # Ask for the number of pages to scrape
    pages = int(input('Number of pages to scrape: '))
    # Request headers: send a browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    }
    # Loop over the pages; page numbers start at 1
    for page in range(1, pages + 1):
        # Target URL for this page
        url_ = f'http://review.suning.com/ajax/cluster_review_lists/cluster-37502374-000000012031487720-0000000000-total-{page}-default-10-----reviewList.htm?callback=reviewList'
        # Send the request and get the response
        response = requests.get(url_, headers=headers)
        # Strip the reviewList(...) wrapper with a regex. The match is greedy
        # so that a ")" inside a comment does not truncate the JSON.
        str_data = re.findall(r'reviewList\((.*)\)', response.text, re.S)[0]
        # Convert the JSON text to Python data
        py_data = json.loads(str_data)
        # Extract the user nicknames and the comments
        # (jsonpath returns False when nothing matches, hence the "or []")
        id_list = jsonpath.jsonpath(py_data, '$..nickName') or []
        comment_list = jsonpath.jsonpath(py_data, '$.commodityReviews[*].content') or []
        # Save each nickname/comment pair as a one-entry dict, one per line
        with open('翻页苏宁商品评论.json', 'a', encoding='utf-8') as f:
            for nick, comment in zip(id_list, comment_list):
                dict_ = {nick: comment}
                json_data = json.dumps(dict_, ensure_ascii=False) + ',\n'
                f.write(json_data)
The run shown here scraped three pages.
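One thing to watch in the script: the nicknames and the comments are collected by two separate jsonpath queries and then paired by position, so a review without a nickName would shift every later pair. A possible alternative, sketched below without jsonpath, walks each review on its own and writes one JSON object per line. The sample data, the userInfo key, and the helper names are assumptions for illustration; only commodityReviews, content and nickName come from the code above.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere inside node, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_pairs(py_data):
    """Pair each review's comment with the nickname found inside that same review."""
    pairs = []
    for review in py_data.get("commodityReviews") or []:
        pairs.append({
            "nickName": find_key(review, "nickName"),
            "content": review.get("content"),
        })
    return pairs


def to_json_lines(pairs):
    """Serialize the pairs as JSON Lines: one valid JSON object per line."""
    return "".join(json.dumps(p, ensure_ascii=False) + "\n" for p in pairs)


# Made-up data; "userInfo" is a hypothetical key, not confirmed by the real API.
sample = {"commodityReviews": [
    {"content": "Works well", "userInfo": {"nickName": "a***1"}},
    {"content": "Arrived late", "userInfo": {}},
]}
pairs = extract_pairs(sample)
print(to_json_lines(pairs), end="")
```

Each output line can be read back with json.loads, which is not the case for lines that end in a trailing comma.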