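Preface
This article walks through one case study: scraping the comments on a course page of a website.
1. Scraping a single page
Target URL: https://ke.qq.com/course/380991/12573838881968191?tuin=7265bf35#term_id=100454125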
import jsonpath
import requests

if __name__ == '__main__':
    # 1. Target URL: the comment-list API endpoint, not the course page itself
    url_ = 'https://ke.qq.com/cgi-bin/comment_new/course_comment_list?cid=380991&filter_rating=0&page=0&bkn=&r=0.9592'
    # 2. Request headers
    headers_ = {
        # Referer pointing at the course page; the API rejects requests without it
        'Referer': 'https://ke.qq.com/course/380991/12573838881968191?tuin=7265bf35',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
    }
    response_ = requests.get(url_, headers=headers_)
    py_data = response_.json()
    # print(py_data)
    # 3. Parse the data. jsonpath returns False when nothing matches,
    #    so fall back to an empty list to keep the loop below safe.
    name_list = jsonpath.jsonpath(py_data, '$..nick_name') or []
    print(name_list)
    comment_list = jsonpath.jsonpath(py_data, '$..first_comment') or []
    print(comment_list)
    # Pair each nickname with its comment
    for name, comment in zip(name_list, comment_list):
        dict_ = {name: comment}
        print(dict_)
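To make it clearer what the `$..nick_name` expression is doing, here is a rough standard-library sketch of the same "recursive descent" idea: walk the whole structure and collect every value stored under a given key, at any depth. The `find_all` helper and the `sample` payload are made up for illustration; the real response layout is not shown in this article. Note one difference: this helper returns an empty list when nothing matches, whereas the jsonpath package returns False.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending, the key may appear again deeper down
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical payload, only for illustration
sample = {
    "result": {
        "items": [
            {"nick_name": "alice", "first_comment": "Great course"},
            {"nick_name": "bob", "first_comment": "Very clear"},
        ]
    }
}

names = find_all(sample, "nick_name")
comments = find_all(sample, "first_comment")
print(dict(zip(names, comments)))
```

In the real scraper the jsonpath call is shorter to write; the point here is only that `$..` means "search every level", which is why we do not need to know exactly where the comments sit in the response.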
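Error handling: {"msg":"refer错误","type":1,"retcode":100101}
Fix: this response means the Referer check failed ("refer错误" means "referer error"). It is an anti-scraping measure we had not run into before. Adding a Referer header that points at the course page is enough for the request to succeed.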
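Since the server answers with a normal JSON body even when it refuses the request, it can help to check the payload before parsing it. Below is a minimal sketch. `ensure_ok` is a hypothetical helper, and it assumes that a `retcode` of 0 (or a missing `retcode`) means success, which this article does not confirm.

```python
import json


def ensure_ok(payload):
    """Raise if the API payload reports an error; otherwise return it unchanged."""
    # Assumption: retcode 0 (or no retcode at all) means the request succeeded
    retcode = payload.get("retcode", 0)
    if retcode != 0:
        raise RuntimeError(f"API error {retcode}: {payload.get('msg')}")
    return payload


# The error body quoted above, as the server would send it
error_body = '{"msg": "refer错误", "type": 1, "retcode": 100101}'

try:
    ensure_ok(json.loads(error_body))
except RuntimeError as exc:
    print(exc)
```

In the scraper this would be called as `py_data = ensure_ok(response_.json())`, so a missing Referer fails loudly instead of silently producing empty lists.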
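2. Pagination
When collecting data we usually need a large amount of it at once, and fetching it by hand one page at a time is very inefficient. That raises the question: can we have the code page through the comments for us?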
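The idea can be sketched as follows, assuming the only thing that changes between pages is the `page` query parameter. `build_url` is a hypothetical helper; the other parameters (`bkn`, `r`) are simply copied from the single-page URL, and what they mean is not covered here.

```python
from urllib.parse import urlencode

BASE = "https://ke.qq.com/cgi-bin/comment_new/course_comment_list"


def build_url(page, cid=380991):
    """Return the comment-list URL for a zero-based page index."""
    # Parameter values are copied from the single-page request; only `page` varies
    params = {"cid": cid, "filter_rating": 0, "page": page, "bkn": "", "r": 0.9592}
    return f"{BASE}?{urlencode(params)}"


for page in range(2):
    print(build_url(page))
```

An f-string works just as well for a URL this simple; `urlencode` mainly pays off once parameter values may contain characters that need escaping.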
import json
import time

import jsonpath
import requests

if __name__ == '__main__':
    # encoding='utf-8' so the Chinese text written with ensure_ascii=False
    # is saved correctly on every platform
    with open('腾讯评论.json', 'w', encoding='utf-8') as f:
        dict_list = []
        for page in range(2):
            # 1. Target URL; only the page parameter changes between requests
            url_ = f'https://ke.qq.com/cgi-bin/comment_new/course_comment_list?cid=380991&filter_rating=0&page={page}&bkn=&r=0.9592'
            # 2. Request headers
            headers_ = {
                # Referer pointing at the course page
                'Referer': 'https://ke.qq.com/course/380991/12573838881968191?tuin=7265bf35',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
            }
            response_ = requests.get(url_, headers=headers_)
            py_data = response_.json()
            # print(py_data)
            # 3. Parse the data (jsonpath returns False when nothing matches)
            name_list = jsonpath.jsonpath(py_data, '$..nick_name') or []
            # print(name_list)
            comment_list = jsonpath.jsonpath(py_data, '$..first_comment') or []
            # print(comment_list)
            for i in zip(name_list, comment_list):
                dict_data = {
                    '姓名': i[0],  # name
                    '内容': i[1]   # comment text
                }
                dict_list.append(dict_data)
            print(f'Page {page + 1} done, pausing for a moment~')
            # Wait between requests to avoid hammering the server
            time.sleep(2)
        # Write all pages at once as a single JSON array
        json_data = json.dumps(dict_list, ensure_ascii=False)
        f.write(json_data)
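The loop above is hard-coded to two pages with `range(2)`. When the number of pages is not known in advance, one option is to keep requesting until a page comes back empty. The sketch below shows that pattern. It assumes the API returns no comments once you go past the last page, which has not been verified here, and `iter_pages` / `fake_fetch` are hypothetical names. The real request is replaced by a stand-in so the example runs offline.

```python
def iter_pages(fetch_page, max_pages=50):
    """Yield (page, items) until a page comes back empty or max_pages is reached."""
    for page in range(max_pages):
        items = fetch_page(page)
        if not items:
            # Assumption: an empty page means there is nothing left to fetch
            break
        yield page, items


# Stand-in for the real HTTP request, so the sketch runs without network access
fake_site = {0: ["comment a", "comment b"], 1: ["comment c"]}


def fake_fetch(page):
    return fake_site.get(page, [])


collected = []
for page, items in iter_pages(fake_fetch):
    collected.extend(items)
print(collected)
```

In the real scraper, `fetch_page` would wrap the `requests.get` and jsonpath steps for one page, with the `time.sleep(2)` pause kept between calls. The `max_pages` cap is there so a misbehaving endpoint cannot keep the loop running forever.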