Website
http://www.ccgp-anhui.gov.cn/ZcyAnnouncement/ZcyAnnouncement2/index.html
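Analysis
- List page
API: http://www.ccgp-anhui.gov.cn/front/search/category
Request method: POST
Request parameters: {"leaf":"0","categoryCode":"ZcyAnnouncement2","pageSize":15,"pageNo":1}
leaf: defaults to 0
categoryCode: selects which category's list data is returned
pageSize: number of items per page
pageNo: current page number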
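
The list-page parameters above can be turned into a request roughly as follows. This is a minimal sketch, not the author's code: it assumes the endpoint accepts the parameters as a JSON body and answers with JSON (the post only shows the parameter object), and the helper names `build_list_payload`, `build_list_request` and `fetch_list_page` are made up for illustration. The standard library's `urllib` is used so the sketch has no dependencies.

```python
import json
import urllib.request

LIST_API = "http://www.ccgp-anhui.gov.cn/front/search/category"


def build_list_payload(page_no, category_code="ZcyAnnouncement2", page_size=15):
    """Build the parameter object described in the post for one list page."""
    return {
        "leaf": "0",
        "categoryCode": category_code,
        "pageSize": page_size,
        "pageNo": page_no,
    }


def build_list_request(page_no, **kwargs):
    """Wrap the payload in a POST request (JSON body is an assumption)."""
    body = json.dumps(build_list_payload(page_no, **kwargs)).encode("utf-8")
    return urllib.request.Request(
        LIST_API,
        data=body,
        headers={"Content-Type": "application/json", "User-Agent": "Mozilla/5.0"},
        method="POST",
    )


def fetch_list_page(page_no, timeout=10, **kwargs):
    """Send the request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(build_list_request(page_no, **kwargs), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Paging is then a matter of calling `fetch_list_page(1)`, `fetch_list_page(2)`, and so on; the structure of the returned data is not documented in the post, so inspect it before relying on any field.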
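- Detail page
When no cookie is present, requesting a detail page automatically triggers three requests.
**First request:** returns a piece of obfuscated JS code together with a Set-Cookie header from the server. The obfuscated JS executes itself and sends another request.
**Second request:** only succeeds when it carries the cookie obtained from the first request, and it returns a 302 status code that redirects.
**Third request:** this one returns the actual detail page. The detail content is stored in the `value` attribute of the `<input name="articleDetail">` tag.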
![image.png](https://img-blog.csdnimg.cn/img_convert/c9000f1281dee4caa47b80dd978be43e.png)
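
Once the real detail page has been fetched, the content has to be pulled out of that `articleDetail` input. The author's script further down does this with an lxml XPath; the sketch below shows the same idea with only the standard library, in case lxml is not available. The names `_ArticleDetailParser` and `extract_article_content` are hypothetical, and the `content` key is taken from the author's script.

```python
import json
from html.parser import HTMLParser


class _ArticleDetailParser(HTMLParser):
    """Remember the value attribute of the first <input name="articleDetail">."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == "articleDetail" and self.value is None:
            # HTMLParser has already unescaped entities such as &quot;
            self.value = attr_map.get("value")


def extract_article_content(page_html):
    """Return the 'content' field of the embedded JSON, or None if it is absent."""
    parser = _ArticleDetailParser()
    parser.feed(page_html)
    if parser.value is None:
        # Probably the obfuscated challenge page rather than the real detail page.
        return None
    return json.loads(parser.value).get("content")
```

A `None` result is a useful signal that the cookie was rejected and the challenge page came back instead of the article.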
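Reverse-engineering analysis
The cookie obtained from the second request has a limited lifetime and must be requested again after a few minutes, but while it is valid it can access every detail page. Taking advantage of this, we only need to imitate a browser repeatedly refreshing a single page to renew the cookie, and can then crawl the whole site's data.
In testing, requests must not be sent too frequently: too many trigger a CAPTCHA and an IP ban, and ordinary proxies do not get around this.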
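
One way to act on the cookie's limited lifetime is to cache it and refetch only once it is old enough. The sketch below is an illustration under assumptions: the class name `CookieCache` is invented, and the 120-second default is a guess, since the post only says the cookie lasts a few minutes. The fetch function is passed in, so the author's `get_wzws_sid` can be plugged in unchanged.

```python
import time


class CookieCache:
    """Cache a cookie value and refetch it once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=120.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning a fresh cookie value
        self._ttl = ttl_seconds      # assumed lifetime, tune by observation
        self._clock = clock          # injectable for testing
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self):
        """Force a refetch on the next get(), e.g. after a rejected request."""
        self._value = None
```

Usage would look like `cache = CookieCache(get_wzws_sid)` followed by `cookies["wzws_sid"] = cache.get()` before each detail request, calling `cache.invalidate()` whenever a response turns out to be the challenge page.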
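
Because overly frequent requests lead to a CAPTCHA and an IP ban, it helps to enforce a minimum gap between requests. The following is a small sketch of such a throttle. The `Throttle` class and its default interval and jitter are assumptions for illustration; the post does not say what request rate the site tolerates.

```python
import random
import time


class Throttle:
    """Block until at least min_interval (plus random jitter) has passed."""

    def __init__(self, min_interval=3.0, jitter=2.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
        # Schedule the next slot relative to the moment this request starts.
        self._next_allowed = self._clock() + self._min_interval + random.uniform(0, self._jitter)
        return delay
```

Calling `throttle.wait()` immediately before every `requests.get(...)` keeps the crawl rate bounded; the right interval has to be found by experiment.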
Python code
import json
import requests
from lxml.html import fromstring,tostring


def get_wzws_sid():
    # Fetch the wzws_sid cookie
    headers = {
        'Host': 'www.ccgp-anhui.gov.cn',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Referer': 'http://www.ccgp-anhui.gov.cn/ZcyAnnouncement/ZcyAnnouncement2/ZcyAnnouncement3011/K6n+No0HoHdzuaEakC0Bey0Pmsg4eSzY519ekYYKIW8=.html',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Cookie': "wzws_sessionid=gDE0LjExMS45MS4yM4FhODE4YWKCNDY1N2JhoGO1NZ8=",
        'Connection': 'keep-alive'
    }
    # Do not follow the 302: the cookie we need is set on this response
    response = requests.get(
        'http://www.ccgp-anhui.gov.cn/CSPDREL1pjeUFubm91bmNlbWVudC9aY3lBbm5vdW5jZW1lbnQyL1pjeUFubm91bmNlbWVudDMwMTEvSzZuK05vMEhvSGR6dWFFYWtDMEJleTBQbXNnNGVTelk1MTlla1lZS0lXOD0uaHRtbA==?wzwscspd=MC4wLjAuMA==',
        headers=headers, allow_redirects=False)
    set_cookie = response.headers.get("Set-Cookie")
    if not set_cookie:
        raise RuntimeError("no Set-Cookie header in the response")
    # Example header value:
    # wzws_sid=8b22f52c1dab859517b2d438acf04a7ba07b3e51d1961e0068daf016f9216c38476a32312d305fc536c451a8655495ea6769eefe3259a6640db9cdbea9ee0939a9c7ba581c6d22a00f3498965f34f674
    # Keep only the value of the first cookie (split once so '=' in the value survives)
    return set_cookie.split(";")[0].split("=", 1)[1]


cookies = {
    'wzws_sessionid': 'gjQ2NTdiYYFhODE4YWKAMTQuMTExLjkxLjIyoGO1Lf4=',
    'wzws_sid': get_wzws_sid(),
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.ccgp-anhui.gov.cn/ZcyAnnouncement/ZcyAnnouncement2/ZcyAnnouncement3011/K6n+No0HoHdzuaEakC0Bey0Pmsg4eSzY519ekYYKIW8=.html',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
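response = requests.get(
    'http://www.ccgp-anhui.gov.cn/ZcyAnnouncement/ZcyAnnouncement2/ZcyAnnouncement3002/js7GPLBiMyVoKEX8siQqlsky/k6QuxfU2xlnt5u8ifQ=.html',
    headers=headers, cookies=cookies, verify=False)
response.encoding = "utf-8"
html = fromstring(response.text)
# The detail JSON is stored in the value attribute of <input name="articleDetail">
detail_elements = html.xpath('//input[@name="articleDetail"]/@value')
if not detail_elements:
    raise RuntimeError("articleDetail input not found; the cookie may be invalid or expired")
detail_element = detail_elements[0]
detail = str(detail_element)
# detail = tostring(detail_element).decode()
data = json.loads(detail)
print(data.get("content"))
# Save the article HTML for inspection
with open("text.html", "w", encoding="utf-8") as f:
    f.write(data.get("content"))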
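
A note on the hard-coded challenge URL used in `get_wzws_sid`: decoding it suggests that the part after `/CSPDRE` is simply the Base64-encoded path of the detail page, and that the `wzwscspd` parameter is Base64 for `0.0.0.0`. This is an observation drawn from that single example URL, not something the post states, so treat the format as an assumption; the meaning of the `wzwscspd` value is unknown. The helper name `decode_challenge_url` is hypothetical.

```python
import base64
from urllib.parse import parse_qs, urlsplit

CHALLENGE_PREFIX = "/CSPDRE"  # assumption: seen in one example URL only

EXAMPLE_URL = (
    "http://www.ccgp-anhui.gov.cn/CSPDREL1pjeUFubm91bmNlbWVudC9aY3lBbm5vdW5jZW1lbnQy"
    "L1pjeUFubm91bmNlbWVudDMwMTEvSzZuK05vMEhvSGR6dWFFYWtDMEJleTBQbXNnNGVTelk1MTlla1lZ"
    "S0lXOD0uaHRtbA==?wzwscspd=MC4wLjAuMA=="
)


def decode_challenge_url(url):
    """Return (detail_path, wzwscspd_value) decoded from a challenge URL."""
    parts = urlsplit(url)
    if not parts.path.startswith(CHALLENGE_PREFIX):
        raise ValueError("not a challenge URL of the assumed form")
    encoded_path = parts.path[len(CHALLENGE_PREFIX):]
    detail_path = base64.b64decode(encoded_path).decode("utf-8")
    token = parse_qs(parts.query).get("wzwscspd", [""])[0]
    token_value = base64.b64decode(token).decode("utf-8")
    return detail_path, token_value


if __name__ == "__main__":
    print(decode_challenge_url(EXAMPLE_URL))
```

If the pattern holds for other pages, the challenge URL for any detail page could be built by Base64-encoding its path instead of copying it from the browser, but that would need to be confirmed against the live site.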