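A Selenium Shortcut for Getting Past Rui Shu (瑞数) JS Protection
Target site: Hubei Provincial Department of Ecology and Environment: http://sthjt.hubei.gov.cn/site/sthjt/search.html?searchWord=%E7%A2%B3%E6%8E%92%E6%94%BE&siteId=41&pageSize=10
Preface: Rui Shu (瑞数信息) is one of the big mountains in the scraping world. Judging by the material available online, even a brute-force crack requires standing up a simple web service.
However hard it is, there is still a workaround: combine selenium with a few useful pieces of information picked up while reverse engineering the JS.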
# -*- coding: utf-8 -*-
"""
# @Time : 2021/8/13 9:19
# @Author : ChenLvLei
# @Email : 2516455367@qq.com
# @FileName : hubeisheng
# @Description :http://sthjt.hubei.gov.cn/site/sthjt/search.html?searchWord=%E7%A2%B3%E6%8E%92%E6%94%BE&siteId=41&pageSize=10
# code is far away from bugs with the god animal protecting
I love animals. They taste delicious.
┏┓ ┏┓
┏┛┻━━━┛┻┓
┃ ☃ ┃
┃ ┳┛ ┗┳ ┃
┃ ┻ ┃
┗━┓ ┏━┛
┃ ┗━━━┓
┃ 神兽保佑 ┣┓
┃ 永无BUG! ┏┛
┗┓┓┏━┳┓┏┛
┃┫┫ ┃┫┫
┗┻┛ ┗┻┛
"""
import sys
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'sthjt.hubei.gov.cn',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
def get_cookies():
    """Load the page in a real Chrome so the Rui Shu JS runs, then return the cookies as a dict."""
    option = Options()
    option.add_argument("--incognito")  # incognito mode
    # option.add_argument('--headless')  # headless mode
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # executable_path is the Selenium 3 style; Selenium 4 passes the driver path via a Service object
    driver = webdriver.Chrome(executable_path="./chromedriver.exe", options=option)
    try:
        # Hide navigator.webdriver before any page script gets a chance to read it
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
                Object.defineProperty(navigator, 'webdriver', {
                    get: () => undefined
                })
            """
        })
        driver.get('灰色链接...已经隐藏')  # URL hidden by the author in the original post
        # driver.maximize_window()
        # Key the cookies (uuid, FSSBBIl1UgzbN7N80T, FSSBBIl1UgzbN7N80S, token, JSESSIONID)
        # by name instead of by list position, since the order of get_cookies() is not guaranteed
        cookie = {c['name']: c['value'] for c in driver.get_cookies()}
    finally:
        driver.quit()  # close the browser and stop the chromedriver process
    return cookie
cookie = get_cookies()
url = '灰色链接...已经隐藏'  # URL hidden by the author; presumably contains a {} placeholder for the page
lis = list(range(1, 21))
for page, n in enumerate(lis):
    # Note: the original formats the 0-based index `page` into the URL, not `n`
    response = requests.get(url.format(page), headers=headers, cookies=cookie)
    if response.status_code != 200:
        # Cookies rejected or expired: get fresh ones through selenium and retry once
        cookie = get_cookies()
        response = requests.get(url.format(page), headers=headers, cookies=cookie)
    print(response.text)
    # Progress bar: percentage of pages fetched so far
    x = 100
    y = (page + 1) / len(lis)
    done = int(x * y)
    sys.stdout.write("\r[%s%s] %d%%" % ('█' * done, ' ' * (100 - done), x * y) + '\n')
    sys.stdout.flush()
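The refresh-on-failure idea in the loop above can be pulled out into a small helper. This is only a sketch: `fetch_with_refresh`, its parameters, and the `(status, body)` return shape are hypothetical names chosen for illustration, and the network call and the selenium call are passed in as plain functions so the logic can be exercised without a browser.

```python
from typing import Callable, Dict, Tuple

Cookies = Dict[str, str]


def fetch_with_refresh(
    fetch: Callable[[Cookies], Tuple[int, str]],
    refresh: Callable[[], Cookies],
    cookies: Cookies,
    max_refreshes: int = 2,
) -> Tuple[str, Cookies]:
    """Call fetch(cookies); on a non-200 status, refresh the cookies and retry.

    Returns the response body together with the cookies that worked,
    so the caller can reuse them for the next page.
    """
    for attempt in range(max_refreshes + 1):
        status, body = fetch(cookies)
        if status == 200:
            return body, cookies
        if attempt < max_refreshes:
            # In the script above this would be get_cookies()
            cookies = refresh()
    raise RuntimeError("still rejected after %d cookie refreshes" % max_refreshes)
```

In the script, `fetch` would wrap the `requests.get(...)` call and return `(response.status_code, response.text)`, while `refresh` would be `get_cookies`. Capping the number of refreshes avoids launching Chrome in an endless loop when the site rejects requests for a reason unrelated to cookies.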
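Results:
Anyone who has read this far is probably wondering: the data selenium loads is clearly HTML5, so why is my output JSON? People who think it through are rarely short of solutions.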
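A quick way to see which kind of body a request actually returned is to try parsing it as JSON and fall back to treating it as HTML. The sketch below is an illustration only: `classify_body` is a hypothetical helper, and since the request URL is hidden in the original, nothing here is specific to the real endpoint.

```python
import json
from typing import Any, Tuple


def classify_body(text: str) -> Tuple[str, Any]:
    """Return ("json", parsed_object) if the body parses as JSON, else ("html", text)."""
    stripped = text.lstrip()
    # Only attempt a parse when the body looks like a JSON object or array
    if stripped.startswith(("{", "[")):
        try:
            return "json", json.loads(stripped)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return "html", text
```

Calling `classify_body(response.text)` inside the loop would show at a glance whether a page came back as data or as a challenge page.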
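Summary: scraping the data itself with selenium is not recommended, but selenium can be used to obtain the valuable encrypted parameters.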
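Since the whole approach depends on selenium handing back the right cookies, it can help to verify them before issuing any requests. The following is a minimal sketch: `cookies_by_name`, `missing_cookies`, and the `REQUIRED` tuple are hypothetical names, and treating the two `FSSBBIl1UgzbN7N80*` cookies from the script as the required ones is an assumption. The input shape (a list of dicts with `name` and `value` keys) is what selenium's `driver.get_cookies()` returns.

```python
from typing import Dict, Iterable, List

# Assumption: these two cookies from the script carry the Rui Shu parameters
REQUIRED = ("FSSBBIl1UgzbN7N80S", "FSSBBIl1UgzbN7N80T")


def cookies_by_name(selenium_cookies: Iterable[dict]) -> Dict[str, str]:
    """Turn selenium's list of cookie dicts into a name -> value mapping."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def missing_cookies(cookies: Dict[str, str], required: Iterable[str] = REQUIRED) -> List[str]:
    """List the required cookie names that are absent or empty."""
    return [name for name in required if not cookies.get(name)]
```

If `missing_cookies(...)` returns a non-empty list, the page's JS probably had not finished setting its cookies when they were read, and reloading before retrying is likely cheaper than sending requests that will be rejected.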
For learning and discussion only: