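Note: the website in this example is used purely as a learning case. No individual or group may use it for profit.
Scenario
Scraping the content of a web page with Python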
aHR0cHM6Ly9jcmVkaXRiai5qeGouYmVpamluZy5nb3YuY24vY3JlZGl0LXBvcnRhbC9jcmVkaXRfc2VydmljZS9wdWJsaWNpdHkvcmVjb3JkL2JsYWNr
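The address above looks like a Base64-encoded URL (the note does not say so explicitly, so treat that as an assumption). A minimal sketch of decoding it with the standard library, where the names `ENCODED_URL`, `decoded_url` and `parts` are only illustrative:

```python
import base64
from urllib.parse import urlsplit

# The encoded address exactly as it appears in the note above.
ENCODED_URL = (
    "aHR0cHM6Ly9jcmVkaXRiai5qeGouYmVpamluZy5nb3YuY24v"
    "Y3JlZGl0LXBvcnRhbC9jcmVkaXRfc2VydmljZS9wdWJsaWNpdHkvcmVjb3JkL2JsYWNr"
)

# Decode the Base64 text back into a normal URL string.
decoded_url = base64.b64decode(ENCODED_URL).decode("utf-8")

# Split the URL to check that the host matches the one in the error message.
parts = urlsplit(decoded_url)
print(parts.scheme, parts.hostname)
```

The host printed here should be the same one that appears in the error message below.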
Error message
requests.exceptions.SSLError: HTTPSConnectionPool(host='creditbj.jxj.beijing.gov.cn', port=443): Max retries exceeded with url: /credit-portal/api/publicity/record/BLACK/0 (Caused by SSLError(SSLError("bad handshake: Error([('elliptic curve routines', 'ecx_key_op', 'invalid encoding'), ('SSL routines', 'tls_process_ske_ecdhe', 'bad ecpoint')],)",),))
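Tracing the problem
With a plain requests call, the request always fails with the error above. At first I suspected incomplete headers, an unstable proxy, or the network and other external factors, but the error persisted even with all of those in order.
After a lot of searching I learned that this error is a symptom of JA3 TLS fingerprint anti-scraping. My understanding is limited to the surface level, so please look into the deeper details yourself.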
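For readers who have not met JA3 before, here is a rough sketch of how a JA3 fingerprint is formed: five fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats) are written as decimal numbers, joined with `-` inside a field and `,` between fields, and the resulting string is hashed with MD5. The function name `ja3_hash` and the sample values are made up for illustration; they are not captured from any real client or from the site in this note.

```python
import hashlib


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3 string from ClientHello fields and return (string, md5 hex)."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, chosen only to show the format.
ja3_string, digest = ja3_hash(771, [4865, 4866, 49195], [0, 10, 11], [29, 23], [0])
print(ja3_string)
print(digest)

# The same cipher suites in a different order give a different fingerprint,
# which is why a Python client and a browser can be told apart.
_, other_digest = ja3_hash(771, [49195, 4865, 4866], [0, 10, 11], [29, 23], [0])
print(digest != other_digest)
```

Because the fingerprint depends on what the TLS library sends during the handshake, changing headers or proxies does not alter it, which is consistent with the behaviour described above.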
Solution
Use the curl_cffi library
curl_cffi: a Python library with native support for impersonating a browser's TLS/JA3 fingerprint (Python 3.7 or later recommended)
import json

from curl_cffi import requests as requests1


def get_req(url, headers, proxies, method, data=None):
    # The impersonate argument selects which browser to imitate
    s = requests1.Session()
    if method.lower() in ["payload"]:
        # PAYLOAD: send the parameters as a JSON string in the request body
        res = s.post(url=url, headers=headers, data=json.dumps(data), verify=False, proxies=proxies,
                     impersonate="chrome101")
    elif method.lower() in ["post"]:
        # POST: send the parameters as form data
        res = s.post(url=url, headers=headers, data=data, verify=False, proxies=proxies,
                     impersonate="chrome101")
    else:
        res = s.get(url=url, headers=headers, verify=False, proxies=proxies, impersonate="chrome101")
    res.encoding = 'utf-8'
    return res


if __name__ == '__main__':
    url = "https://XXXX/ZXXX"
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Content-Type": "application/json",
        "Referer": "https://XXXXXXXXXXXX",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    }
    proxies = None  # proxy settings; None means no proxy
    method = "GET"  # request method: GET, POST or PAYLOAD
    data = {}  # request parameters, optional
    res = get_req(url, headers, proxies, method, data=data)
    print(res.status_code)
    print(res.text)
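The three values accepted by the `method` argument of `get_req` differ only in how the request body is prepared. The sketch below isolates that choice in a small pure function so it can be checked without any network access. `build_body` and the `page` parameter are hypothetical names used for illustration; they are not part of curl_cffi or of the original script.

```python
import json


def build_body(method, data=None):
    """Return (http_verb, body) for the GET / POST / PAYLOAD modes."""
    mode = method.lower()
    if mode == "payload":
        # JSON body: serialize the parameters to a string.
        return "POST", json.dumps(data if data is not None else {})
    if mode == "post":
        # Form body: pass the dict through unchanged.
        return "POST", data
    # Anything else is treated as a GET with no body.
    return "GET", None


print(build_body("PAYLOAD", {"page": 1}))  # ('POST', '{"page": 1}')
print(build_body("POST", {"page": 1}))     # ('POST', {'page': 1})
print(build_body("GET"))                   # ('GET', None)
```

When using PAYLOAD, the `Content-Type: application/json` header in the example matches the JSON body; for a form POST that header would normally need to be adjusted.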
This is recorded only as a personal note. If anything is wrong, corrections from more experienced readers are welcome.