Using Python to get past a proxy site's anti-scraping defenses and collect free proxies in bulk

Latest recommended article published on 2023-03-25 11:04:54

羽羽羽羽羽落

Tags: Python, Python crawler, Python web crawler for beginners

Article link: https://blog.csdn.net/q122091987/article/details/89813825

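While working on a crawler for a video site recently, I came across a site that hands out large numbers of free proxies through an API. The catch is that its anti-scraping measures are fairly strict (?), so I set out to get around them.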

Requesting the API directly with requests.get returns obfuscated JS code, and the status code is 521:

>>> import requests
>>> response = requests.get("http://www.66ip.cn/mo.php?tqsl=1024", headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763",})
>>> print(response.text)
<script>var x="@@@17@chars@d@substr@return@while@fromCharCode@eval@join@firstChild@Tue@RegExp@onreadystatechange@for@2@5@if@@Array@toString@parseInt@as@1555403838@hantom@@catch@09@innerHTML@__p@addEventListener@attachEvent@@Apr@2Bn@gCZ@div@window@@@5L@new@@@@GMT@reverse@@@@@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@function@DOMContentLoaded@location@match@0xFF@replace@@challenge@String@@href@SGZ@@16@false@@cookie@Path@@@e@0xEDB88320@charAt@8@g@@@f@@Expires@length@https@@search@37@4@0@@36@@1@@a@QM@try@@@split@FWC@1500@D@setTimeout@@toLowerCase@JgSe0upZ@@captcha@19@@@18@@@@else@var@@B6hQ@charCodeAt@pathname@document@@createElement@__jsl_clearance".replace(/@*$/,"").split("@"),y="3r 42=1o(){3d('1q.23=1q.40+1q.2q.1t(/[\\?|&]3i-20/,\\'\\')',3b);41.29='44=q.4|2t|'+(1o(){3r 42=m(+[[-~[]]+[j]]),2=['%',[-~[-~{}-~{}]],'3a%16',[!/!/+[]][2t].2f(-~[-~{}-~{}]),'17',[{}+[]][2t].2f(i-~[]-~{}-~{}),'1c',[-~((-~{}<<((+!!/!/)|-~(+!!/!/))))]+(19['11'+'r'+'p']+[]+[[]][2t]).2f((+!{})),'24',[-~(+[])-~[]+2s],'35',[-~(+[])-~[]+2s]+(-~[(-~{}+[-~{}-~{}]>>-~{}-~{})+(-~{}+[-~{}-~{}]>>-~{}-~{})]+[]+[[]][2t])+(-~[(-~{}+[-~{}-~{}]>>-~{}-~{})+(-~{}+[-~{}-~{}]>>-~{}-~{})]+[]+[[]][2t]),'3t%',(-~(+!!/!/)+[]+[[]][2t]),'3c'];h(3r 38=2t;38<2.2n;38++){42.1i()[38]=2[38]};8 42.c('')})()+';2m=e, 26-15-3j u:2r:3m 1h;2a=/;'};k((1o(){36{8 !!19.12;}t(2d){8 27;}})()){41.12('1p',42,27)}3q{41.13('g',42)}",f=function(x,y){var a=0,b=0,c=0;x=x.split("");y=y||99;while((a=x.shift())&&(b=a.charCodeAt(0)-77.5))c=(Math.abs(b)<13?(b+48.5):parseInt(a,36))+y*c;return c},z=f(y.match(/\w/g).sort(function(x,y){return f(x)-f(y)}).pop());while(z++)try{eval(y.replace(/\b\w+\b/g, function(y){return x[f(y,z)-1]||("_"+y)}));break}catch(_){}</script>
>>> print(response.status_code)
521

Inspect the response headers:

>>> print(response.headers)
{'Server': 'nginx', 'Date': 'Tue, 16 Apr 2019 08:33:41 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'close, close', 'X-Via-JSL': 'b3ca7e7,-', 'Set-Cookie': '__jsluid=800e5382bd0c39f56b244d87cf2615a3; max-age=31536000; path=/; HttpOnly'}

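After searching around and piecing things together, the answer is this: the script rebuilds a string of JS source from the obfuscated strings, then evals it to run the real logic, which generates a cookie. That cookie is combined with the Set-Cookie entry from the response headers, and the page then reloads so that the server is requested with the real cookies and returns the data.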

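My first instinct was to execute the JS, but running this snippet with js2py or execjs raises an error (a hidden pitfall is involved; see the Easter egg at the end of the article), so I used selenium + ChromeDriver to obtain the cookies instead. Cookies usually have a limited lifetime, so to reduce how often the browser is launched we save the cookies once obtained and only launch the browser again when they are found to have expired. The code: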

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.66ip.cn/mo.php?tqsl=1024")
cookie = driver.get_cookies()  # list of dicts, one per cookie
driver.quit()  # end the session and the ChromeDriver process

Check the cookies we obtained:

>>> print(cookie)
[{'domain': 'www.66ip.cn', 'expiry': 1586943449.796784, 'httpOnly': True, 'name': '__jsluid', 'path': '/', 'secure': False, 'value': '73da79ccc591971704ffebff501eb26e'}, {'domain': 'www.66ip.cn', 'expiry': 1555411050, 'httpOnly': False, 'name': '__jsl_clearance', 'path': '/', 'secure': False, 'value': '1555407450.549|0|Ad6%2B78qFTS188pb2kOoKzQtjo2Y%3D'}]
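
The `expiry` fields in this output are what make the caching idea mentioned earlier workable. Below is a minimal sketch, assuming the cookie list has the shape printed above. The helper names (`cookies_still_valid`, `save_cookies`, `load_cookies`) and the 60-second safety margin are my own choices for illustration, not part of Selenium.

```python
import json
import time


def cookies_still_valid(cookies, now=None, margin=60):
    """Return True if no cookie in the list expires within `margin` seconds.

    `cookies` is a list of dicts shaped like the output of
    driver.get_cookies(); entries without an 'expiry' key (session
    cookies) are treated as valid. An empty list counts as invalid.
    """
    if not cookies:
        return False
    now = time.time() if now is None else now
    return all(c.get("expiry", float("inf")) - margin > now for c in cookies)


def save_cookies(cookies, path):
    """Persist the cookie list as JSON (the cache file is hypothetical)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cookies, fh)


def load_cookies(path):
    """Load a previously saved cookie list, or [] if the file is missing."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return []
```

A crawler could call `load_cookies` first and only start the browser when `cookies_still_valid` returns False.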

Check the cookies in the browser:

[Figure: cookies generated when the page is opened in a browser]

It is easy to see that the cookie finally used is just the name/value pairs from driver.get_cookies(). OK, now build the cookie string and test it:

import requests

# Reuse the `cookie` list captured above; the driver has already been closed.
str_cookie = ""
for data in cookie:
    str_cookie += data["name"] + "=" + data["value"] + "; "
str_cookie = str_cookie[:-2]  # drop the trailing "; "
response = requests.get("http://www.66ip.cn/mo.php?tqsl=1024", headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "cookie": str_cookie})
print(response)
# <Response [200]>

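(While reorganizing and retesting this code I noticed a new detail: what the JS generates is tied to the User-Agent. In other words, you cannot use fake_useragent's random feature and send requests with a randomly chosen UA string.)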
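
Given that dependence, one way to avoid a mismatch is to build the request headers from the cookies and a single UA string in one place. The sketch below assumes the cookie values printed earlier; `build_headers` is a hypothetical helper. With Selenium, the browser's actual UA can be read with `driver.execute_script("return navigator.userAgent")` and passed in as `user_agent`.

```python
def build_headers(cookies, user_agent):
    """Build request headers from browser cookies and the matching UA.

    `cookies` is a list of dicts with 'name' and 'value' keys, as returned
    by driver.get_cookies(). `user_agent` should be the same string the
    browser sent when the cookies were issued.
    """
    cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
    return {"User-Agent": user_agent, "Cookie": cookie_header}


# Values copied from the cookie output shown earlier.
cookies = [
    {"name": "__jsluid", "value": "73da79ccc591971704ffebff501eb26e"},
    {"name": "__jsl_clearance",
     "value": "1555407450.549|0|Ad6%2B78qFTS188pb2kOoKzQtjo2Y%3D"},
]
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36")
headers = build_headers(cookies, ua)
print(headers["Cookie"])
```

The resulting dict can be passed as the `headers` argument of `requests.get`.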

OK! Everything seems to be done, so let's add the headless flag and test again:

from selenium import webdriver
# add the headless flag
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)
driver.get("http://www.66ip.cn/mo.php?tqsl=1024")
cookie = driver.get_cookies()
str_cookie = ""
for data in cookie:
    str_cookie += data["name"] + "=" + data["value"] + "; "
str_cookie = str_cookie[:-2]
response = requests.get("http://www.66ip.cn/mo.php?tqsl=1024",headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
            "cookie" : str_cookie})
print(response)
# <Response [521]>

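Oddly, cookies obtained in headless mode cannot be used with requests.get(): requests made with them are still blocked by the anti-scraping check. My guess is that headless startup sets window['__phantomas'] so that it is no longer "undefined", which corrupts the final output and makes validation fail. This is unverified. Given the User-Agent dependence noted above, another candidate is that headless Chrome's default User-Agent contains HeadlessChrome and so does not match the header sent by requests.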

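Running Chrome in GUI mode would do in a pinch, but settling for it means a Chrome window flashing up on every run, which rather ruins Python's elegance (really, I just don't want to be interrupted while gaming with the crawler running in the background) (lol).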

So let's start from the JS itself. Running it through an online JS formatter gives the formatted code:

var x = "@@@17@chars@d@substr@return@while@fromCharCode@eval@join@firstChild@Tue@RegExp@onreadystatechange@for@2@5@if@@Array@toString@parseInt@as@1555403838@hantom@@catch@09@innerHTML@__p@addEventListener@attachEvent@@Apr@2Bn@gCZ@div@window@@@5L@new@@@@GMT@reverse@@@@@rOm9XFMtA3QKV7nYsPGT4lifyWwkq5vcjH2IdxUoCbhERLaz81DNB6@function@DOMContentLoaded@location@match@0xFF@replace@@challenge@String@@href@SGZ@@16@false@@cookie@Path@@@e@0xEDB88320@charAt@8@g@@@f@@Expires@length@https@@search@37@4@0@@36@@1@@a@QM@try@@@split@FWC@1500@D@setTimeout@@toLowerCase@JgSe0upZ@@captcha@19@@@18@@@@else@var@@B6hQ@charCodeAt@pathname@document@@createElement@__jsl_clearance".replace(/@*$/, "").split("@"),
y = "3r 42=1o(){3d('1q.23=1q.40+1q.2q.1t(/[\\?|&]3i-20/,\\'\\')',3b);41.29='44=q.4|2t|'+(1o(){3r 42=m(+[[-~[]]+[j]]),2=['%',[-~[-~{}-~{}]],'3a%16',[!/!/+[]][2t].2f(-~[-~{}-~{}]),'17',[{}+[]][2t].2f(i-~[]-~{}-~{}),'1c',[-~((-~{}<<((+!!/!/)|-~(+!!/!/))))]+(19['11'+'r'+'p']+[]+[[]][2t]).2f((+!{})),'24',[-~(+[])-~[]+2s],'35',[-~(+[])-~[]+2s]+(-~[(-~{}+[-~{}-~{}]>>-~{}-~{})+(-~{}+[-~{}-~{}]>>-~{}-~{})]+[]+[[]][2t])+(-~[(-~{}+[-~{}-~{}]>>-~{}-~{})+(-~{}+[-~{}-~{}]>>-~{}-~{})]+[]+[[]][2t]),'3t%',(-~(+!!/!/)+[]+[[]][2t]),'3c'];h(3r 38=2t;38<2.2n;38++){42.1i()[38]=2[38]};8 42.c('')})()+';2m=e, 26-15-3j u:2r:3m 1h;2a=/;'};k((1o(){36{8 !!19.12;}t(2d){8 27;}})()){41.12('1p',42,27)}3q{41.13('g',42)}",
f = function(x, y) {
    var a = 0,
    b = 0,
    c = 0;
    x = x.split("");
    y = y || 99;
    while ((a = x.shift()) && (b = a.charCodeAt(0) - 77.5)) c = (Math.abs(b) < 13 ? (b + 48.5) : parseInt(a, 36)) + y * c;
    return c
},
z = f(y.match(/\w/g).sort(function(x, y) {
    return f(x) - f(y)
}).pop());
while (z++) try {
    // the key step
    eval(y.replace(/\b\w+\b/g,
    function(y) {
        return x[f(y, z) - 1] || ("_" + y)
    }));
    break
} catch(_) {}
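
To make the roles of `f` and the final loop more concrete, here is a Python re-expression of the decoding step. It is a reading aid under stated assumptions, not the site's code: `guess_base` and `unpack` are names I made up, and only the first base is tried, whereas the JS keeps incrementing `z` until eval succeeds. `f` reads a token as a number whose digits are 0-9, a-z, then A-Z. In the page shown here the largest digit in `y` is `u` (30), so the first base tried is 31 and a token such as `3r` reads as 3*31 + 27 = 120, a 1-based index into `x`.

```python
import re


def f(token, base=99):
    """Python port of the JS helper `f`: read `token` as a number in `base`.

    Digits 0-9 and a-z have values 0-35 (like parseInt(ch, 36)); A-Z have
    values 36-61 (charCode - 29 in the JS). Raises ValueError for other
    characters, where the JS would produce NaN.
    """
    value = 0
    for ch in token:
        digit = ord(ch) - 29 if "A" <= ch <= "Z" else int(ch, 36)
        value = digit + base * value
    return value


def guess_base(packed):
    """First base the JS loop tries: the largest digit in `packed`, plus one."""
    return max(f(ch) for ch in re.findall(r"[0-9A-Za-z]", packed)) + 1


def unpack(packed, words, base):
    """Replace each token with words[f(token) - 1], or '_' + token if empty."""
    def repl(match):
        token = match.group(0)
        try:
            index = f(token, base) - 1
        except ValueError:  # e.g. a token containing '_'
            return "_" + token
        word = words[index] if 0 <= index < len(words) else ""
        return word or "_" + token
    return re.sub(r"\b\w+\b", repl, packed, flags=re.ASCII)


# The first eight entries of the word table `x` shown above.
words = ["", "", "", "17", "chars", "d", "substr", "return"]
print(unpack("8 4", words, guess_base("3r 42=1o u")))  # return 17
```

Tokens that map to an empty entry get a leading underscore, which is consistent with names such as `_42` appearing in the decoded code.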

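You can see that the generated string is finally executed with eval. Replace eval with console.log, paste the code into the browser console, and look at the output:

That gives us new code, which reads as follows once formatted: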

var _42 = function() {
    setTimeout('location.href=location.pathname+location.search.replace(/[\?|&]captcha-challenge/,\'\')', 1500);
    // generate the cookie
    document.cookie = '__jsl_clearance=1555403838.17|0|' + (function() {
        var _42 = Array( + [[ - ~ []] + [5]]),
        _2 = ['%', [ - ~ [ - ~ {} - ~ {}]], 'FWC%2Bn', [!/!/ + []][0].charAt( - ~ [ - ~ {} - ~ {}]), 'gCZ', [{} + []][0].charAt(2 - ~ [] - ~ {} - ~ {}), '5L', 
        [ - ~ (( - ~ {} << (( + !!/!/) | -~ ( + !!/!/))))] + (window['__p' + 'hantom' + 'as'] + [] + [[]][0]).charAt(( + !{})), 'SGZ', [ - ~ ( + []) - ~ [] + 4], 'QM', 
        [ - ~ ( + []) - ~ [] + 4] + ( - ~ [( - ~ {} + [ - ~ {} - ~ {}] >> -~ {} - ~ {}) + ( - ~ {} + [ - ~ {} - ~ {}] >> -~ {} - ~ {})] + [] + [[]][0]) + 
        ( - ~ [( - ~ {} + [ - ~ {} - ~ {}] >> -~ {} - ~ {}) + ( - ~ {} + [ - ~ {} - ~ {}] >> -~ {} - ~ {})] + [] + [[]][0]), 
        'B6hQ%', ( - ~ ( + !!/!/) + [] + [[]][0]), 'D'];
        for (var _38 = 0; _38 < _2.length; _38++) {
            _42.reverse()[_38] = _2[_38]
        };
        return _42.join('')
    })() + ';Expires=Tue, 16-Apr-19 09:37:18 GMT;Path=/;'
};
if ((function() {
    try {
        return !! window.addEventListener;
    } catch(e) {
        return false;
    }
})()) {
    document.addEventListener('DOMContentLoaded', _42, false)
} else {
    document.attachEvent('onreadystatechange', _42)
}
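
As an aside on reading the code above: much of the noise is integer arithmetic in disguise. In JavaScript, `~[]` and `~{}` both evaluate to -1 because the operand is coerced to 0 first, so `-~[]` and `-~{}` are 1, and in general `-~n` is `n + 1`. Python has no such coercion, but the bitwise identity holds for Python integers, so this small sketch mirrors a few of the expressions with the coerced values written out by hand (the variable names are mine):

```python
# -~n == n + 1 for any integer, because ~n == -n - 1.
assert all(-~n == n + 1 for n in range(-5, 6))

EMPTY = 0  # what [] and {} become once JS coerces them for the ~ operator

one = -~EMPTY           # JS: -~[]  or  -~{}
two = -~EMPTY - ~EMPTY  # JS: -~{}-~{}
three = -~two           # JS: -~[-~{}-~{}]  (the array [2] coerces to 2)
print(one, two, three)  # 1 2 3
```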

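You can see that the generated cookie is assigned to document.cookie. Copy the statement that builds it, run it on its own, and check the result:

NICE! It looks like everything is sorted.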

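To recap the approach: request the API and get <Response [521]>, save the cookie from the response headers, run the JS, merge the cookie generated by the JS with the one from the headers, then request the API again to get the data.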
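
On the merge step, note that both sources carry attributes (max-age, Expires, Path, HttpOnly) after the leading name=value pair, and only the pairs belong in a Cookie request header. Here is a minimal sketch that keeps just the pairs. `cookie_pair` and `merge_cookies` are hypothetical helpers, and the token after `|0|` in the second string is a placeholder rather than a real clearance value.

```python
def cookie_pair(set_cookie_value):
    """Return only the leading 'name=value' part of a Set-Cookie style string."""
    return set_cookie_value.split(";", 1)[0].strip()


def merge_cookies(*set_cookie_values):
    """Join the name=value pairs of several cookie strings into one header value."""
    return "; ".join(cookie_pair(v) for v in set_cookie_values)


# Set-Cookie value from the 521 response headers shown earlier.
from_header = ("__jsluid=800e5382bd0c39f56b244d87cf2615a3; "
               "max-age=31536000; path=/; HttpOnly")
# Shape of the string the JS assigns to document.cookie (placeholder token).
from_js = ("__jsl_clearance=1555403838.17|0|PLACEHOLDER;"
           "Expires=Tue, 16-Apr-19 09:37:18 GMT;Path=/;")

merged = merge_cookies(from_header, from_js)
print(merged)
```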

Following this outline, here is the code (using js2py as the JS runtime):

import js2py
import requests

URL = "http://www.66ip.cn/mo.php?tqsl=1024"
UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763"


def main():
    response = requests.get(URL, headers={"User-Agent": UA})
    # Save the first cookie, taken from the 521 response's Set-Cookie header.
    cookie = response.headers["Set-Cookie"]
    # Strip the <script> tags and turn eval( into an assignment,
    # so the decoded source ends up in data1 instead of being executed.
    js = (response.text.replace("<script>", "").replace("</script>", "")
          .replace("{eval(", "{var data1 = (").replace(chr(0), chr(32)))
    # Use js2py's execution context to read back the data1 we just assigned.
    context = js2py.EvalJs()
    context.execute(js)
    js_temp = context.data1

    # Locate the start and end of the cookie-generating statement.
    index1 = js_temp.find("document.")
    index2 = js_temp.find("};if((")
    # Same trick again: swap the assignment target so we can read the value.
    js_temp = js_temp[index1:index2].replace("document.cookie", "data2")
    context.execute(js_temp)
    data = context.data2

    # Merge the cookies and request the site again.
    cookie += ";" + data
    response = requests.get(URL, headers={"User-Agent": UA, "cookie": cookie})
    return response


if __name__ == "__main__":
    print(main())

Check the return value: