When crawling the China Banking Regulatory Commission (CBRC) site, a first probe with Postman shows that the response is not the page itself but a snippet of JavaScript. The guess is that this JS sets a cookie dynamically, so copy the JS to a local file and run it to see what it does.
Inspecting the returned JS locally, we apply the usual trick for cracking anti-crawler JS: replace eval with return. In the figure, eval is first replaced with console.log so the code the script would otherwise execute gets printed out instead.
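The eval-to-return substitution can be sketched in isolation. Note the `first_stage` string below is a made-up, much-shortened stand-in for the real first-stage script, used only to show the string surgery:

```python
# Hypothetical first-stage response (illustrative stand-in, not the real CBRC script):
first_stage = "<script>eval(\"document.cookie='__jsl_clearance=abc;Path=/;'\")</script>"

# Strip the <script> wrapper so only the JS body remains.
js = first_stage.replace('<script>', '').replace('</script>', '')

# Wrap it in a function and rewrite eval( into return( so that, when called,
# the function hands back the second-stage source instead of executing it.
js = 'function getEval() {' + js + '}'
js = js.replace('eval(', 'return(')
print(js)
```

Calling `getEval()` in a JS engine would now return the second-stage source as a string.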
It turns out that running this JS returns yet another piece of JS, and in that second script you can find the code that sets the cookie. So the crawler only needs to locate this script programmatically and simulate its execution.
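Once the second-stage JS has been evaluated, the cookie value is pulled out of the resulting `document.cookie` string with two plain splits. A minimal sketch, with a made-up cookie value (the real one is computed per request):

```python
# Hypothetical cookie string produced by the second-stage JS (illustrative value):
jsl_clearance = "__jsl_clearance=1555000000.123|0|abcdef;Max-age=3600;Path=/;"

# Keep the first "name=value" attribute, then keep only the value part.
jsl_cle = jsl_clearance.split(';')[0].split('=')[1]
print(jsl_cle)  # 1555000000.123|0|abcdef
```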
In short: execute the JS returned by the first request with a Python JS engine, execute the JS that this in turn returns, and finally attach the resulting cookie to the session. That completes the crawl.
Here is the code:
import execjs
import requests


class YinjianSpider(object):
    def __init__(self, url):
        self.url = url
        self.session = requests.Session()
        self.session.headers = {
            'Connection': "keep-alive",
            'Upgrade-Insecure-Requests': "1",
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
            'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            'Referer': self.url,
            'Host': "www.cbrc.gov.cn",
        }

    def add_cookie(self, html):
        # Strip the <script> wrapper from the first response, keeping
        # everything up to the script's closing brace.
        js_code1 = html.text.strip()
        js_code1 = js_code1.replace('</script>', '').replace('<script>', '')
        index = js_code1.rfind('}')
        js_code1 = js_code1[0:index + 1]
        # Wrap the script in a function and rewrite eval( into return(
        # so that calling it returns the second-stage JS instead of running it.
        js_code1 = 'function getEval() {' + js_code1 + '}'
        js_code1 = js_code1.replace('eval(', 'return(')
        js_code2 = execjs.compile(js_code1)
        code = js_code2.call('getEval')
        # Cut the document.cookie assignment out of the second-stage JS and
        # turn it into "var a=...;return a;" so it returns the cookie string.
        code = 'var a' + code.split('document.cookie')[1].split("Path=/;'")[0] + "Path=/;';return a;"
        # The second-stage JS references window; give it an empty stub.
        code = 'window = {}; \n' + code
        js_final = "function getClearance(){" + code + "};"
        # The blanket eval->return substitution above also hit the embedded
        # second-stage source; restore "return eval" so the cookie expression
        # actually evaluates.
        js_final = js_final.replace("return return", "return eval")
        ctx = execjs.compile(js_final)
        jsl_clearance = ctx.call('getClearance')
        # jsl_clearance looks like "__jsl_clearance=...;Max-age=...;Path=/;"
        jsl_cle = jsl_clearance.split(';')[0].split('=')[1]
        self.session.cookies['__jsl_clearance'] = jsl_cle

    def run(self):
        # First request returns the anti-crawler JS, not the page.
        html = self.session.get(self.url)
        # print(html.text)
        self.add_cookie(html)
        # With __jsl_clearance set, the second request returns the real page.
        web = self.session.get(self.url)
        print(web.text)


def main():
    url = "http://www.cbrc.gov.cn/chinese/newShouDoc/051BBB322CED45E2A077428FA8594A44.html"
    yj = YinjianSpider(url)
    yj.run()


if __name__ == '__main__':
    main()