Web Scraping: Notes on a Curious Async-Request Crawl

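A requirement at work: scrape company information from a certain "XX-cha" business-lookup site. To avoid a lawyer's letter, I won't name that friendly site anywhere in this post. This article is mainly about one approach to getting past anti-scraping measures. If you are a beginner and run into problems while scraping, don't panic. There are no paranormal events in development, and every strange result has a cause. Settle down and untangle it one step at a time; after all, anti-scraping measures are written by people too. The post may look long, but don't let that scare you, since much of it is screenshots.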

Fetching data from an asynchronous response is routine, so why single this one out to share? Let me walk you through it.

This site's anti-scraping is quite an experience. Here is how it went this time:

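1. I found the request URL for the target data in the developer tools and built the request with requests, sending it through a session object that carried the cookies;

2. The response did not match what the browser received;

3. On inspection, the cookie on this request in the browser differed from the cookie on the other requests in the same session, which meant JS had to be meddling with it, so I went looking for the JS;

4. Sure enough, the JS changed a few values in the cookie while it ran, and not arbitrarily: before the target request, it first sent a request to fetch the latest cookie;

5. I then looked at the response to that latest-cookie request, and it was a list made up entirely of numbers;

6. Reading further into the JS, it turned out that the backend sends a series of numbers, the frontend converts them to a string with fromCharCode(), that string is itself JS code, and the latest cookie values are extracted from it.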
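Steps 5 and 6 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the site's actual logic: the sample payload, the cookie name `token`, and the assumption that the decoded script sets cookies through `document.cookie = '...'` assignments are all made up for illustration. Only the idea of turning a list of character codes back into JS source comes from the steps above.

```python
import re


def decode_char_codes(codes):
    """Rebuild a string from character codes, like String.fromCharCode(...codes) in JS.

    For the ASCII-range codes expected in a script, chr() gives the same result.
    """
    return "".join(chr(code) for code in codes)


# Assumption: the decoded script sets cookies with document.cookie = 'name=value;...'.
# The real script may use a different form, so adjust the pattern after reading it.
COOKIE_ASSIGNMENT = re.compile(
    r"""document\.cookie\s*=\s*['"]([^=;'"]+)=([^;'"]*)"""
)


def extract_cookies(js_source):
    """Collect name/value pairs from every document.cookie assignment found."""
    return {
        name.strip(): value
        for name, value in COOKIE_ASSIGNMENT.findall(js_source)
    }


# Made-up payload standing in for the numeric list returned by the backend.
sample_js = "document.cookie='token=abc123;path=/';"
codes = [ord(ch) for ch in sample_js]

decoded = decode_char_codes(codes)
fresh = extract_cookies(decoded)
print(decoded)
print(fresh)
```

The extracted values would then be merged into the cookies sent with the target request, for example with `cookies.update(fresh)` before calling `requests.get`.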

Here is the same process with screenshots and code.

1. Below is the site's front-end page. I need the detail content in the "Enterprise Map" (企业图谱) section.

2. Click "View details" (查看详情). (The page below actually has a background image with the site's logo on it; I removed it in the front end XD)

3. In the developer tools, find the request that returns the data above.

4. Try to simulate that request. Unsurprisingly, it comes back 403.

import requests


# For convenience, copy the cookie straight from the developer tools to simulate being logged in
cookie = '_ga=GA1.2.1955355838.1558056696; _gid=GA1.2.54470962.1558056696; bannerFlag=undefined; _rutm=d46be5d8a6024caf8e35921087eade99; rtoken=ae0769f415f4409c8165f231f1231032; Hm_lpvt_e92c8d65d92d534b0fc290df538b4758=1558059476; Hm_lvt_e92c8d65d92d534b0fc290df538b4758=1558056696; auth_token=eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxODc0NTY4ODY1NSIsImlhdCI6MTU1ODA1OTQ2OSwiZXhwIjoxNTg5NTk1NDY5fQ.iJxAQ0jZi1biugR51cUkFcwsDQoz-Iv0bROVMpPD-TVbmXwCiBfvLpnobjfQ0EokimPoztzdr1mz2zm7CZ7n-w; tyc-user-info=%257B%2522claimEditPoint%2522%253A%25220%2522%252C%2522myAnswerCount%2522%253A%25220%2522%252C%2522myQuestionCount%2522%253A%25220%2522%252C%2522signUp%2522%253A%25220%2522%252C%2522explainPoint%2522%253A%25220%2522%252C%2522privateMessagePointWeb%2522%253A%25220%2522%252C%2522nickname%2522%253A%2522%25E9%2583%25AD%25E9%259D%2596%2522%252C%2522integrity%2522%253A%25220%2525%2522%252C%2522privateMessagePoint%2522%253A%25220%2522%252C%2522state%2522%253A%25220%2522%252C%2522announcementPoint%2522%253A%25220%2522%252C%2522isClaim%2522%253A%25220%2522%252C%2522vipManager%2522%253A%25220%2522%252C%2522discussCommendCount%2522%253A%25221%2522%252C%2522monitorUnreadCount%2522%253A%2522191%2522%252C%2522onum%2522%253A%25220%2522%252C%2522claimPoint%2522%253A%25220%2522%252C%2522token%2522%253A%2522eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiIxODc0NTY4ODY1NSIsImlhdCI6MTU1ODA1OTQ2OSwiZXhwIjoxNTg5NTk1NDY5fQ.iJxAQ0jZi1biugR51cUkFcwsDQoz-Iv0bROVMpPD-TVbmXwCiBfvLpnobjfQ0EokimPoztzdr1mz2zm7CZ7n-w%2522%252C%2522pleaseAnswerCount%2522%253A%25221%2522%252C%2522redPoint%2522%253A%25220%2522%252C%2522bizCardUnread%2522%253A%25220%2522%252C%2522vnum%2522%253A%25220%2522%252C%2522mobile%2522%253A%252218745688655%2522%257D; _gat_gtag_UA_123487620_1=1; RTYCID=cb7b8a8a10fb4057907f2b50dd0cd778; TYCID=ee79e4001de611e9bf3d3ff6d3a1d2e4; ssuid=2068339864; undefined=ee79e4001de611e9bf3d3ff6d3a1d2e4'
# Convert the cookie string to a dict
cookies = dict(pair.split("=", 1) for pair in cookie.split("; "))
# Set the request headers properly; this site checks even the header spoofing that every beginner knows
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15'
}
resp = requests.get('https://dis.tianyancha.com/dis/enterpriseMap.json?id=24416401', cookies=cookies, headers=headers).text
print(resp)

'''
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<h1>403 Forbidden</h1>
<p>You don't have permission to access the URL on this server. Sorry for the inconvenience.<br/>
Please report this message and include the following information to us.<br/>
Thank you very much!</p>
<table>
<tr>
<td>URL:</td>
<td>https://dis.tianyancha.com/dis/enterpriseMap.json?id=24416401</td>
</tr>
<tr>
<td>Server:</td>
<td>iz2zef8sue94bxg3w0librz</td>
</tr>
<tr>
<td>Date:</td>
<td>2019/05/17 10:29:25</td>
</tr>
</table>
<hr/>Powered by Tengine</body>
</html>
'''
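A 403 like this is where the cookie comparison from step 3 of the list comes in. One way to do that comparison, as a rough sketch: copy the Cookie header of the target request and of any other request in the same browser session from the developer tools, then diff them. The helper names and the placeholder cookie strings below are hypothetical; real headers would be pasted in their place.

```python
def parse_cookie_header(header):
    """Turn a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        # Skip empty fragments and fragments without a value
        if not pair or "=" not in pair:
            continue
        name, value = pair.split("=", 1)
        cookies[name] = value
    return cookies


def diff_cookies(baseline, target):
    """Return {name: (baseline_value, target_value)} for every cookie that differs.

    A value of None means the cookie is missing on that side.
    """
    changed = {}
    for name in sorted(set(baseline) | set(target)):
        old, new = baseline.get(name), target.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Placeholder strings; paste the two Cookie headers copied from the developer tools here.
other_request = parse_cookie_header("session=s1; token=old; theme=dark")
target_request = parse_cookie_header("session=s1; token=new; extra=1")

for name, (old, new) in diff_cookies(other_request, target_request).items():
    print(f"{name}: {old!r} -> {new!r}")
```

Any cookie that shows up in the diff is a candidate for a value rewritten by JS before the target request was sent.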