Scraping Youdao Translate
Building the crawler comes down to finding three pieces: the url, the headers, and the form data. Copy them straight out of the Chrome DevTools Network panel, then make a few modifications.
url
# request URL
url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Connection": "keep-alive",
    # Content-Length is computed by urllib from the encoded body, so it is not hardcoded here
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "YOUDAO_EAD_UUID=c7c1d171-272e-443f-9d07-d9f8c779342e;",
    "Host": "fanyi.youdao.com",
    "Origin": "http://fanyi.youdao.com",
    "Referer": "http://fanyi.youdao.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
form data
# request body
data = {
    "i": "girl",
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "1526736521130",  # salt: a long random string that defeats dictionary lookups
    "sign": "a18e780b545d559a3a1f9647b91c6ed0",  # sign: produced by JS-side encryption
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false"
}
Defeating the anti-scraping mechanism
JS encryption
In the form data sent with the request, the salt and sign fields are the two values produced by JS-side encryption.
salt
A random string that prevents cracking with a lookup dictionary of {plaintext: ciphertext} / {ciphertext: plaintext} pairs.
How to crack it:
In the Network panel, find fanyi.min.js and copy the code out of its response. Run it through an online formatter, paste the result into an editor, and locate where salt is generated: it is implemented as (new Date).getTime(). In JavaScript, getTime() returns the number of milliseconds elapsed between January 1, 1970 and the time held by the Date object. Translated into Python:
salt = int(time.time() * 1000) + random.randint(0, 10)
sign
sign is computed with the MD5 algorithm. In the JS code you can see that md5() takes a single argument built from four concatenated strings: the first is a constant client string, the second is the word to translate, the third is the salt described above, and the fourth is another constant (copy both constants verbatim from the JS).
u = 'fanyideskweb'  # constant client string
d = content         # the word to translate
f = str(int(time.time() * 1000) + random.randint(0, 10))  # salt
c = "rY0D^0'nM0}g5Mm1z%1G4"  # constant key copied from fanyi.min.js
sign = hashlib.md5((u + d + f + c).encode('utf-8')).hexdigest()
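The salt and sign steps can be folded into one helper. This is a sketch: the function name make_salt_and_sign is chosen here, and the two constant strings are the ones copied from fanyi.min.js above; Youdao rotates them between site versions, so they may need refreshing.

```python
import hashlib
import random
import time

def make_salt_and_sign(word):
    # hypothetical helper name; the constants come from fanyi.min.js
    # and may change when Youdao updates the site
    salt = str(int(time.time() * 1000) + random.randint(0, 10))
    raw = 'fanyideskweb' + word + salt + "rY0D^0'nM0}g5Mm1z%1G4"
    sign = hashlib.md5(raw.encode('utf-8')).hexdigest()
    return salt, sign
```

The returned pair drops straight into the salt and sign fields of the form data.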
request
Parameters: url, data, headers, method (the address, the form body, the header info, and the HTTP method).
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response
response = urllib.request.urlopen(request)
Extracting the translation
line = json.load(response)  # parse the returned JSON into Python objects
text = ''
for x in line['translateResult']:
    text += x[0]['tgt']
yd = text
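The loop above assumes translateResult is a list of sentence groups, each holding objects with a tgt field. Here is a minimal check against a hand-crafted sample; the JSON below imitates the shape of the real response but is made up for illustration.

```python
import json

# a made-up response in the shape the service returns
raw = '{"errorCode": 0, "translateResult": [[{"src": "girl", "tgt": "女孩"}]]}'
line = json.loads(raw)  # json.loads for a string; json.load for a file-like response

text = ''
for x in line['translateResult']:
    text += x[0]['tgt']  # take the first segment of each sentence group
```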
Improvement: User-Agent disguise
# in practice each of these should be a different browser's User-Agent string
user1 = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
user2 = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
user3 = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
user4 = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
# pick a random User-Agent for each request
headers['User-Agent'] = random.choice([user1, user2, user3, user4])
Improvement: IP proxies
# find some free, working proxy IPs online
iplist = ['118.31.220.3:8080', '221.228.17.172:8181', '219.141.153.4:80']  # proxy IPs and ports
# pick one at random as the proxy address for each request
dict1 = {'http': random.choice(iplist)}
proxy_support = urllib.request.ProxyHandler(dict1)
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
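Free proxies go stale quickly, so it helps to confirm the ProxyHandler really made it into the installed opener, and to expect urllib.error.URLError at request time when a proxy is dead. A sketch, reusing the placeholder proxy list above (those addresses are unlikely to still be live):

```python
import random
import urllib.request

iplist = ['118.31.220.3:8080', '221.228.17.172:8181', '219.141.153.4:80']

proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

# confirm the ProxyHandler is in the opener's handler chain;
# a dead proxy surfaces as urllib.error.URLError when urlopen runs,
# so real code should catch that and retry with another IP
installed = any(isinstance(h, urllib.request.ProxyHandler)
                for h in opener.handlers)
```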