7. Simulating browser access to hide Python's own identity
Principle: when a browser requests content from a server, the server inspects the User-Agent field in the request headers. If the User-Agent reveals Python, the request is treated as a crawler and the server may refuse it. To hide the crawler, we simulate a browser by supplying a browser's User-Agent; there are currently two ways to add it.
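To see what gives the crawler away, the short sketch below prints the header that urllib sends by default when no User-Agent is set (a minimal illustration; the exact version number depends on your Python installation):
import urllib.request
# a default opener carries ('User-agent', 'Python-urllib/3.x') in its headers,
# which is exactly what servers look for to detect Python crawlers
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.6')]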
Method 1: define a head dict first. Visit some page in a browser, copy the User-Agent value from its request headers, and pass the head dict as the headers argument when building the Request.
Code:
import urllib.request

# url and data are defined as in the full example further below
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
req = urllib.request.Request(url, data, head)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
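Note: the Request constructor's signature is Request(url, data=None, headers={}, ...), which is why the head dict can be passed directly as the third positional argument.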
Method 2: no head dict is defined in advance; instead, after creating the urllib.request.Request object, call add_header on it.
Code:
req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
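As a quick check (a small sketch, assuming req was built as above), the header registered by add_header can be read back from the Request object; note that urllib stores the header name in capitalized form:
print(req.get_header('User-agent'))  # the Chrome User-Agent string set above
print(req.headers)                   # {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; ...'}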
-----------------------------------------------------------------------------------------------------------------
Modify the Youdao Translate example so that it accesses the site in disguise.
Note: this example uses the second method to hide the crawler.
The second method is chosen because it needs no predefined head dict; calling add_header directly on the Request object is more concise.
Code:
import urllib.request
import urllib.parse
import json
content = input("Please enter the text to translate: ")
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
data = {}
data['i'] = content
data['from'] = 'AUTO'
data['to'] = 'AUTO'
data['smartresult'] = 'dict'
data['client'] = 'fanyideskweb'
data['salt'] = '1507902676814'
data['sign'] = 'f4bd4b3b948cbc76c913eafdd1853ed8'
data['doctype'] = 'json'
data['version'] = '2.1'
data['keyfrom'] = 'fanyi.web'
data['action'] = 'FY_BY_CLICKBUTTION'
data['typoResult'] = 'true'
data = urllib.parse.urlencode(data).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
target = json.loads(html)
print('Translation: %s' % (target['translateResult'][0][0]['tgt']))
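The last line assumes the response JSON is shaped roughly like the Python literal below; this is only inferred from the indexing target['translateResult'][0][0]['tgt'], not the API's documented format:
target = {
    'translateResult': [[{'tgt': 'translated text here'}]],
    # ... other fields omitted
}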
8. Using a proxy for crawling
If the crawler sends requests too frequently, the server will block it. There are generally two countermeasures:
one is to add a delay between requests;
the other is to use a proxy.
One: using a delay
import urllib.request
import urllib.parse
import json
import time
while True:
    content = input("Please enter the text to translate (enter 'q!' to quit): ")
    if content == 'q!':
        break
    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
    data = {}
    data['i'] = content
    data['from'] = 'AUTO'
    data['to'] = 'AUTO'
    data['smartresult'] = 'dict'
    data['client'] = 'fanyideskweb'
    data['salt'] = '1507902676814'
    data['sign'] = 'f4bd4b3b948cbc76c913eafdd1853ed8'
    data['doctype'] = 'json'
    data['version'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['action'] = 'FY_BY_CLICKBUTTION'
    data['typoResult'] = 'true'
    data = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url, data)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    target = json.loads(html)
    print('Translation: %s' % (target['translateResult'][0][0]['tgt']))
    time.sleep(5)  # wait 5 seconds before the next request
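A fixed 5-second interval is itself a recognizable pattern; a common refinement (a small sketch, not part of the original example) is to randomize the wait:
import random
import time
time.sleep(random.uniform(3, 8))  # pause a random 3-8 seconds between requests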
Two: using a proxy
Steps:
Step 1: the argument is a dict of the form {'protocol type': 'proxy ip:port'}
proxy_support = urllib.request.ProxyHandler({})
Step 2: customize and create an opener
opener = urllib.request.build_opener(proxy_support)
Step 3: install the opener
urllib.request.install_opener(opener)
Then call the opener:
opener.open(url)
Example:
import urllib.request
import random
url = 'http://www.whatismyip.com.tw'  # a page that reports the visitor's IP address
iplist = ['118.114.77.47:8080', '111.200.58.94:80']  # list of proxy IPs
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
# customize the header information
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')]
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
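Free proxies from such lists are often dead or slow, so a practical refinement (a sketch assuming the same iplist and User-Agent string as the example above) is to try the proxies one by one and keep the first that works:
import random
import urllib.request
url = 'http://www.whatismyip.com.tw'
iplist = ['118.114.77.47:8080', '111.200.58.94:80']  # same placeholder proxies as above
for ip in random.sample(iplist, len(iplist)):  # try the proxies in random order
    proxy_support = urllib.request.ProxyHandler({'http': ip})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')]
    try:
        html = opener.open(url, timeout=10).read().decode('utf-8')
        print(html)
        break  # stop at the first proxy that responds
    except OSError as e:  # URLError and socket timeouts are subclasses of OSError
        print('proxy %s failed: %s' % (ip, e))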