Web Crawler Notes: Disguising Requests and Using Proxies
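While building a crawler, we sometimes run into this situation: the crawler starts out fine and fetches data normally, but within the time it takes to drink a cup of tea, errors such as 403 Forbidden begin to appear. The usual cause is that the site has anti-crawler measures in place. For example, the server may count the requests a single IP makes per unit of time and, once the count exceeds a threshold, refuse service outright and return an error. This is where disguising the request and using a proxy come in handy.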
1. Modifying the User-Agent
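import urllib.request
url = 'http://httpbin.org/get'  # sample URL (an assumption here); replace it with the page you want to fetch
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
# Pass the headers by keyword: the second positional argument of Request is data (the POST body)
req = urllib.request.Request(url, headers=head)
response = urllib.request.urlopen(req)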
The requests library can be used instead.
Install it with: pip install requests
Link: Requests library.
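As a rough sketch of the same User-Agent change with requests: the helper name get_with_user_agent, the DEFAULT_USER_AGENT constant, and the 10-second timeout are assumptions made for illustration, not part of the original notes.

```python
import requests

# Same browser User-Agent string as in the urllib example.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)


def get_with_user_agent(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Hypothetical helper: GET url with a custom User-Agent and return the body text."""
    headers = {"User-Agent": user_agent}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses such as 403 Forbidden.
    response.raise_for_status()
    return response.text
```

Calling get_with_user_agent("http://httpbin.org/get") should return JSON in which the headers field echoes the User-Agent that was sent, which makes it easy to confirm the header took effect.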
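2. Using a Proxy
See: Using httpbin.org.
Proxy list: 全网代理IP.
Proxy list: 西刺免费代理IP (Xici free proxy IPs).
Use a high-anonymity IP and try several of them, since any given proxy may fail.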
import urllib.request
url = "http://httpbin.org/get?show_env=1"
proxy_support = urllib.request.ProxyHandler({"http":"121.237.148.207:3000"})
opener = urllib.request.build_opener(proxy_support)
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
req = urllib.request.Request(url,headers = head)
response = opener.open(req)
# install_opener() is the once-and-for-all approach: after calling it, urlopen() also goes through our custom proxy opener
#urllib.request.install_opener(opener)
#response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)
Output:
{
"args": {
"show_env": "1"
},
"headers": {
"Accept-Encoding": "identity",
"Cache-Control": "max-age=0",
"Host": "httpbin.org",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-5eaadea7-ee5a5c6007977d5673c60e80",
"X-Forwarded-For": "121.237.148.207",
"X-Forwarded-Port": "80",
"X-Forwarded-Proto": "http"
},
"origin": "121.237.148.207",
"url": "http://httpbin.org/get?show_env=1"
}
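Since any single free proxy may fail, the advice to try several can be sketched as a loop that moves on to the next candidate on error and accepts a proxy only if httpbin's "origin" field shows the proxy's IP, as in the output above. The function names, the "host:port" proxy format, and the 5-second timeout are assumptions made for illustration. The origin check is a strict heuristic: a proxy that exits through a different IP than the one it listens on would be rejected.

```python
import http.client
import json
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch url through one HTTP proxy given as "host:port"; return the decoded body."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def origin_matches_proxy(body, proxy):
    """Return True if the httpbin "origin" field contains only the proxy's IP."""
    try:
        origin = json.loads(body)["origin"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or no "origin" field.
        return False
    return isinstance(origin, str) and origin.strip() == proxy.split(":")[0]


def first_working_proxy(url, proxies, fetch=fetch_via_proxy, timeout=5):
    """Try each proxy in order; return (proxy, body) for the first one that
    responds and hides the real IP, or (None, None) if none does."""
    for proxy in proxies:
        try:
            body = fetch(url, proxy, timeout)
        except (OSError, http.client.HTTPException) as exc:
            # URLError, timeouts and refused connections are all OSError subclasses.
            print(f"{proxy} failed: {exc}")
            continue
        if origin_matches_proxy(body, proxy):
            return proxy, body
        print(f"{proxy} responded but did not hide the origin")
    return None, None
```

For example, first_working_proxy("http://httpbin.org/get?show_env=1", ["121.237.148.207:3000"]) tries the proxy from the example above. The fetch parameter exists so the loop can be exercised with a stand-in function instead of real network calls.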