Web crawler study notes: hiding the client and using proxies

sybs

Category: Notes. Tags: python

Article link: https://blog.csdn.net/qq_45774289/article/details/105858218

1. Modify the User-Agent
2. Use a proxy

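When writing a crawler, you sometimes run into this situation: the crawler starts out fine and fetches data normally, but a short while later it begins to fail, for example with 403 Forbidden. The usual cause is an anti-crawler measure on the site. For instance, the server may count the requests coming from each IP address per unit of time and, once the count exceeds some threshold, refuse service and return an error. This is where hiding the client and using a proxy come in.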
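As a rough illustration of what this failure looks like from Python, the sketch below wraps a request and reports a 403 instead of letting the exception propagate. The helper name `fetch_or_blocked` and its `open_fn` parameter are hypothetical, introduced only so the function can be handed either `urllib.request.urlopen` or the `open` method of a custom opener.

```python
import urllib.error
import urllib.request


def fetch_or_blocked(url, open_fn=urllib.request.urlopen, timeout=10):
    """Return (status, text); return (403, None) when the server refuses the request.

    open_fn is a hypothetical hook: any callable with the same call shape
    as urllib.request.urlopen(url, timeout=...).
    """
    try:
        with open_fn(url, timeout=timeout) as response:
            return response.status, response.read().decode('utf-8')
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # The server answered, but refused to serve this client.
            return 403, None
        raise  # other HTTP errors are left for the caller to handle
```

A caller can treat a `(403, None)` result as the signal to change the User-Agent or switch to a proxy, which are the two measures covered in these notes.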

1. Modify the User-Agent

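import urllib.request

url = "http://httpbin.org/get?show_env=1"  # test URL, the same one used in section 2
data = None  # no request body, so this is sent as a GET
head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

req = urllib.request.Request(url, data, head)  # positional order: url, data, headers
response = urllib.request.urlopen(req)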
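The reason this helps is that urllib identifies itself as `Python-urllib/<version>` unless told otherwise, which is easy for a server to spot. The sketch below shows that default and an alternative way to set the header with `Request.add_header()`. The function name `build_request` and the constant `BROWSER_UA` are made up for this illustration, and nothing is sent over the network.

```python
import urllib.request

# Browser User-Agent string taken from the example above.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36')


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request carrying a browser-like User-Agent (hypothetical helper)."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', user_agent)
    return req


# The header urllib would send by default, read from a freshly built opener.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']
print(default_ua)

req = build_request('http://httpbin.org/get')
# urllib stores header names capitalized as 'User-agent'.
print(req.get_header('User-agent'))
```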

You can also use the requests library.
Install it with: pip install requests

Link: Requests library.
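For comparison, here is a minimal sketch of the same User-Agent change with requests, assuming the library is installed. The names `make_session` and `fetch` are hypothetical helpers, not part of requests. Nothing is sent until `fetch` is called.

```python
import requests

# Browser User-Agent string taken from the urllib example above.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'),
}


def make_session(headers=HEADERS):
    """Create a Session whose requests all carry the given headers."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch(url, session, timeout=10):
    """GET the URL through the session and return the response text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx, e.g. 403
    return response.text
```

Calling `fetch('http://httpbin.org/get', make_session())` should echo the browser User-Agent back in the response body.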

2. Use a proxy

Reposted reference: using httpbin.org.

Quanwang proxy IPs.
Xici free proxy IPs.
Use high-anonymity IPs and try several of them, because any given one may not work.
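Since any single free proxy may fail, the advice to try several can be sketched as a loop that moves on to the next address whenever one fails. It relies on `urllib.request.ProxyHandler` and `build_opener`, the same calls used in the full example in this section. The function names and the `fetch` parameter are hypothetical, and the addresses in the usage comment are placeholders.

```python
import http.client
import urllib.request


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through one HTTP proxy given as 'host:port'."""
    handler = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode('utf-8')


def first_working_proxy(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order; return (proxy, body), or (None, None) if all fail."""
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except (OSError, http.client.HTTPException) as err:
            # URLError, HTTPError and socket timeouts are OSError subclasses;
            # HTTPException covers malformed replies from a broken proxy.
            print(f'{proxy} failed: {err}')
    return None, None


# Usage (placeholder addresses, replace with proxies from a current list):
# proxy, body = first_working_proxy('http://httpbin.org/get?show_env=1',
#                                   ['203.0.113.10:8080', '203.0.113.11:3128'])
```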

import urllib.request

url = "http://httpbin.org/get?show_env=1"

proxy_support = urllib.request.ProxyHandler({"http":"121.237.148.207:3000"})

opener = urllib.request.build_opener(proxy_support)

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'

req = urllib.request.Request(url,headers = head)
response = opener.open(req)
# install_opener() is the set-once approach: after calling it, urlopen() also goes through our custom proxy opener
#urllib.request.install_opener(opener)
#response = urllib.request.urlopen(req)

html = response.read().decode('utf-8')

print(html)

Output:

{
  "args": {
    "show_env": "1"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Cache-Control": "max-age=0", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-5eaadea7-ee5a5c6007977d5673c60e80", 
    "X-Forwarded-For": "121.237.148.207", 
    "X-Forwarded-Port": "80", 
    "X-Forwarded-Proto": "http"
  }, 
  "origin": "121.237.148.207", 
  "url": "http://httpbin.org/get?show_env=1"
}
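In the output above, both "origin" and "X-Forwarded-For" show the proxy address 121.237.148.207, which suggests the request really went through the proxy. That check can be sketched in code as below. The function name `proxy_report` is hypothetical, and the sketch assumes that a proxy which is not high-anonymity would expose the real address as an extra comma-separated entry in one of those two fields.

```python
import json


def proxy_report(body, proxy_ip):
    """Inspect an httpbin.org/get response body for the addresses the server saw."""
    data = json.loads(body)
    origin = data.get('origin', '')
    forwarded = data.get('headers', {}).get('X-Forwarded-For', '')
    # Either field may hold several comma-separated addresses.
    seen = {ip.strip() for ip in f'{origin},{forwarded}'.split(',') if ip.strip()}
    return {
        'via_proxy': proxy_ip in seen,            # the server saw the proxy
        'other_ips': sorted(seen - {proxy_ip}),   # anything else may be the real IP
    }
```

For the response shown above, `proxy_report(html, '121.237.148.207')` reports `via_proxy` as True with an empty `other_ips` list.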