Python3 Web Crawler (3): Using Proxies and Rotating IPs to Get Around Access Limits


凡凡不知所错


Category: Web Crawlers

Article link: https://blog.csdn.net/qq_43355223/article/details/85992567


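Some sites accept your requests at first but may ban your IP once you have been crawling for a while. The Requests library handles this through the proxies parameter: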

import requests
proxies = {
  'http': 'http://10.10.1.10:3128',
  'https': 'http://10.10.1.10:1080',
}

requests.get('https://www.taobao.com', proxies=proxies)
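
The keys of the proxies dictionary refer to the scheme of the URL being requested, not to the proxy itself, which is why the request to https://www.taobao.com above goes through the 'https' entry. The sketch below is a simplified re-implementation of that lookup for illustration only; pick_proxy is a hypothetical name, and Requests performs the equivalent step internally.

```python
from urllib.parse import urlsplit

# Same placeholder proxy addresses as in the example above.
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

def pick_proxy(url, proxies):
    """Hypothetical helper: return the proxy that applies to `url`, or None."""
    parts = urlsplit(url)
    # Most specific key first: scheme plus host, then scheme, then the "all" fallbacks.
    candidates = [
        f'{parts.scheme}://{parts.hostname}',
        parts.scheme,
        f'all://{parts.hostname}',
        'all',
    ]
    for key in candidates:
        if key in proxies:
            return proxies[key]
    return None

print(pick_proxy('https://www.taobao.com', proxies))  # http://10.10.1.10:1080
print(pick_proxy('http://www.taobao.com', proxies))   # http://10.10.1.10:3128
```

A URL whose scheme has no matching key gets no proxy at all, so a dictionary with only an 'http' entry leaves https requests going out from your own IP.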

If the proxy requires HTTP Basic Auth, put the credentials in the proxy URL:

import requests
proxies = {
    'https': 'http://user:password@10.10.1.10:3128/',
}
requests.get('https://www.taobao.com', proxies=proxies)
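
One thing to watch with credentials in the URL: characters such as @ or : in the user name or password have to be percent-encoded, otherwise the URL is split in the wrong place. A minimal sketch, assuming a hypothetical helper build_proxy_url and a made-up password:

```python
from urllib.parse import quote

def build_proxy_url(host, port, user=None, password=None, scheme='http'):
    """Hypothetical helper: build a proxy URL, percent-encoding any credentials."""
    if user is None:
        return f'{scheme}://{host}:{port}'
    credentials = quote(user, safe='')
    if password is not None:
        credentials += ':' + quote(password, safe='')
    return f'{scheme}://{credentials}@{host}:{port}'

# 'p@ss:word' is an invented password that contains both special characters.
proxy_url = build_proxy_url('10.10.1.10', 3128, 'user', 'p@ss:word')
proxies = {'http': proxy_url, 'https': proxy_url}
print(proxies['https'])  # http://user:p%40ss%3Aword@10.10.1.10:3128
```

The resulting dictionary can be passed to requests.get(..., proxies=proxies) exactly like the hand-written one above.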

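What if we want to visit a site from many different IPs?
Many sites limit how often a single IP may send requests.
One workaround is to scrape the addresses listed on a free-proxy site first, then crawl the target site through those addresses: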

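import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0'
}
r = requests.get("https://www.xicidaili.com/nt/", headers=headers, timeout=10)
soup = BeautifulSoup(r.text, 'lxml')
proxy_list = []
for row in soup.find_all('tr'):
    tds = row.find_all('td')
    if len(tds) < 3:
        continue  # skip the header row, which has no IP/port cells
    # Cell 1 holds the IP address and cell 2 the port
    ip_temp = 'http://' + tds[1].get_text(strip=True) + ':' + tds[2].get_text(strip=True)
    proxy_list.append(ip_temp)

# The proxy addresses are collected above; the loop below crawls the target site
target_url = 'https://example.com'  # placeholder: replace with the target site
run_times = 100000
for i in range(run_times):
    for item in proxy_list:
        proxies = {
            'http': item,
            'https': item,
        }
        print(proxies)
        try:
            requests.get(target_url, proxies=proxies, timeout=1)
            print('ok')
        except requests.RequestException:
            # Dead or slow proxy: move on to the next one
            # (a bare except would also swallow Ctrl+C)
            continue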
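
Free proxies die often, so retrying every address on every pass wastes most of the time on timeouts. The sketch below is one possible refinement, not part of the original approach: it picks proxies at random and drops any that keep failing. ProxyPool, fetch_with_rotation, max_failures and attempts are all hypothetical names chosen for illustration. The get argument is any callable with the signature of requests.get, which keeps the rotation logic independent of the network.

```python
import random

class ProxyPool:
    """Hypothetical helper: hands out random proxies and evicts ones that keep failing."""

    def __init__(self, proxy_urls, max_failures=3):
        self.failures = {url: 0 for url in proxy_urls}
        self.max_failures = max_failures

    def __len__(self):
        return len(self.failures)

    def pick(self):
        """Return a randomly chosen proxy URL; raise if the pool is empty."""
        if not self.failures:
            raise RuntimeError('no working proxies left')
        return random.choice(list(self.failures))

    def report_failure(self, url):
        """Count one failure and evict the proxy once it reaches max_failures."""
        if url not in self.failures:
            return
        self.failures[url] += 1
        if self.failures[url] >= self.max_failures:
            del self.failures[url]


def fetch_with_rotation(url, pool, get, attempts=5, timeout=5):
    """Try up to `attempts` proxies from the pool and return the first response.

    Returns None if every attempt fails or the pool runs empty.
    requests.RequestException derives from OSError, so catching OSError
    covers the connection errors and timeouts raised by Requests.
    """
    for _ in range(attempts):
        if len(pool) == 0:
            break
        proxy = pool.pick()
        try:
            return get(url, proxies={'http': proxy, 'https': proxy}, timeout=timeout)
        except OSError:
            pool.report_failure(proxy)
    return None
```

With the scraped list from above, usage would look like pool = ProxyPool(proxy_list) followed by fetch_with_rotation(target_url, pool, requests.get) for each page to crawl.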
