Building a Proxy IP Pool

Latest recommended article published on 2024-02-18 11:16:23

Panda4u

Tags: python

Article link: https://blog.csdn.net/panda4u/article/details/119255799

Preparing to scrape

Scraping the IP data

Checking IP availability

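In the previous post, my IP got blocked while scraping the Douban Movie Top250, and I gave two solutions:

1. Switch to a different WiFi network or hotspot;

2. Build a proxy IP pool.

So this post builds a proxy IP pool. In general, that means scraping proxy IP websites, putting each address into the form requests expects, and then passing it through the proxies parameter when requests accesses a site.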
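As a minimal sketch of that last step: requests takes a dict that maps the lowercase scheme of the target URL to a proxy URL. The helper name `make_proxies` is illustrative and not part of the original script, and it assumes a plain HTTP proxy, which is what free proxy lists usually offer.

```python
def make_proxies(ip, port):
    """Build the proxies mapping that requests expects for one HTTP proxy."""
    # The scheme in the value says how to reach the proxy itself; a plain
    # HTTP proxy is addressed with http:// even when the target is HTTPS.
    proxy_url = 'http://{}:{}'.format(ip, port)
    # The keys are the lowercase scheme of the *target* URL.
    return {'http': proxy_url, 'https': proxy_url}


print(make_proxies('106.45.104.146', 3256))
```

The result would then be passed as `requests.get(url, proxies=make_proxies(ip, port), timeout=3)`.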

Preparing to scrape

There are many free proxy IP websites:

Free Proxy IP http://ip.yqie.com/ipproxy.htm
66 Free Proxy http://www.66ip.cn/
89 Free Proxy http://www.89ip.cn/
Wuyou Proxy http://www.data5u.com/
Yun Proxy http://www.ip3366.net/
Kuaidaili https://www.kuaidaili.com/free/
Superfast Dedicated Proxy http://www.superfastip.com/
HTTP Proxy IP https://www.xicidaili.com/wt/
Xiaoshu Proxy http://www.xsdaili.com
Xila Free Proxy IP http://www.xiladaili.com/
Xiaohuan HTTP Proxy https://ip.ihuan.me/
Quanwang Proxy IP http://www.goubanjia.com/
Feilong Proxy IP http://www.feilongip.com/

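I scraped the Kuaidaili site.

Each page lists 15 entries, and the first page's URL is https://www.kuaidaili.com/free/inha/1, with the trailing number increasing by one for each following page. I scraped 150 entries, that is, 10 pages, but not all 150 are usable. IPs expire, and even paid proxies cannot reach 100% availability, so free ones certainly will not.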
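The pagination pattern above can be sketched as a small URL generator. The names `BASE_URL` and `page_urls` are illustrative; the URL template is the one used by the scraping script in this post, and the site's layout may have changed since it was written.

```python
# URL template for the free-proxy listing; {} is the 1-based page number
BASE_URL = 'https://www.kuaidaili.com/free/inha/{}/'


def page_urls(pages=10):
    """Return the listing URLs for pages 1..pages."""
    return [BASE_URL.format(n) for n in range(1, pages + 1)]


urls = page_urls()
print(urls[0])
print(urls[-1])
# At 15 entries per page, 10 pages give 10 * 15 = 150 candidate proxies
print(len(urls) * 15)
```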

Scraping the IP data

To build the proxy IP pool, we only need to scrape the IP and the port.

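Note the format requests expects: {'http': 'http://IP:port', 'https': 'http://IP:port'}. For example: {'http': 'http://106.45.104.146:3256', 'https': 'http://106.45.104.146:3256'}. The keys must be lowercase, because requests looks up the proxy by the scheme of the target URL and silently ignores keys such as 'HTTP'. The scheme inside the value describes how to connect to the proxy itself, so a plain HTTP proxy uses http:// for both entries.

Scraping was covered in the previous post and this page is very simple, so here is the code directly.

import time

import requests
from bs4 import BeautifulSoup


list_ip = []
list_port = []
list_headers_ip = []

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36 Edg/91.0.864.71'}

for start in range(1, 11):

    url = 'https://www.kuaidaili.com/free/inha/{}/'.format(start)       # 15 entries per page, 10 pages in total
    print("Processing url: ", url)

    response = requests.get(url=url, headers=headers, timeout=10)

    soup = BeautifulSoup(response.text, 'html.parser')

    # In each table row, column 1 is the IP and column 2 is the port
    ip = soup.select('#list > table > tbody > tr > td:nth-child(1)')
    port = soup.select('#list > table > tbody > tr > td:nth-child(2)')

    for i in ip:
        list_ip.append(i.get_text(strip=True))

    for i in port:
        list_port.append(i.get_text(strip=True))

    time.sleep(1)       # pause between pages; requesting too fast can return incomplete data

# Proxy format expected by requests:        'http':'http://119.14.253.128:8088'

# zip() pairs only what was actually scraped, so a short page cannot raise IndexError
for ip_addr, ip_port in zip(list_ip, list_port):
    proxy_url = 'http://{}:{}'.format(ip_addr, ip_port)
    proxies = {
        'http': proxy_url,
        'https': proxy_url
    }
    list_headers_ip.append(proxies)
    # print(proxies)

print(list_headers_ip)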

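That collects the 150 proxy IPs. As mentioned, some of them are useless, so we need to validate them. Put simply, we try to access a website through each one and see whether it works.

Checking IP availability

Use proxies to route a request through a specific IP, and set a timeout as well. The idea is: if a request to url through that IP succeeds within 3 seconds, the IP is usable; otherwise it is not.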

response = requests.get(url=url,headers=headers,proxies=ip,timeout=3)

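I used Douban Movie as the test site. Even simpler, you can test against the very site you want to scrape instead of looking for another one: proxies that pass can be used right away, and the rest are discarded.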

# Check whether each proxy IP is usable
def check_ip(list_ip):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36 Edg/91.0.864.71',
               'Connection': 'close'}
    # url = 'https://www.baidu.com'  # alternative test target: Baidu
    url = 'https://movie.douban.com/subject/1292052/'

    can_use = []
    for ip in list_ip:
        try:
            # Request the test page through the proxy; give up after 3 seconds
            response = requests.get(url=url, headers=headers, proxies=ip, timeout=3)
            if response.status_code == 200:
                can_use.append(ip)
        except requests.RequestException as e:
            # Connection errors and timeouts mean the proxy is unusable
            print(e)

    return can_use

can_use = check_ip(list_headers_ip)
print('Usable proxy IPs: ', can_use)
# for i in can_use:
#     print(i)
print('Number of usable proxy IPs: ', len(can_use))

# Save one proxy dict per line so the pool can be read back later
with open('IP代理池.txt', 'w', encoding='utf-8') as fo:
    for i in can_use:
        fo.write(str(i) + '\n')

In my run, 10 of the IPs turned out to be unusable.
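Checking 150 proxies one at a time with a 3-second timeout can take up to 150 × 3 s = 7.5 minutes in the worst case. As an optional variation that is not part of the original script, the checks could run in a thread pool. The sketch below assumes a caller-supplied `check` function that takes one proxies dict and returns True or False (for example, a wrapper around the single requests.get call above that catches its own exceptions); `filter_usable` and `max_workers=20` are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(candidates, check, max_workers=20):
    """Return the candidates for which check(candidate) is truthy, in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs check() in worker threads but yields results in input order
        results = list(pool.map(check, candidates))
    return [c for c, ok in zip(candidates, results) if ok]
```

Because the work is almost entirely waiting on the network, threads are enough here and no multiprocessing is needed.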

The usable IPs are written to a txt file, which can be read back whenever they are needed. Using them works the same way as checking them.
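Reading the file back could look like the following sketch. It assumes each line holds one proxies dict written with str(), as in the script above; the function names `load_pool` and `pick_proxy` are illustrative, and picking a proxy at random is one simple rotation strategy among several.

```python
import ast
import random


def load_pool(path='IP代理池.txt'):
    """Read back the proxy dicts that were written one per line with str()."""
    pool = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                # Each line is the text form of a dict; literal_eval parses
                # it without executing arbitrary code
                pool.append(ast.literal_eval(line))
    return pool


def pick_proxy(pool):
    """Choose one proxy dict at random so requests are spread across the pool."""
    return random.choice(pool)
```

A request would then use `requests.get(url, headers=headers, proxies=pick_proxy(pool), timeout=3)`.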

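Summary: building a proxy IP pool mainly comes down to scraping proxy websites with a crawler and then passing the chosen IP to requests through proxies.

I hope this helps. If you have questions, ask in the comments and I will do my best to answer them.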
