1. What is an IP proxy pool
- If you have studied web scraping, you probably know about User-Agent (UA) spoofing, and that is where the IP proxy pool comes in. An IP proxy pool is a technique used in web scraping, data mining, bypassing access limits, and similar scenarios: it routes your requests to the target website through proxy servers, so the site sees the proxy's IP address while your own IP address stays hidden. A minimal demo of this effect follows below.
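To see the effect for yourself, here is a minimal sketch (assuming the requests library; the proxy address below is a documentation placeholder you would swap for a live proxy, and httpbin.org/ip simply echoes back the IP it sees making the request):

import requests

# Placeholder proxy address -- replace with a live proxy from your pool.
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}

# httpbin.org/ip returns the IP address the request appears to come from.
print(requests.get('http://httpbin.org/ip', timeout=5).json())                    # your real IP
print(requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5).json())  # the proxy's IP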
2. Code for building the IP proxy pool
import time
import random
import requests
from lxml import etree
from fake_useragent import UserAgent


class ProxyFreePool:
    """
    Scrape free high-anonymity proxies from kuaidaili, test whether each
    one works, and build a pool of free proxy IPs.
    """
    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/inha/{}/'
        self.test_url = "http://baidu.com/"

    def get_proxy_pool(self, url):
        """
        function: extract the IPs and ports listed at a url
        in:       url: the page url to scrape
        out:      ip:   proxy IP
                  port: port number
        return:   None
        others:   Get IP & Port Func
        """
        headers = {'User-Agent': UserAgent().random}
        html = requests.get(url=url, headers=headers).text
        p = etree.HTML(html)
        tr_list = p.xpath("//table[@class='table table-bordered table-striped']/tbody/tr")
        for tr in tr_list[1:]:  # the first row is skipped (table header in the original layout)
            ip = tr.xpath("./td[1]/text()")[0].strip()
            port = tr.xpath("./td[2]/text()")[0].strip()
            self.test_proxy(ip, port)

    def test_proxy(self, ip, port):
        """
        function: test whether a single proxy IP works
        in:       ip:   proxy IP
                  port: port number
        out:      None
        return:   None
        others:   Test Proxy Func
        """
        proxies = {
            # requests expects the proxy URL itself to use the http scheme
            # even for https traffic (the proxy tunnels it via CONNECT)
            'http': 'http://{}:{}'.format(ip, port),
            'https': 'http://{}:{}'.format(ip, port)
        }
        try:
            headers = {'User-Agent': UserAgent().random}
            res = requests.get(url=self.test_url, headers=headers, timeout=2, proxies=proxies)
            if res.status_code == 200:
                print(ip, port, '\033[31musable\033[0m')
                # append working proxies to a local file, one "ip:port" per line
                with open("proxy.txt", "a") as f:
                    f.write(ip + ':' + port + '\n')
        except Exception:
            print(ip, port, 'unusable')

    def run(self):
        """
        function: program entry point
        in:       None
        out:      None
        return:   None
        others:   Program Entry Func
        """
        for i in range(1, 1001):
            url = self.url.format(i)
            self.get_proxy_pool(url=url)
            # pause between pages to avoid hammering the site
            time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    spider = ProxyFreePool()
    spider.run()
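Once run() has collected results, proxy.txt holds one working ip:port per line. A minimal sketch of drawing a random proxy from that file for a request (the target URL here is a placeholder):

import random
import requests

# Load the pool written by ProxyFreePool: one "ip:port" per line.
with open("proxy.txt") as f:
    pool = [line.strip() for line in f if line.strip()]

proxy = random.choice(pool)
proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}

# Placeholder target URL -- substitute the site you actually want to scrape.
res = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=3)
print(res.json())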
3. How to use it
- Anyone who has studied web scraping knows about sending requests with UA spoofing, and using the IP proxy pool in a Python crawler works the same way: the proxies are passed to the request just like the spoofed headers. With even a little scraping experience you will recognize the header / UA-spoofing setup below (either a random UA or one written out by hand), here extended with a proxies argument:
import requests
from fake_useragent import UserAgent

url = 'http://httpbin.org/ip'  # placeholder target URL

# option 1: a random User-Agent from fake_useragent
headers = {'User-Agent': UserAgent().random}
# option 2: a fixed User-Agent string written out by hand
headers = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36 Edg/112.0.1722.48'
}
proxies = {
    'https': 'http://117.29.228.43:64257',
    'http': 'http://117.29.228.43:64257'
}
requests.get(url, headers=headers, proxies=proxies, timeout=3)
- If you are just getting started, implement it in the simplest way first and study the more sophisticated setups later; one such next step, rotating proxies with retries, is sketched below.
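As a sketch of that next step (assuming the proxy.txt file produced above; the target URL is a placeholder and fetch_with_rotation is a hypothetical helper name), rotation means picking a random proxy per attempt and discarding ones that fail:

import random
import requests

def fetch_with_rotation(url, pool, retries=3):
    """Try the request through randomly chosen proxies, dropping dead ones."""
    for _ in range(retries):
        if not pool:
            break  # pool exhausted
        proxy = random.choice(pool)
        proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
        try:
            return requests.get(url, proxies=proxies, timeout=3)
        except requests.RequestException:
            pool.remove(proxy)  # this proxy failed; try another one
    raise RuntimeError('all proxy attempts failed')

with open('proxy.txt') as f:
    pool = [line.strip() for line in f if line.strip()]
print(fetch_with_rotation('http://httpbin.org/ip', pool).json())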