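How it works: scrape free proxy sites found online and return the fastest proxies (those with a response time of 2 seconds or less). As for how to use it: print the function's return value and see for yourself.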
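import re

import requests

# Module-level cache so the proxy sites are scraped at most once per process.
PROXY_IPS = []

# Dotted-quad IPv4 pattern; the dots are escaped so they match only a literal ".".
IP_PATTERN = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"


def get_proxy_ips() -> list:
    """Return a cached list of (ip, protocol) tuples scraped from free proxy sites."""
    global PROXY_IPS
    if not PROXY_IPS:
        contents = requests.get("https://www.kuaidaili.com/free/inha/", timeout=10).text
        ips = re.findall(f'<td data-title="IP">({IP_PATTERN})</td>', contents)
        # The data-title values are the page's own Chinese column names:
        # "类型" is the protocol type, "响应速度" is the response speed ("秒" = seconds).
        protocols = re.findall('<td data-title="类型">(HTTP|HTTPS)</td>', contents)
        speeds = re.findall('<td data-title="响应速度">(.*?)秒</td>', contents)
        # Map (ip, protocol) -> response time, then sort fastest first.
        # Compare as floats: sorting the raw strings would put "10" before "2".
        http_results = sorted(
            {(i, p): float(s) for i, p, s in zip(ips, protocols, speeds)}.items(),
            key=lambda x: x[1],
        )
        # Keep only proxies that responded within 2 seconds.
        PROXY_IPS = [key for key, seconds in http_results if seconds <= 2]
    if not PROXY_IPS:
        # Fallback source, used when the first site yields nothing.
        PROXY_IPS = re.findall(
            f"<td>({IP_PATTERN})</td>.*?<td>(HTTP|HTTPS)</td>",
            requests.get("https://ip.jiangxianli.com/?anonymity=1", timeout=10).text,
        )
    return PROXY_IPS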
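
To see what the return value looks like without hitting the network, the parsing step can be tried on a small inline HTML fragment. The following is only a sketch under assumptions: SAMPLE_HTML is a made-up fragment shaped like the rows the regexes above expect (the real pages may differ or change), the addresses come from documentation-only IP ranges, and parse_fast_proxies is a hypothetical helper name, not part of the original script.

```python
import re

# Hypothetical HTML fragment (an assumption, not copied from the real site).
# The addresses are from documentation-only ranges, not real proxies.
SAMPLE_HTML = """
<tr><td data-title="IP">192.0.2.10</td><td data-title="类型">HTTP</td><td data-title="响应速度">10秒</td></tr>
<tr><td data-title="IP">198.51.100.7</td><td data-title="类型">HTTPS</td><td data-title="响应速度">0.4秒</td></tr>
<tr><td data-title="IP">203.0.113.5</td><td data-title="类型">HTTP</td><td data-title="响应速度">2秒</td></tr>
"""

# Dotted-quad IPv4 pattern with escaped dots.
IP_RE = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"


def parse_fast_proxies(html: str, max_seconds: float = 2.0) -> list:
    """Return (ip, protocol) tuples no slower than max_seconds, fastest first."""
    ips = re.findall(f'<td data-title="IP">({IP_RE})</td>', html)
    protocols = re.findall('<td data-title="类型">(HTTP|HTTPS)</td>', html)
    speeds = re.findall('<td data-title="响应速度">(.*?)秒</td>', html)
    # Convert the response times to floats so they sort numerically.
    rows = sorted(zip(ips, protocols, map(float, speeds)), key=lambda row: row[2])
    return [(ip, protocol) for ip, protocol, seconds in rows if seconds <= max_seconds]


print(parse_fast_proxies(SAMPLE_HTML))
# → [('198.51.100.7', 'HTTPS'), ('203.0.113.5', 'HTTP')]
```

The result is a list of (ip, protocol) tuples, which is the same shape get_proxy_ips() returns. Note that no port is captured, so the entries cannot be passed to requests as proxies without further scraping.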