Python Crawler: Building an IP Proxy Pool


xiao龟


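Tags: building an IP proxy pool with Python

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_35340181/article/details/113661958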

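When writing Python crawlers, you will often run into sites with anti-scraping measures. Spoofing the request headers helps, but the site can still see your IP address and block it to stop you from scraping.

With the requests library, the proxies parameter sends a request through a proxy, which hides your own IP. Several sites publish free proxy IPs; you can scrape these, test them, and build an IP proxy pool from the ones that work.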
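As a minimal sketch of how the proxies parameter is filled in: requests picks the proxy entry whose key matches the scheme of the URL being requested, so this sketch registers the same proxy under both keys. The helper names `build_proxies` and `fetch_via_proxy` and the address 203.0.113.10:8080 are assumptions made for illustration; they are not part of the original article.

```python
from urllib.parse import urlsplit


def build_proxies(proxy_url):
    """Return a proxies mapping that sends http and https requests through proxy_url."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https") or not parts.hostname or parts.port is None:
        raise ValueError(f"not a usable proxy URL: {proxy_url!r}")
    # requests looks the proxy up by the scheme of the *target* URL,
    # so the same proxy is registered for both schemes here.
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxy_url, timeout=3):
    """Fetch url through proxy_url and return the response body as text."""
    import requests  # imported here so build_proxies works without requests installed

    response = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    response.raise_for_status()
    return response.text
```

For example, `fetch_via_proxy('http://icanhazip.com', 'http://203.0.113.10:8080')` would return the IP address the site sees, provided the (hypothetical) proxy is reachable.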

IP proxy sites:

(https://www.xicidaili.com/nt/)

(https://www.kuaidaili.com/free/intr/)

A commonly used way to spoof the User-Agent header:

from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}

Now for the main part.

Scraping the IPs (IPPool.py)

import requests
from lxml import etree
from fake_useragent import UserAgent

# Spoof the User-Agent header
ua = UserAgent()
headers = {'User-Agent': ua.random}


def get_ip():
    ip_list = []
    # Source page; proxies expire quickly, so only the first page is scraped
    url = 'https://www.xicidaili.com/nt/'
    # Send the request
    response = requests.get(url=url, headers=headers)
    # Set the encoding
    response.encoding = response.apparent_encoding
    html = etree.HTML(response.text)
    tr_list = html.xpath('//tr[@class="odd"]')
    for tr in tr_list:
        # IP address
        ip = tr.xpath('./td[2]/text()')[0]
        # Port
        port = tr.xpath('./td[3]/text()')[0]
        # Protocol
        agreement = tr.xpath('./td[6]/text()')[0]
        agreement = agreement.lower()
        # Assemble the full proxy URL
        ip = agreement + '://' + ip + ':' + port
        ip_list.append(ip)
    return ip_list


if __name__ == '__main__':
    ip_list = get_ip()
    print(ip_list)
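The assembly step above concatenates the scraped fields as they come. As a hedged sketch, the fields could be validated before they go into the list, so that a malformed table row does not produce an unusable proxy URL. The helpers `build_proxy_url` and `dedupe` are hypothetical names introduced here, and restricting the check to IPv4 addresses is an assumption about what the proxy sites list.

```python
import ipaddress


def build_proxy_url(ip, port, protocol):
    """Return 'scheme://ip:port', or None if any scraped field looks invalid."""
    ip, port, protocol = ip.strip(), port.strip(), protocol.strip().lower()
    if protocol not in ("http", "https"):
        return None
    try:
        ipaddress.IPv4Address(ip)  # raises ValueError for a malformed address
    except ValueError:
        return None
    if not (port.isascii() and port.isdigit()) or not 0 < int(port) < 65536:
        return None
    return f"{protocol}://{ip}:{port}"


def dedupe(proxy_urls):
    """Drop None entries and duplicates while keeping the original order."""
    return list(dict.fromkeys(u for u in proxy_urls if u is not None))
```

Inside the loop of `get_ip()`, the three XPath results could be passed to `build_proxy_url`, and the collected list run through `dedupe` before it is returned.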

Testing the IPs

Test method 1 (from multiprocessing.dummy import Pool)

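import requests
from multiprocessing.dummy import Pool

# Get the list of scraped IPs
from IPPool import get_ip

test_list = get_ip()

# Global list that holds the working IPs
ip_list = []

# Host used to test the proxies; it echoes the caller's IP
host = 'icanhazip.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
}


def ip_test(ip):
    # requests selects a proxy by the scheme of the target URL, so the test
    # URL must use the same scheme as the proxies key; otherwise the request
    # bypasses the proxy and always appears to succeed.
    scheme = ip.split(":")[0]
    url = scheme + '://' + host
    proxies = {scheme: ip}
    try:
        response = requests.get(url=url, headers=headers, proxies=proxies, timeout=3)
        response.raise_for_status()
        ip_list.append(ip)
        print(ip + " is usable")
    except requests.RequestException:
        print(ip + " is not usable")


if __name__ == '__main__':
    # multiprocessing.dummy.Pool is a thread pool with the Pool API
    pool = Pool(4)
    pool.map(ip_test, test_list)
    pool.close()
    pool.join()
    print(ip_list)
    print("Scraped %s IPs in total; usable: %s, not usable: %s" % (len(test_list), len(ip_list), len(test_list) - len(ip_list)))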
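The test script ends with a plain list of working IPs. As a rough sketch of how that list might be turned into a pool a crawler can draw from, the class below hands out a random proxy and lets the caller drop one that has stopped working. `ProxyPool` and its methods are hypothetical names introduced for this example, not something defined in the original article.

```python
import random
import threading


class ProxyPool:
    """Thread-safe set of proxy URLs with random selection and removal."""

    def __init__(self, proxy_urls=()):
        self._lock = threading.Lock()
        self._proxies = set(proxy_urls)

    def add(self, proxy_url):
        """Add a proxy URL that passed the test."""
        with self._lock:
            self._proxies.add(proxy_url)

    def remove(self, proxy_url):
        """Drop a proxy URL that has stopped working; unknown URLs are ignored."""
        with self._lock:
            self._proxies.discard(proxy_url)

    def get(self):
        """Return a random proxy URL, or None when the pool is empty."""
        with self._lock:
            if not self._proxies:
                return None
            return random.choice(tuple(self._proxies))

    def __len__(self):
        with self._lock:
            return len(self._proxies)
```

A crawler could create `pool = ProxyPool(ip_list)` after the test run, call `pool.get()` before each request, and call `pool.remove(proxy)` whenever a request through that proxy fails. Because free proxies expire quickly, the pool would need to be refilled by re-running the scrape and test steps.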
