Building a Local IP Proxy Pool with a Python Crawler

DJCWDCN

Category: Python. Tags: Python crawler, IP proxy pool

Article link: https://blog.csdn.net/djcwdcn/article/details/87364208

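I have been playing with web crawlers lately and found that some sites limit how often a single IP address may access them. So I wrote a short script that scrapes proxy IPs from a proxy-listing site for later use (the result is saved as a JSON file). The code is as follows: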

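import json
import re
import telnetlib  # note: deprecated since Python 3.11 and removed in Python 3.13
import urllib.request
import urllib.error

url = "https://www.xicidaili.com/wt"
header = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
# The file is saved in the current working directory (normally the directory containing this script)
name = input("Enter the output file name (default = D): ")
if name in ("d", "D"):
    name = "save_ip"

# Fall back to an empty page so the rest of the script still runs if the request fails
data = ""
try:
    req = urllib.request.Request(url)
    req.add_header('User-Agent', header)
    data = urllib.request.urlopen(req).read().decode("utf-8")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print("---Error code: ", e.code)
    if hasattr(e, "reason"):
        print("---Error reason: ", e.reason)

# Extract the proxy IPs and ports with a regular expression and return them as a dict {"ip": "port"}
def get_dict(data):
    # This assumes the only <td> cells made up purely of digits and dots are
    # the IP and the port, appearing in that order for each proxy
    ip_list = re.findall(r"<td>([.\d]+)</td>", data)
    ip_dict = {}
    # Walk the flat list two items at a time: ip_list[i] is the IP, ip_list[i+1] the port
    for i in range(0, len(ip_list) - 1, 2):
        ip, port = ip_list[i], ip_list[i + 1]
        try:
            # Keep only proxies that accept a connection; the timeout is 0.1 s
            telnetlib.Telnet(ip, port=int(port), timeout=0.1).close()
        except (OSError, ValueError, OverflowError):
            continue
        ip_dict[ip] = port
    return ip_dict

if __name__ == "__main__":
    dic = get_dict(data)
    file_name = name + ".json"
    # The with statement closes the file automatically
    with open(file_name, "w") as file:
        json.dump(dic, file)
    print("File saved!\nWorking proxy IPs:\n" + str(dic))
    # Wait for Enter so the console window does not close immediately
    a = input("...")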
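The script above relies on telnetlib, which was deprecated in Python 3.11 and removed from the standard library in Python 3.13. On newer interpreters the same "can I open a TCP connection?" check can be done with the socket module instead. The following is a minimal sketch of that substitution, not the original method: the function names parse_proxies, is_reachable and filter_proxies are made up for illustration, and the parsing step carries over the script's assumption that the purely numeric cells alternate between IP and port.

```python
import re
import socket


def parse_proxies(html):
    """Return (ip, port) string pairs from consecutive numeric <td> cells."""
    cells = re.findall(r"<td>([.\d]+)</td>", html)
    # Pair the cells up as (ip, port); a trailing unpaired cell is ignored
    return list(zip(cells[0::2], cells[1::2]))


def is_reachable(ip, port, timeout=0.1):
    """Return True if a TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError, OverflowError):
        # Refused, timed out, unresolvable host, or a malformed port
        return False


def filter_proxies(pairs, timeout=0.1):
    """Keep only the reachable proxies, as a dict {"ip": "port"}."""
    return {ip: port for ip, port in pairs if is_reachable(ip, port, timeout)}
```

With these helpers, get_dict(data) could be replaced by filter_proxies(parse_proxies(data)). Note that a successful TCP connection only shows that the port is open, not that the host actually works as a proxy, and a 0.1 s timeout may reject slow but usable proxies.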

Reading the file back:

import json

# Replace FileName.json with the name chosen when saving
with open("FileName.json", "r") as file:
    data = json.load(file)
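Once the pool has been loaded, each entry can be plugged into urllib through a ProxyHandler. The sketch below is one possible way to do this, assuming the saved entries are plain HTTP proxies. The helper names load_proxies, make_opener and fetch_via_pool are hypothetical and do not appear in the original script.

```python
import http.client
import json
import urllib.request


def load_proxies(path):
    """Load the {"ip": "port"} mapping written by the scraper."""
    with open(path, "r") as file:
        return json.load(file)


def make_opener(ip, port):
    """Build a urllib opener that routes http:// requests through ip:port."""
    # Assumption: the entry is a plain HTTP proxy
    handler = urllib.request.ProxyHandler({"http": f"http://{ip}:{port}"})
    return urllib.request.build_opener(handler)


def fetch_via_pool(target_url, proxies, timeout=5):
    """Try each proxy in turn; return the first response body, or None if all fail."""
    for ip, port in proxies.items():
        opener = make_opener(ip, port)
        try:
            with opener.open(target_url, timeout=timeout) as response:
                return response.read()
        except (OSError, http.client.HTTPException):
            # Dead, slow, or misbehaving proxy: move on to the next one
            continue
    return None
```

A call such as fetch_via_pool("http://example.com/", load_proxies("save_ip.json")) walks the pool in order until one proxy answers. Free proxies tend to go offline quickly, so it may help to re-run the scraper regularly.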
