Building a Dynamic IP Pool in 250 Lines of Code

killeri

Category: Python crawlers (excluding the Scrapy framework). Tags: building a dynamic IP pool, python

Article link: https://blog.csdn.net/killeri/article/details/80051476

Prerequisites: requests, BeautifulSoup, re, the Redis database, and Flask (only a little; copying my code is enough), plus some familiarity with Python classes and how to use them.

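When you crawl web pages, especially in bulk, some sites defend themselves against crawlers, and banning your IP is one of their methods. What do you do once your IP is banned? Simple: switch to another IP and keep crawling. But where do these IPs come from? You can buy them from a provider (a bit expensive), or you can collect free ones from the web, since most proxy platforms list some free proxies. Obviously the quality of these free proxies is poor: out of ten, there may not be a single one that works.
I'm a student with no money to buy IPs, so free ones it is. But I can't test them one by one by hand, so I decided to build an IP pool: scrape the free-proxy pages, test the results, keep the ones that work, and discard the rest.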
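Steps and approach
1. First, scrape the free-proxy sites to collect the free IPs.
2. Most of the scraped IPs (parsed with BeautifulSoup) will be useless, so the next step is to test each one with requests.
3. Store the working IPs in a database (Redis) so they can be fetched at any time.
4. IPs already in the database have a limited lifetime and stop working after a while, so we need a routine that continuously (or periodically) retests the stored IPs and discards the dead ones.
5. Expose an interface (Flask) so other programs can easily fetch the stored IPs.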
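
To make step 1 concrete, here is a minimal sketch of pulling address and port pairs out of a proxy table with a regular expression. The HTML fragment, the pattern, and the name `extract_proxies` are all made up for illustration; a real listing page will have its own markup and will need its own pattern or a BeautifulSoup lookup.

```python
import re

# Hypothetical HTML fragment standing in for a free-proxy listing page.
SAMPLE_HTML = """
<table><tbody>
<tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
<tr><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
</tbody></table>
"""

# Two adjacent cells: a dotted IPv4 address followed by a numeric port.
ROW_PATTERN = re.compile(
    r"<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*<td>\s*(\d+)\s*</td>"
)


def extract_proxies(html):
    """Return proxy URLs such as 'http://10.0.0.1:8080' found in html."""
    return ["http://%s:%s" % (ip, port) for ip, port in ROW_PATTERN.findall(html)]


proxies_found = extract_proxies(SAMPLE_HTML)
```

Each returned string already has the `http://ip:port` shape that requests expects in its `proxies` dictionary, so it can go straight into the test in step 2.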

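Let's go through the steps one by one and show the code for each.
The storage code comes first because everything later uses it. A file named IP_store.py handles storing and retrieving IPs, using the Redis list data structure.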

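# coding:utf-8

# Proxy storage: saves the scraped proxies to the database

from ProxyFile.config import *  # R, the Redis connection, is expected to come from here



class Redis_Operation:
    def put_head(self,ip):
        # Store a working IP address in Redis
        R.lpush('IP_list',ip)

    def get_head(self):
        # Take one IP from the head of the list
        return R.lpop('IP_list')

    def get_tail(self):
        # Take one IP from the tail of the list for checking
        return R.rpop('IP_list')

    def list_len(self):
        # Return the length of the list
        return R.llen('IP_list')

RO=Redis_Operation() # create an instance; the other files import it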
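
The file ProxyFile/config.py is not shown in this post; `R` is presumably a redis-py client created there. To see how the four list commands behave without a Redis server, here is a small in-memory stand-in. The class `FakeRedisList` is hypothetical and only imitates the calls used above, including the fact that Redis hands values back as bytes by default.

```python
from collections import deque


class FakeRedisList:
    """In-memory stand-in for the four Redis list commands used above."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        # Push onto the head; like Redis, keep the value as bytes.
        data = value.encode("utf-8") if isinstance(value, str) else value
        self._lists.setdefault(key, deque()).appendleft(data)
        return len(self._lists[key])

    def lpop(self, key):
        # Pop from the head, or return None when the list is empty.
        items = self._lists.get(key)
        return items.popleft() if items else None

    def rpop(self, key):
        # Pop from the tail, or return None when the list is empty.
        items = self._lists.get(key)
        return items.pop() if items else None

    def llen(self, key):
        return len(self._lists.get(key, ()))


R = FakeRedisList()
R.lpush("IP_list", "http://10.0.0.1:8080")  # stored first, ends up at the tail
R.lpush("IP_list", "http://10.0.0.2:3128")  # stored last, sits at the head
oldest = R.rpop("IP_list")  # tail: the entry that has waited longest
newest = R.lpop("IP_list")  # head: the most recently stored entry
```

This is why the pool pops from the tail when retesting and from the head when serving: the IP that has gone longest without a check is retested first, while callers get the one verified most recently.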

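Step two: scrape the pages and test whether each scraped IP works; the working ones are stored in the database.
The database is also capped at roughly 30 IPs (each parser stops once the list holds more than 30).
For this I created a file named page_parser.py.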

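# coding:utf-8
import requests,re # used to fetch and parse the pages
from bs4 import BeautifulSoup as BF
import threading # multithreading
from ProxyFile.IP_store import * # another file of mine that stores IPs in Redis (provides RO)
# Parse the free-proxy pages and collect the free proxies listed on each one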


class IP_page_parser:
    def __init__(self):
        pass

    def page_manong(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get('https://proxy.coderbusy.com/classical/https-ready.aspx',verify=False,headers=headers)
        # verify=False skips SSL certificate verification; the headers disguise the request as a browser request
        # Continue only if the page was actually returned
        if html.status_code == 200:
            Soup=BF(html.text,'lxml')
            tbody=Soup.find('tbody')
            tr_list=tbody.find_all('tr')
            for tr in tr_list:
                try:
                    IP_adress=tr.find('td').get_text().strip()
                    IP_port=tr.find('td',class_="port-box").get_text()
                    IP="http://"+IP_adress+":"+IP_port
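                    # Build the proxy URL by string concatenation
                    proxies={'http':IP}
                    try:
                        # The timeout keeps a dead proxy from hanging this thread
                        html=requests.get('http://www.baidu.com',proxies=proxies,timeout=5)
                        RO.put_head(IP)
                        print('valid IP')
                        if RO.list_len() > 30:
                            # Leave this function once more than 30 IPs are stored
                            return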
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('Manong proxy page request failed')

    def page_kuai(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get('https://www.kuaidaili.com/free/',headers=headers,verify=False)
        if html.status_code == 200:
            Soup=BF(html.text,'lxml')
            tbody=Soup.find('tbody')
            tr_list=tbody.find_all('tr')
            for tr in tr_list:
                try:
                    IP_adress=tr.find('td').get_text()
                    IP_port=tr.find('td',attrs={'data-title':"PORT"}).get_text()
                    IP="http://"+IP_adress+":"+IP_port
                    proxies={'http':IP}
                    try:
                        html = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('Kuaidaili page request failed')
    def page_xici(self):

        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        html=requests.get("http://www.xicidaili.com/",headers=headers,verify=False)

        if html.status_code == 200:
            htmltext=html.text
            pattern=re.compile(r'td.*?img.*?</td>\s*?<td>(.*?)</td>\s*?<td>(\d+)</td',re.S)
            IP_zu=pattern.findall(htmltext)
            for tr in IP_zu:
                try:
                    IP='http://'+tr[0]+':'+tr[1]
                    try:
                        proxies = {'http': IP}
                        html = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
        else:
            print('Xici proxy page request failed')

    def page_data5u(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}

        html = requests.get("http://www.data5u.com/free/gnpt/index.shtml", headers=headers, verify=False)
        if html.status_code == 200:
            Soup=BF(html.text,'lxml')
            li=Soup.find('li',style="text-align:center;")
            ul=li.find_all('ul',class_="l2")
            for tr in ul:
                try:
                    IP_adress=tr.find('span').get_text()
                    IP_port=tr.find('span',style="width: 100px;").get_text()
                    IP="http://"+IP_adress+":"+IP_port
                    try:
                        proxies = {'http': IP}
                        html = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
                        RO.put_head(IP)
                        if RO.list_len() > 30:
                            return
                        print('valid IP')
                    except Exception:
                        print('invalid IP')
                except Exception:
                    pass
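class run_parser:
    # Lets other files call the parser methods defined above,
    # starting one thread per proxy page
    def Run_Parser(self):
        x = IP_page_parser()
        process_list = []
        # Start several threads so the pages are scraped and tested at the same time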
        t1 = threading.Thread(target=x.page_manong, args=())
        process_list.append(t1)
        t2 = threading.Thread(target=x.page_kuai, args=())
        process_list.append(t2)
        t3 = threading.Thread(target=x.page_xici, args=())
        process_list.append(t3)
        t4 = threading.Thread(target=x.page_data5u, args=())
        process_list.append(t4)

        for i in process_list:
            i.start()
        for i in process_list:
            i.join()

RP=run_parser() # the instance of the class above that other files import

if __name__=='__main__':
    # Running this file directly does the same work as Run_Parser()
    RP.Run_Parser()

The block above is the longest piece of code, about 150 lines, and most of it is repetitive, so it is not hard to read through.
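
Since the four page methods repeat the same test-then-store logic, that part could be pulled into one helper. The following is only a sketch of that idea, not the code this post uses: `check_and_store` and its parameters are hypothetical names, and the network check and the Redis calls are passed in as callables so the demo can run with in-memory stand-ins.

```python
def check_and_store(candidates, is_alive, store, pool_size, limit=30):
    """Store each working proxy; stop once the pool holds more than `limit`.

    candidates: iterable of proxy URLs scraped from one page
    is_alive:   callable(proxy_url) -> bool, e.g. a requests-based check
    store:      callable(proxy_url), e.g. RO.put_head
    pool_size:  callable() -> int, e.g. RO.list_len
    Returns the number of proxies stored by this call.
    """
    stored = 0
    for proxy in candidates:
        if not is_alive(proxy):
            continue
        store(proxy)
        stored += 1
        if pool_size() > limit:
            break
    return stored


# Demo with in-memory stand-ins instead of the network and Redis.
pool = []
stored = check_and_store(
    ["http://10.0.0.1:80", "http://10.0.0.2:80", "http://10.0.0.3:80"],
    is_alive=lambda p: not p.endswith("2:80"),  # pretend the second one is dead
    store=pool.append,
    pool_size=lambda: len(pool),
)
```

With a helper like this, each page method would only need to produce its list of candidate URLs, and the validity check would live in one place.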
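That step does most of the work. Next we need to check whether the IPs already stored still work. Note that at any moment exactly one of these two steps (the one above, or this one) is running; how that is arranged is shown further down.
Here is the code (saved as list_IP_test.py, the name scheduler.py imports it under):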

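import requests
# Note the import from IP_store.py, which provides the RO instance
from ProxyFile.IP_store import *


class List_Ip_test:

    def get_and_test(self):
        # Take one IP from the tail of the list
        raw=RO.get_tail()
        if raw is None:
            # The list is empty, so there is nothing to test
            return
        ip=str(raw,encoding='utf-8')
        # Redis returns bytes, so the value has to be converted to str, and the encoding argument is required (see the advanced topics part of Learning Python)
        proxies = {'http': ip}
        # Test whether the IP still works
        try:
            html = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
        except requests.RequestException:
            # A dead proxy usually raises instead of returning a status code
            print('discarding dead IP')
            return
        if html.status_code == 200:
            RO.put_head(ip)
            print('valid IP')
        else:
            print('discarding dead IP')

LIT=List_Ip_test() # create an instance for other files to import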
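
The bytes-to-str conversion in the code above is easy to get wrong, so here is a tiny standalone illustration. The sample value is made up; the point is only what `str()` does with and without the `encoding` argument.

```python
raw = b"http://10.0.0.1:8080"  # the kind of value a Redis list pop returns by default

ip_a = str(raw, encoding="utf-8")  # the form used in the code above
ip_b = raw.decode("utf-8")         # the equivalent, more common spelling

# Without the encoding argument, str() gives the repr of the bytes object,
# which is not a usable proxy URL.
wrong = str(raw)
```

Both `ip_a` and `ip_b` are plain strings that can go into a `proxies` dictionary, while `wrong` still carries the `b'...'` wrapper.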

Good, now most of the work really is done. What remains is the code that calls everything.
I'll just post the code.
First, api.py, the interface file:

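# coding:utf-8

# The interface through which other programs get the working IPs this program has collected


from flask import Flask
from ProxyFile.IP_store import *


__all__ = ['app']

app = Flask(__name__)

@app.route('/')
def get_proxy():
    ip = RO.get_head()
    if ip is None:
        # The pool is empty; a Flask view must not return None
        return 'no proxy available', 503
    return ip

if __name__ == '__main__':
    app.run() # while this is running, opening localhost:5000 in a browser shows an IP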

Next is scheduler.py, which drives the whole program.

# coding:utf-8

# Drives the Redis-backed pool: checks stored IPs and adds new ones
from ProxyFile.page_parser import *
from ProxyFile.IP_store import *
from ProxyFile.list_IP_test import *
import time

class Add_and_Check:
    def add_and_check(self):
        # While the pool holds fewer than thirty IPs, scrape the proxy pages; otherwise keep testing whether the stored IPs still work
        # The program runs forever, executing either Run_Parser() or get_and_test() on each pass
        while True:
            if RO.list_len() < 30:
                RP.Run_Parser()
            else:
                LIT.get_and_test()
            time.sleep(30) # rest for a while before the next pass


AC=Add_and_Check()
AC.add_and_check()
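
The scheduler's rule is what guarantees that scraping and retesting never run at the same time: each pass picks exactly one of them based on the pool size. Here is that decision isolated as a pure function so it can be tried without Redis. The names `next_action` and `run_cycles` are hypothetical and only illustrate the rule.

```python
def next_action(pool_size, target=30):
    """Return 'crawl' while the pool is below target, otherwise 'check'."""
    return "crawl" if pool_size < target else "check"


def run_cycles(pool_sizes, target=30):
    """Replay a sequence of observed pool sizes and list the chosen actions."""
    return [next_action(size, target) for size in pool_sizes]


# Pool sizes as they might be observed on five consecutive passes.
actions = run_cycles([0, 12, 29, 30, 31])
```

Because retesting pops an IP and only pushes it back if it still works, a failed check drops the pool below the target, and the next pass goes back to scraping.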

**That is the complete code. To use an IP from another program, you can use this snippet:**

import requests

def get_proxy():
    r = requests.get('http://127.0.0.1:5000')
    return r.text # this is the usable IP we want
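
The pool can run dry, so the calling program should not assume the reply is always a proxy URL. Below is a small sketch of a guard for that case. The helper `proxies_from_reply` is a hypothetical name, and the check for an `http://` prefix is an assumption based on how the IPs are built earlier in this post.

```python
def proxies_from_reply(status_code, text):
    """Turn the pool service's reply into a requests-style proxies dict.

    Returns None when the reply does not look like a proxy URL.
    """
    candidate = text.strip()
    if status_code != 200 or not candidate.startswith("http://"):
        return None
    return {"http": candidate}


ok = proxies_from_reply(200, "http://10.0.0.1:8080\n")
empty = proxies_from_reply(500, "Internal Server Error")
```

In a crawler this would be used roughly as `requests.get(url, proxies=proxies_from_reply(r.status_code, r.text), timeout=5)`, where `r` is the reply from the pool service; passing `proxies=None` makes requests fall back to a direct connection.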


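**It works, but I still feel the program is not very robust, though I can't say exactly why. If you can spot the problem, please leave a comment and let me know. Thanks.**
Finally, a screenshot of the IPs in the database:
![Viewed with a Redis GUI tool](https://img-blog.csdn.net/20180423154621759?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2tpbGxlcmk=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)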
