Multithreaded Scraping of Free Proxy IPs with Validity Checking

Latest recommended article published 2021-06-15 11:24:15

BRUIN.

Category: Python Crawlers. Tags: multithreading, python, queue

Article link: https://blog.csdn.net/I_I___LO_VE___YA/article/details/104429662

First, import the required libraries.

import requests
from bs4 import BeautifulSoup
import time
import random
import csv
import threading
from queue import Queue

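The program runs two threads. The first is a spider thread, created by subclassing threading.Thread, that scrapes proxy IPs. It has two main methods: get_html() fetches the page source, and parse_html() parses that source and puts the IP data on the queue.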

class IpAgentSpider(threading.Thread):
    def __init__(self, que):
        super().__init__()

        self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
        self.que = que

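    def run(self):
        # The site's URL structure is simple, so loop over page numbers to build each URL.
        # Sleep a random 0-20 s after each page, or 0-50 s after a parse failure.
        for i in range(30, 2000):
            # Fetch the HTML
            html = self.get_html(i)
            try:
                # Parse the HTML, extract the IPs, and put them on the queue que
                self.parse_html(html)
            except Exception as e:
                print("Page parse error:", e)
                time.sleep(random.random() * 50)
            else:
                time.sleep(random.random() * 20)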

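    def get_html(self, i):
        url = 'https://www.kuaidaili.com/free/inha/{}/'.format(i)
        try:
            # A timeout keeps a stalled connection from hanging the thread
            response = requests.get(url, headers=self.headers, timeout=10)
        except Exception as e:
            # Falls through and returns None; parse_html() then fails and run() backs off
            print("Page %d:" % i, e)
        else:
            print("Fetched page %d" % i)
            return response.text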

    def parse_html(self, response):
        # Parse the HTML and extract the proxy data
        html = BeautifulSoup(response, 'lxml')
        tbody = html.find('tbody')
        trs = tbody.find_all('tr')

        for tr in trs:
            tds = tr.find_all('td')
            # Collect the proxy info as [type, ip, port]
            ip = tds[0].string
            port = tds[1].string
            tp = tds[3].string
            data = [tp, ip, port]
            # Put it on the queue
            print(self.que.qsize(), data)
            self.que.put(data)

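Scraping one page prints one line per proxy: the current queue size followed by the [type, ip, port] entry being queued.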
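
parse_html() assumes that every page contains a tbody and that every row has at least four cells. The sketch below shows one way to make that step tolerant of an empty or blocked page. It is only an illustration: extract_rows and SAMPLE_HTML are hypothetical names, the sample markup is made up rather than copied from the site, and it uses Python's built-in "html.parser" instead of lxml so that it has one less dependency.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (assumption); the real page layout may differ or change.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1.2.3.4</td><td>8080</td><td>anonymous</td><td>HTTP</td></tr>
    <tr><td>5.6.7.8</td><td>3128</td><td>anonymous</td><td>HTTPS</td></tr>
    <tr><td>broken row</td></tr>
  </tbody>
</table>
"""


def extract_rows(html_text):
    """Return a list of [type, ip, port]; hypothetical helper, not the article's code."""
    if not html_text:
        # get_html() returned None or an empty body
        return []
    soup = BeautifulSoup(html_text, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        # No table on the page, e.g. an error or anti-bot page
        return []
    rows = []
    for tr in tbody.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 4:
            # Skip rows that do not have the expected columns
            continue
        ip = tds[0].get_text(strip=True)
        port = tds[1].get_text(strip=True)
        tp = tds[3].get_text(strip=True)
        rows.append([tp, ip, port])
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Returning an empty list instead of raising lets the caller decide whether an empty page means "back off and retry" or "stop".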

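The second thread takes IPs from the queue and checks whether they work with test(), then stores the working ones with save(). It is also created by subclassing threading.Thread, which keeps the logic neatly encapsulated.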

class Test(threading.Thread):
    def __init__(self, que):
        super().__init__()

        self.que = que
        self.lock = threading.Lock()
        self.URL = 'http://www.whrenai.com/'

    def run(self):
        while True:
            data = self.test()
            if data:
                self.save(data)

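    def test(self):
        # Check whether a proxy works; reference: http://http.hunbovps.com/article-id-423.html
        # Block until an entry is available instead of polling qsize() in a busy loop
        data = self.que.get()
        # requests picks a proxy by the lowercase scheme of the target URL, so an
        # upper-case key such as "HTTP" would be ignored and the request would go out
        # directly. Use lowercase keys and an explicit proxy URL instead.
        proxy = 'http://{}:{}'.format(data[1], data[2])
        dic_data = {'http': proxy, 'https': proxy}
        try:
            response = requests.get(self.URL, timeout=8, proxies=dic_data)
            print("Checking:", dic_data, response.status_code)
        except Exception as e:
            print("Check failed:", e)
        else:
            # 39739 is the body length observed for a normal direct request to self.URL
            if len(response.text) < 39739:
                print("Proxy {} is invalid!".format(dic_data))
            else:
                print("Proxy {} is valid!".format(dic_data))
                return data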

    def save(self, data):
        # Append the working proxy to the CSV file
        with open('proxies.csv', 'a', newline='') as f:
            w = csv.writer(f)
            w.writerow(data)
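
One detail worth spelling out is the shape of the proxies argument. As far as I can tell, requests looks up the proxy by the lowercase scheme of the URL being requested ("http" or "https"), so the keys should be lowercase and the values are best written as full proxy URLs. A minimal sketch, assuming queue entries of the form [type, ip, port] as produced above; build_proxies is a hypothetical helper, and routing both schemes through the same proxy is an assumption made for illustration.

```python
def build_proxies(data):
    """Turn [type, ip, port] into a requests-style proxies dict.

    Hypothetical helper, not part of the original code.
    """
    _tp, ip, port = data
    # Assumption: the proxy itself is reached over plain HTTP,
    # which is the usual case for free HTTP proxies.
    proxy_url = "http://{}:{}".format(str(ip).strip(), str(port).strip())
    # Lowercase scheme keys; both schemes go through the same proxy here.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies(["HTTP", "1.2.3.4", "8080"])
print(proxies)
# Usage (needs network access, so it is not executed here):
#   requests.get("http://www.whrenai.com/", proxies=proxies, timeout=8)
```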

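To test whether a proxy works, I picked an arbitrary site: http://www.whrenai.com/
A direct request to it returned a normal response body of length 39739, so when a request through a proxy returns less than that, the proxy is most likely invalid.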

import requests

r = requests.get('http://www.whrenai.com/',headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'})
print(len(r.text))
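
Comparing against one exact length is fragile, because the page can change by a few bytes between requests. The sketch below is one possible way to loosen the check; it is not what the code in this post does. The baseline 39739 is the length measured above, while is_probably_valid, the tolerance parameter, and the marker parameter are all hypothetical names introduced here.

```python
def is_probably_valid(status_code, body, baseline_len=39739, tolerance=0.2, marker=None):
    """Heuristic check of a response fetched through a proxy (hypothetical helper)."""
    if status_code != 200:
        return False
    if marker is not None:
        # If a string known to appear on the real page is supplied, look for it
        return marker in body
    # Otherwise accept bodies that are not much shorter than the baseline
    return len(body) >= baseline_len * (1 - tolerance)


print(is_probably_valid(200, "x" * 39000))  # close to the baseline length
print(is_probably_valid(200, "x" * 100))    # far too short
print(is_probably_valid(502, "x" * 39739))  # wrong status code
```

A marker check is usually the more reliable of the two, since some proxies return a 200 page of their own (an ad or a login form) that can be of any length.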

Finally, the program entry point: create the queue, create the spider and tester threads, and pass the queue to both of them.

if __name__ == '__main__':
    que = Queue()       # create the shared queue
    s = IpAgentSpider(que)      # create the spider thread
    t = Test(que)       # create the tester thread
    s.start()
    t.start()

And here is what the program looks like when it runs:
[Screenshot of the running program]
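
The tester thread in this post loops forever, so the program never exits on its own. A common way to shut down a producer/consumer pair cleanly is to send a sentinel value through the queue once the producer is done. The sketch below shows the pattern in isolation with stand-in data and no network access; Producer, Consumer, and SENTINEL are hypothetical names, not classes from this post.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical "no more work" marker


class Producer(threading.Thread):
    def __init__(self, que, items):
        super().__init__()
        self.que = que
        self.items = items

    def run(self):
        for item in self.items:
            self.que.put(item)
        # Tell the consumer that nothing else is coming
        self.que.put(SENTINEL)


class Consumer(threading.Thread):
    def __init__(self, que):
        super().__init__()
        self.que = que
        self.seen = []

    def run(self):
        while True:
            item = self.que.get()  # blocks until an item is available
            if item is SENTINEL:
                break
            self.seen.append(item)


items = [["HTTP", "1.2.3.4", "8080"], ["HTTP", "5.6.7.8", "3128"]]
que = Queue()
p = Producer(que, items)
c = Consumer(que)
p.start()
c.start()
p.join()
c.join()
print(c.seen)
```

With several consumers, the producer would need to put one sentinel per consumer so that each of them sees its own stop signal.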
