Anti-crawling: fetching proxies with Python multiprocessing, queueing them, and crawling through them

宽客Z

Categories: Python crawlers, anti-crawling. Tags: python, proxy, queue

Article link: https://blog.csdn.net/weixin_42052331/article/details/106969130

Foreword

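As we all know, the proxies listed on free proxy sites are of poor quality, mainly because they are short-lived: proxy A may work one second and be dead the next. So if too much time passes between scraping a proxy pool and using those proxies to visit the target site, many of them will probably have stopped working. That gave me an idea: use multiple processes to fetch proxies and crawl with them at the same time, so that each proxy is used while it is still alive.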
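
The idea above is a producer/consumer arrangement around a shared queue. Here is a minimal sketch of that pattern on its own, with no networking involved. The names `produce`, `consume` and `run_demo`, the `None` sentinel and the two addresses are assumptions made up for illustration; they are not part of the script further down.

```python
import time
from multiprocessing import Process, Queue


def produce(q, items, delay=0.0):
    """Put each item on the queue as soon as it is 'found', then a sentinel."""
    for item in items:
        time.sleep(delay)  # stands in for the time spent scraping and checking a proxy
        q.put(item)
    q.put(None)  # sentinel: nothing more will arrive


def consume(q, handle):
    """Take items off the queue one by one until the sentinel arrives."""
    results = []
    while True:
        item = q.get()  # blocks until the producer has put something
        if item is None:
            break
        results.append(handle(item))
    return results


def run_demo():
    """Produce in a child process while the parent consumes."""
    q = Queue()
    fake_proxies = ['10.0.0.1:8080', '10.0.0.2:3128']  # made-up addresses
    producer = Process(target=produce, args=(q, fake_proxies, 0.5))
    producer.start()
    used = consume(q, lambda p: 'used ' + p)  # runs while the child is still producing
    producer.join()
    return used
```

Call `run_demo()` from under an `if __name__ == '__main__':` guard. Each item is handled as soon as the producer has put it on the queue, rather than after the whole list has been collected, which is the point of the approach.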

Main libraries and concepts used

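requests, to send the HTTP requests
fake_useragent, to fake a browser User-Agent when visiting the proxy site, so that the site is less likely to ban my IP
telnetlib, to test whether a proxy is usable
Process from multiprocessing, so that one process scrapes proxies while another crawls with them
Queue from multiprocessing, to pass information between the processes
lxml, to parse the proxy list page with XPath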
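
The telnetlib check in the list above only tests whether a TCP connection to the proxy's port can be opened. Since telnetlib was deprecated in Python 3.11 and removed in Python 3.13, the same check can be sketched with the standard `socket` module. The helper names `split_proxy` and `port_is_open` are hypothetical and do not appear in the original script.

```python
import socket


def split_proxy(proxy):
    """Split 'host:port' into (host, int(port)); raise ValueError if malformed."""
    host, sep, port = proxy.strip().rpartition(':')
    if not sep or not host:
        raise ValueError('expected host:port, got %r' % proxy)
    return host, int(port)


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection opens the socket; the with-block closes it again
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name resolution failed
        return False
```

An open port only shows that something is listening there. It does not prove that the proxy can actually relay a request, which is why a second filter against the real target page is still needed.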

Code

The detailed approach is described in the comments.

# encoding=utf-8
import requests
from fake_useragent import UserAgent
from lxml import etree
import telnetlib
import time
from multiprocessing import Queue, Process
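'''
What get_proxy does
1. In a loop, take the proxies listed on the nimadaili site and apply a first filter that keeps the ones accepting connections
2. Put the proxies that pass the first filter into queue q_0
3. Stop looping once queue q_1 is no longer empty

What get_page does
1. Take a proxy from queue q_0
2. Request the target page through that proxy as a second filter; a proxy that cannot reach the target page is discarded
3. On success, put 0 into queue q_1, leave the loop and return the page source; on failure, go back to step 1

'''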
def get_proxy(q_0, q_1):
    n = 0
    ua = UserAgent()
    url = 'http://www.nimadaili.com/'
    while q_1.empty():  # q_1 is still empty, so get_page has not reached the target page yet: keep looping
        n += 1
        print('Scrape round {0}'.format(n))
        headers = {'User-Agent': ua.random}  # fake a random browser
        proxy = []
        try:
            # headers is passed by keyword; the second positional argument of requests.get is params
            res = requests.get(url, headers=headers, timeout=10)
            if res.status_code == 200:
                html = etree.HTML(res.text)
                proxy = html.xpath('//*[@id="overflow"]/table/tbody/tr/td[1]/node()')
                proxy = proxy[:30]
            else:
                print('Failed to load the proxy page!')
        except requests.RequestException:
            print('Failed to load the proxy page!')
        for pr in proxy[:-1]:
            try:
                ip, port = str(pr).split(':')
                telnetlib.Telnet(ip, int(port), timeout=2).close()  # test whether the proxy port accepts connections
                print(str(pr), 'OK!')
                q_0.put(str(pr))  # proxies that pass the first filter go into q_0
            except Exception as e:
                print(str(pr), 'ERROR!')
        print('WAIT...')
        time.sleep(6)

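def get_page(q_0, q_1, url):
    while True:
        pro = str(q_0.get())  # take a first-pass proxy from q_0 (blocks until one arrives)
        print('Get %s from queue.' % pro)
        # Assuming plain HTTP proxies: the proxy itself is reached over http:// for both schemes
        proxies = {
            'http': 'http://' + pro,
            'https': 'http://' + pro,
        }
        try:
            response = requests.get(url, proxies=proxies, timeout=10)  # request the target page through the proxy
            break
        except Exception as e:
            continue  # second filter: this proxy cannot reach the target page, try the next one
    response.encoding = 'utf-8'
    pageContent = response.text
    print('pageContent:', pageContent)
    q_1.put(0)  # signals get_proxy to stop
    return pageContent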

if __name__=='__main__':
    q_0 = Queue()
    q_1 = Queue()
    url = 'https://www.baidu.com'
    pw = Process(target=get_proxy, args=(q_0, q_1))
    pr = Process(target=get_page, args=(q_0, q_1, url))
    pw.start()
    pr.start()
    pw.join()
    pr.join()
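
In the script above, `q_1` serves only as a stop flag, and the value returned by `get_page` is lost because it runs in a separate process. One possible variation, sketched below under stated assumptions, uses a `multiprocessing.Event` as the flag and a second queue to hand the page back to the parent. The names `feed_proxies`, `fetch_first` and `run` are hypothetical, and `find_proxies` and `fetch` are placeholder callables standing in for the scraping and requesting logic; none of this is the author's code.

```python
import queue
from multiprocessing import Event, Process, Queue


def feed_proxies(proxy_q, done, find_proxies, pause=6.0):
    """Keep putting candidate proxies on proxy_q until done is set."""
    while not done.is_set():
        for proxy in find_proxies():
            if done.is_set():
                return
            proxy_q.put(proxy)
        done.wait(pause)  # sleeps, but wakes up early once done is set


def fetch_first(proxy_q, result_q, done, fetch):
    """Try proxies in arrival order and publish the first successful page."""
    while True:
        proxy = proxy_q.get()
        try:
            page = fetch(proxy)
        except Exception:
            continue  # this proxy failed the second filter, try the next one
        result_q.put((proxy, page))
        done.set()
        return


def run(find_proxies, fetch):
    """Wire both workers together and return (proxy, page) to the caller."""
    proxy_q, result_q, done = Queue(), Queue(), Event()
    feeder = Process(target=feed_proxies, args=(proxy_q, done, find_proxies))
    fetcher = Process(target=fetch_first, args=(proxy_q, result_q, done, fetch))
    feeder.start()
    fetcher.start()
    proxy, page = result_q.get()  # read the result before joining the fetcher
    fetcher.join()
    while feeder.is_alive():  # drain unused proxies so the feeder can flush and exit
        try:
            proxy_q.get(timeout=0.1)
        except queue.Empty:
            pass
    feeder.join()
    return proxy, page
```

`run` should be called from under an `if __name__ == '__main__':` guard, and the two callables passed to it should be functions defined at module level so that they can be sent to the child processes. The result is read before joining because a process that still has data buffered for a queue may not exit until that data has been consumed.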

Afterword

If you have any questions, feel free to raise them and we can discuss them together!
