Python Concurrency Notes 3

These are notes I took while following the 蚂蚁学Python video series, to make my own later review easier.

Video link: BV1bK411A7tV

The instructor's source code

These notes cover P5 of the video.

P5 - Implementing a producer-consumer crawler in Python

1. The multi-component Pipeline architecture

A complex job is rarely done in one shot; it is usually broken into a series of intermediate steps that are completed one by one.

An architecture in which modules cooperate to process data is called a Pipeline, and each intermediate module (processor) in it is called a Processor.

The producer hands its results to the consumer through intermediate data, and the consumer consumes them.

The producer takes the input data as its raw material, and the consumer's output is the final output of the pipeline.

(Figure: multi-component Pipeline architecture)

2. Architecture of the producer-consumer crawler

(Figure: producer-consumer crawler architecture)

3. queue.Queue for multi-thread data communication

queue.Queue provides thread-safe data communication between threads, so you don't have to add any locking of your own around it.

3.1 Import the library

import queue

3.2 Create a Queue

q = queue.Queue()
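
By default the queue is unbounded; passing maxsize gives a bounded queue, which is a standard way to throttle a producer that outruns its consumers:

# bounded queue: put() blocks once 100 items are waiting
q = queue.Queue(maxsize=100)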

3.3 Add an element

q.put(item)

3.4 Get an element

item = q.get()
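
Note that q.get() blocks until an item is available, and q.put(item) blocks while a bounded queue is full; both accept block and timeout arguments:

# raises queue.Empty if nothing arrives within 5 seconds
item = q.get(timeout=5)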

3.5 Query the status

# how many elements are in the queue
q.qsize()

# is the queue empty?
q.empty()

# is the queue full?
q.full()

Note that in a multithreaded program these answers are only snapshots: another thread may change the queue right after you check.
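
To see these pieces working together, here is a minimal self-contained sketch (my own, not from the video) in which one thread produces numbers and another consumes them:

import queue
import threading

q = queue.Queue()

def producer():
    for i in range(5):
        q.put(i)  # thread-safe: no explicit lock needed

def consumer():
    for _ in range(5):
        item = q.get()  # blocks until an item is available
        print("got", item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()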

4. Implementing the producer-consumer crawler in code

Install the bs4 package in advance:

pip3 install beautifulsoup4

# blog_spider.py

import requests
from bs4 import BeautifulSoup

# Note: don't use the instructor's original list below (the site seems to have
# added anti-crawling for that form of url), or every page you crawl will only
# ever return page 1's data.
# urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 51)]
# Use this list instead:
urls = [
    f"https://www.cnblogs.com/sitehome/p/{page}"
    for page in range(1, 50 + 1)
]

# Producer: its output is the raw HTML of a page
def craw(url):
    r = requests.get(url)
    # return the page's HTML
    return r.text


# Consumer: parses the HTML; each link's href is the post url and get_text() is its title
def parse(html):
    # the target links carry class="post-item-title"
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]


if __name__ == '__main__':
    for result in parse(craw(urls[3])):
        print(result)
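
With blog_spider.py in place (its __main__ block above is just a quick single-page test), the driver script below wires the producers and consumers together through two queues; I keep it in a separate file next to blog_spider.py:
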
import queue
import blog_spider
import time
import random
import threading


# The ':' annotations declare the parameter types, which enables IDE method hints
def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    while True:
        # take one url from the queue
        url = url_queue.get()
        # fetch that url and get back its HTML
        html = blog_spider.craw(url)
        # put the result into html_queue
        html_queue.put(html)

        # log progress
        print(threading.current_thread().name, f"craw{url}",
              "url_queue_size=", url_queue.qsize())

        # random sleep so we don't get IP-banned for requesting too fast
        time.sleep(random.randint(1, 2))


# Write the results to a file; the open file object is passed in as the fout parameter
def do_parse(html_queue: queue.Queue, fout):
    while True:
        # get the html the producer put into html_queue
        html = html_queue.get()
        # parse the HTML
        results = blog_spider.parse(html)
        # write each (link, title) tuple in results into the fout file
        for result in results:
            fout.write(str(result) + "\n")

        # log progress
        print(threading.current_thread().name, "results_size=", len(results),
              "html_queue_size=", html_queue.qsize())

        # random sleep to pace the consumer (the parser itself makes no requests)
        time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    # create the two Queue objects
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    # put each page's url from blog_spider.urls into url_queue
    for url in blog_spider.urls:
        url_queue.put(url)

    # 3 producer threads:
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f"craw{idx}")
        t.start()

    # open the file object that will store the results
    # note: specify the encoding here as well
    fout = open("02_data.txt", "w", encoding='utf-8')
    # 2 consumer threads:
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
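
Two notes of my own on this script (not from the video):

First, both parse threads share the same fout object. In CPython a single small write() call is effectively serialized by the GIL, but to make that explicit you could guard the writes with a lock:

write_lock = threading.Lock()

def do_parse(html_queue: queue.Queue, fout):
    while True:
        html = html_queue.get()
        results = blog_spider.parse(html)
        with write_lock:  # serialize file writes across the parse threads
            for result in results:
                fout.write(str(result) + "\n")

Second, as written the craw/parse threads loop forever, so the process never exits on its own even after both queues drain (I stop it manually). A common variant is to push a sentinel value ("poison pill") through the queue so each consumer knows when to stop:

_SENTINEL = None

def do_parse(html_queue: queue.Queue, fout):
    while True:
        html = html_queue.get()
        if html is _SENTINEL:  # poison pill: no more work is coming
            break
        for result in blog_spider.parse(html):
            fout.write(str(result) + "\n")

# after all producers finish, push one sentinel per consumer thread:
# for _ in range(2):
#     html_queue.put(_SENTINEL)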

5. Analyzing the log output

craw0, craw1 and craw2 are the 3 producer threads.

parse0 and parse1 are the 2 consumer threads.

url_queue holds the input data (the raw material): the producers steadily drain it, putting each produced html into html_queue.

So url_queue_size shrinks steadily, while html_queue_size fluctuates, because the html being added is continuously taken (get) by the consumers.

Later on, because there are more producer threads than consumer threads, the consumers cannot keep up and html_queue_size gradually grows. Once the producers have drained url_queue completely (url_queue_size reaches 0), html_queue_size stops growing and only shrinks as the consumers work through it, down to 0. (The run below was captured with the original #p{page} url list, which is why those urls appear in the log.)

craw1 crawhttps://www.cnblogs.com/#p2 url_queue_size= 47
craw2 crawhttps://www.cnblogs.com/#p3 url_queue_size= 47
craw0 crawhttps://www.cnblogs.com/#p1 url_queue_size= 47
parse0 results_size= 20 html_queue_size= 1
parse1 results_size= 20 html_queue_size= 1
parse0 results_size= 20 html_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p4 url_queue_size= 46
parse1 results_size= 20 html_queue_size= 0
craw0 crawhttps://www.cnblogs.com/#p6 url_queue_size= 44
craw1 crawhttps://www.cnblogs.com/#p5 url_queue_size= 44
parse0 results_size= 20 html_queue_size= 1
parse1 results_size= 20 html_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p7 url_queue_size= 43
parse0 results_size= 20 html_queue_size= 0
craw0 crawhttps://www.cnblogs.com/#p8 url_queue_size= 41
parse1 results_size= 20 html_queue_size= 0
craw1 crawhttps://www.cnblogs.com/#p9 url_queue_size= 40
craw2 crawhttps://www.cnblogs.com/#p10 url_queue_size= 40
parse0 results_size= 20 html_queue_size= 1
craw1 crawhttps://www.cnblogs.com/#p11 url_queue_size= 38
craw2 crawhttps://www.cnblogs.com/#p12 url_queue_size= 38
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw0 crawhttps://www.cnblogs.com/#p13 url_queue_size= 37
craw2 crawhttps://www.cnblogs.com/#p14 url_queue_size= 36
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw1 crawhttps://www.cnblogs.com/#p15 url_queue_size= 35
craw0 crawhttps://www.cnblogs.com/#p16 url_queue_size= 34
craw2 crawhttps://www.cnblogs.com/#p17 url_queue_size= 33
parse0 results_size= 20 html_queue_size= 3
parse1 results_size= 20 html_queue_size= 2
craw1 crawhttps://www.cnblogs.com/#p18 url_queue_size= 32
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
craw0 crawhttps://www.cnblogs.com/#p19 url_queue_size= 31
craw1 crawhttps://www.cnblogs.com/#p20 url_queue_size= 29
craw2 crawhttps://www.cnblogs.com/#p21 url_queue_size= 29
craw1 crawhttps://www.cnblogs.com/#p22 url_queue_size= 27
craw2 crawhttps://www.cnblogs.com/#p23 url_queue_size= 27
parse0 results_size= 20 html_queue_size= 5
parse1 results_size= 20 html_queue_size= 4
craw0 crawhttps://www.cnblogs.com/#p24 url_queue_size= 26
parse0 results_size= 20 html_queue_size= 4
parse1 results_size= 20 html_queue_size= 3
craw0 crawhttps://www.cnblogs.com/#p25 url_queue_size= 25
craw1 crawhttps://www.cnblogs.com/#p26 url_queue_size= 23
craw2 crawhttps://www.cnblogs.com/#p27 url_queue_size= 23
parse0 results_size= 20 html_queue_size= 5
craw0 crawhttps://www.cnblogs.com/#p28 url_queue_size= 22
parse1 results_size= 20 html_queue_size= 5
craw2 crawhttps://www.cnblogs.com/#p29 url_queue_size= 21
parse0 results_size= 20 html_queue_size= 5
craw1 crawhttps://www.cnblogs.com/#p30 url_queue_size= 20
craw2 crawhttps://www.cnblogs.com/#p31 url_queue_size= 19
craw0 crawhttps://www.cnblogs.com/#p32 url_queue_size= 18
parse1 results_size= 20 html_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p33 url_queue_size= 17
parse0 results_size= 20 html_queue_size= 6
parse1 results_size= 20 html_queue_size= 6
craw0 crawhttps://www.cnblogs.com/#p34 url_queue_size= 15
craw1 crawhttps://www.cnblogs.com/#p35 url_queue_size= 15
parse1 results_size= 20 html_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p36 url_queue_size= 14
parse0 results_size= 20 html_queue_size= 7
craw0 crawhttps://www.cnblogs.com/#p37 url_queue_size= 12
craw1 crawhttps://www.cnblogs.com/#p38 url_queue_size= 12
parse0 results_size= 20 html_queue_size= 8
parse1 results_size= 20 html_queue_size= 7
craw1 crawhttps://www.cnblogs.com/#p39 url_queue_size= 10
craw2 crawhttps://www.cnblogs.com/#p40 url_queue_size= 10
parse0 results_size= 20 html_queue_size= 8
parse1 results_size= 20 html_queue_size= 7
craw0 crawhttps://www.cnblogs.com/#p41 url_queue_size= 9
parse1 results_size= 20 html_queue_size= 7
craw1 crawhttps://www.cnblogs.com/#p42 url_queue_size= 7
craw2 crawhttps://www.cnblogs.com/#p43 url_queue_size= 7
parse0 results_size= 20 html_queue_size= 8
craw0 crawhttps://www.cnblogs.com/#p44 url_queue_size= 6
parse1 results_size= 20 html_queue_size= 8
craw0 crawhttps://www.cnblogs.com/#p45 url_queue_size= 4
craw1 crawhttps://www.cnblogs.com/#p46 url_queue_size= 3
craw2 crawhttps://www.cnblogs.com/#p47 url_queue_size= 3
parse0 results_size= 20 html_queue_size= 9
parse1 results_size= 20 html_queue_size= 9
craw0 crawhttps://www.cnblogs.com/#p48 url_queue_size= 2
parse0 results_size= 20 html_queue_size= 9
craw1 crawhttps://www.cnblogs.com/#p49 url_queue_size= 0
craw2 crawhttps://www.cnblogs.com/#p50 url_queue_size= 0
parse1 results_size= 20 html_queue_size= 10
parse0 results_size= 20 html_queue_size= 9
parse1 results_size= 20 html_queue_size= 8
parse0 results_size= 20 html_queue_size= 7
parse1 results_size= 20 html_queue_size= 6
parse1 results_size= 20 html_queue_size= 5
parse0 results_size= 20 html_queue_size= 4
parse1 results_size= 20 html_queue_size= 3
parse0 results_size= 20 html_queue_size= 2
parse1 results_size= 20 html_queue_size= 1
parse0 results_size= 20 html_queue_size= 0