11 - Controlling the Number of Threads

# JFZero

Last modified: 2022-02-21 19:10:58

Category: python高级 (Advanced Python). Tags: crawler, python, java

First published: 2022-02-21 18:54:33

Article link: https://blog.csdn.net/weixin_50348308/article/details/123053400

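A Semaphore is a lock that limits how many threads may enter a section of code at the same time.

Typical uses. Files: writing is usually done by a single thread, but reading can be shared by several threads at once, so a Semaphore can cap the number of threads reading a file. Crawlers: a Semaphore caps how many requests are in flight at the same moment, which helps avoid being throttled or blocked by the target site.

Internally, Semaphore is built on the logic of Condition, and Queue is built on Condition as well.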
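
To make the remark about Condition more concrete, here is a minimal sketch of a counting semaphore built on threading.Condition. The class name SimpleSemaphore is made up for this illustration, and the code is a simplification, not a copy of the standard library's implementation (it has no timeout or non-blocking mode).

```python
import threading


class SimpleSemaphore:
    """Simplified counting semaphore built on threading.Condition (illustration only)."""

    def __init__(self, value=1):
        if value < 0:
            raise ValueError("initial value must be >= 0")
        self._cond = threading.Condition(threading.Lock())
        self._value = value

    def acquire(self):
        with self._cond:
            # Sleep until at least one permit is available.
            while self._value == 0:
                self._cond.wait()
            self._value -= 1

    def release(self):
        with self._cond:
            self._value += 1
            # Wake one waiting thread so it can take the permit.
            self._cond.notify()
```

With SimpleSemaphore(3), the fourth call to acquire() blocks until some other thread calls release(), which is the behaviour the crawler example relies on.
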
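Steps for using a Semaphore:
1. Make the spider class inherit from threading.Thread and override __init__ to add a sem attribute.
2. Override run() to crawl the page.
3. Create a threading.Semaphore object and set its count.
4. Pass the Semaphore object in when creating each spider object.
5. Before starting each thread, call acquire() on the Semaphore object.
6. When each thread finishes crawling, call release() on the Semaphore object.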

import threading
import time


class HtmlSpider(threading.Thread):
    def __init__(self, url, sem):
        super().__init__()
        self.url = url
        self.sem = sem

    def run(self):
        time.sleep(2)  # simulate downloading the page
        print("got html text success")
        # Each finished crawl calls release() once, which adds 1 back to the Semaphore's counter
        self.sem.release()


class UrlProducer(threading.Thread):
    def __init__(self, sem):
        super().__init__()
        self.sem = sem

    def run(self):
        for i in range(20):
            # Each acquire() decrements the counter by 1; at 0 it blocks until a spider calls release()
            self.sem.acquire()
            html_thread = HtmlSpider(f"https://baidu.com/{i}", self.sem)
            html_thread.start()


if __name__ == "__main__":
    # Semaphore that allows at most 3 spiders to run at the same time
    sem = threading.Semaphore(3)
    url_producer = UrlProducer(sem)
    url_producer.start()
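
As a variation on the example above, the worker itself can hold the Semaphore with a with statement, so release() cannot be forgotten even if the crawl raises. This is only a sketch: fetch, crawl, stats and MAX_CONCURRENT are names invented for the illustration, the download is simulated with a short sleep, and the peak counter exists only to show that the limit holds.

```python
import threading
import time

MAX_CONCURRENT = 3  # assumed limit, matching Semaphore(3) above

# BoundedSemaphore raises ValueError if release() is called too many times.
sem = threading.BoundedSemaphore(MAX_CONCURRENT)
stats_lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def fetch(url):
    """Hypothetical stand-in for a download: sleeps instead of doing network I/O."""
    with sem:  # acquire() on entry, release() on exit, even if an exception is raised
        with stats_lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.05)
        with stats_lock:
            stats["active"] -= 1


def crawl(urls):
    """Start one thread per URL and return the highest concurrency observed."""
    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["peak"]
```

Unlike the producer-based version, all threads are created up front here and only the guarded section is limited, so this form suits a modest number of URLs.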

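Condition is used for complex synchronization between threads; it is the most complex of the synchronization locks.

Condition implements __enter__ and __exit__, so it can be used in a with statement.
Condition uses a Lock or an RLock internally (an RLock by default).
Its __enter__ calls acquire() and its __exit__ calls release().
The wait() method blocks until another thread sends a notification on the condition variable. It releases the lock while waiting and re-acquires it before returning, so it must be called while holding the lock.
After a thread switch, each thread still has its own isolated execution state, including the position of the bytecode instruction it was running, so when it is scheduled again it continues from exactly that position. That is the essence of multithreading.
Generators rely on a similar idea: a generator's frame remembers where it stopped, which is why yield can resume from the point where it paused. Threads themselves, however, are scheduled by the operating system rather than implemented with yield.
Start order matters: start the thread that begins with wait() first. A notify() sent before the other thread is waiting is simply lost.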

import threading


class XiaoAi(threading.Thread):
    def __init__(self, condition):
        super().__init__(name="XiaoAi")
        self.condition = condition

    def run(self):
        with self.condition:
            self.condition.wait()
            print(f"{self.name}: I'm here")
            self.condition.notify()

            self.condition.wait()
            print(f"{self.name}: Sure")
            self.condition.notify()

            self.condition.wait()
            print(f"{self.name}: I live at the tail of the Yangtze")
            self.condition.notify()


class TianMao(threading.Thread):
    def __init__(self, condition):
        super().__init__(name="Tmall Genie")
        self.condition = condition

    def run(self):
        with self.condition:
            print(f"{self.name}: Hey XiaoAi")
            self.condition.notify()
            self.condition.wait()

            print(f"{self.name}: Let's trade lines of classical poetry")
            self.condition.notify()
            self.condition.wait()

            print(f"{self.name}: You live at the head of the Yangtze")
            self.condition.notify()
            self.condition.wait()


if __name__ == "__main__":
    condition = threading.Condition()
    xiaoai = XiaoAi(condition)
    tianmao = TianMao(condition)  # both threads must share the same Condition
    # Start the thread that waits first, otherwise the first notify() is lost
    xiaoai.start()
    tianmao.start()
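
The dependence on start order can be removed by pairing the Condition with a piece of shared state and waiting on a predicate instead of on a bare notification. The following is a minimal sketch of that idea; the Mailbox class and its method names are invented for this illustration and are not part of the example above.

```python
import threading


class Mailbox:
    """One-slot handoff between threads; the names here are illustrative only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._message = None
        self._has_message = False

    def put(self, message):
        with self._cond:
            self._message = message
            self._has_message = True
            self._cond.notify()

    def get(self, timeout=None):
        with self._cond:
            # wait_for() checks the predicate first, so a notify() that
            # happened before this call is not lost.
            if not self._cond.wait_for(lambda: self._has_message, timeout=timeout):
                raise TimeoutError("no message arrived in time")
            self._has_message = False
            return self._message
```

Because get() looks at _has_message before sleeping, it does not matter whether put() or get() runs first.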

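ThreadPoolExecutor, from the concurrent.futures package, manages the threads for us: the thread pool schedules the URLs by itself.
A thread pool does more than limit concurrency. It also lets us query the state of a task and get its return value.
The main thread can react as soon as a task finishes.
futures gives multithreaded and multiprocess code the same interface.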
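
As a small sketch of what the shared interface means in practice, the helper below is written only against the Executor base class. The names square and run_all are invented for this illustration, and only a thread pool is exercised here.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def square(n):
    return n * n


def run_all(executor: Executor, values):
    """Submit every value and collect the results in input order."""
    tasks = [executor.submit(square, v) for v in values]
    return [task.result() for task in tasks]


with ThreadPoolExecutor(max_workers=2) as pool:
    print(run_all(pool, [1, 2, 3]))  # prints [1, 4, 9]
```

Passing a ProcessPoolExecutor to run_all should work the same way, provided the function and its arguments can be pickled and the script creates the pool under an if __name__ == "__main__": guard.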

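import time
from concurrent.futures import ThreadPoolExecutor

def get_html(times):
    time.sleep(times)
    print(f"get page {times} success")
    return times

executor = ThreadPoolExecutor(max_workers=2)
# Use submit() to hand the function to the thread pool
# task1 and task2 are Future objects; submit() returns immediately without blocking
task1 = executor.submit(get_html, 3)
task2 = executor.submit(get_html, 2)
# done() reports whether a task (here task1) has finished
print(task1.done())
# result() blocks until the task has finished, then returns its return value
print(task1.result())
# cancel() tries to cancel a task: it returns True on success and False otherwise
# Only tasks that have not started yet can be cancelled; running or finished tasks cannot
print(task2.cancel())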

Getting the results of tasks that have already finished

from concurrent.futures import as_completed

urls = [3, 2, 4]

# Note: the tasks run asynchronously in the pool's worker threads
# Method 1: as_completed() is a generator that yields each task object (a Future)
# as it finishes, in order of completion; iterate over it with a for loop
all_task = [executor.submit(get_html, url) for url in urls]
for future in as_completed(all_task):
    data = future.result()
    print(f"get {data} page success")
# Method 2: executor.map() yields results in the same order as the urls list,
# and it yields the data (future.result()) directly rather than the Future
for data in executor.map(get_html, urls):
    print(f"get {data} page")
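
A common companion to as_completed() is a dict that maps each Future back to the input that produced it, so that results arriving out of order can still be attributed. The sketch below assumes the same get_html stub as above, with shorter delays; compare_orders and future_to_delay are names made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def get_html(seconds):
    time.sleep(seconds)
    return seconds


def compare_orders(delays):
    """Return (inputs in completion order, results in map() order)."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        # Map each Future back to the input it was submitted with.
        future_to_delay = {executor.submit(get_html, d): d for d in delays}
        completion_order = [future_to_delay[f] for f in as_completed(future_to_delay)]
        # map() yields results in the order of its inputs.
        map_order = list(executor.map(get_html, delays))
    return completion_order, map_order
```

Calling compare_orders([0.3, 0.1, 0.2]) returns the map() results in input order, while the completion order typically starts with the shortest delay.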

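A Future ("future object") is best thought of as a container for the eventual result of a task. Multithreading, multiprocessing and coroutines are all designed around this idea, so it is important to understand.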
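
To show the "container" idea in isolation, the sketch below creates a Future by hand and fills it in. In real code an executor creates the Future and sets its result; doing it manually is for illustration only, and the variable names are arbitrary.

```python
from concurrent.futures import Future

received = []

future = Future()  # normally created by an executor, not by hand
# The callback runs once the Future has a result.
future.add_done_callback(lambda f: received.append(f.result()))

before = future.done()             # False: the container is still empty
future.set_result("page content")  # fill the container; callbacks fire here
after = future.done()              # True: result() now returns immediately

# A Future can also carry an exception instead of a value.
failed = Future()
failed.set_exception(ValueError("download failed"))
error = failed.exception()         # the stored ValueError; result() would raise it
```

The caller only ever sees done(), result() and exception(), regardless of whether the work ran in a thread, a process or somewhere else.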

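wait(all_task) by default blocks the main thread until all tasks have finished, but this can be changed with the return_when parameter:
Option 1: FIRST_COMPLETED = 'FIRST_COMPLETED'
Option 2: FIRST_EXCEPTION = 'FIRST_EXCEPTION'
Option 3: ALL_COMPLETED = 'ALL_COMPLETED' (the default)
Note: the module source also defines _AS_COMPLETED = '_AS_COMPLETED', but it is a private constant and not one of the documented return_when values.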

from concurrent.futures import ThreadPoolExecutor, as_completed, wait, FIRST_COMPLETED
from concurrent.futures import Future
urls = [3, 2, 4]
all_task = [executor.submit(get_html, url) for url in urls]
# Return to the main thread as soon as the first task has finished
wait(all_task, return_when=FIRST_COMPLETED)
print("back in the main thread")
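
wait() also returns two sets, the futures that are done and those that are not, which the snippet above discards. The following sketch shows how they might be used; it redefines a get_html stub with shorter delays so that it runs on its own, and first_then_all is a name invented for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED


def get_html(seconds):
    time.sleep(seconds)
    return seconds


def first_then_all(delays):
    """Wait for the first finished task, then for all of them."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        tasks = [executor.submit(get_html, d) for d in delays]

        # Returns as soon as at least one task has finished.
        done, not_done = wait(tasks, return_when=FIRST_COMPLETED)
        first_results = sorted(f.result() for f in done)
        pending_count = len(not_done)

        # Now block until every task has finished.
        done_all, not_done_all = wait(tasks, return_when=ALL_COMPLETED)
        return first_results, pending_count, len(done_all), len(not_done_all)
```

With delays such as [0.05, 0.5, 0.5], the first call returns once the shortest task is done, and after the second call the not-done set is empty.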
