Concurrent Programming in Python

redfish95

Article link: https://blog.csdn.net/redfish95/article/details/129250367

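Introduction to Concurrent Programming in Python

1. Why introduce concurrent programming?

Scenario 1: a web crawler takes one hour to fetch pages sequentially; with concurrent downloads, the time drops to 20 minutes.
Scenario 2: an app takes 3 seconds to open before optimization; with asynchronous concurrency, it opens in 200 milliseconds.

Goal: make the program run faster.

What are the ways to speed up a program?
Single-threaded serial -> multithreaded concurrency -> multi-CPU parallelism -> multi-machine parallelism
Corresponding Python technologies:
Unmodified program -> threading -> multiprocessing -> hadoop/hive/spark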

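Python's support for concurrent programming

Multithreading: threading. It exploits the fact that the CPU and IO can work at the same time, so the CPU does not have to wait for IO to finish.
Multiprocessing: multiprocessing. It uses multi-core CPUs to run tasks truly in parallel.
Asynchronous IO: asyncio. It applies the same CPU/IO overlap within a single thread to run functions asynchronously.
Use Lock to lock a resource and prevent conflicting access.
Use Queue for communication between threads or processes, which enables the producer-consumer pattern.
Use a thread pool or process pool (Pool) to simplify submitting tasks, waiting for them to finish, and collecting results.
Use subprocess to start an external program as a process and interact with its input and output.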
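
The last item in the list, subprocess, is the only one not demonstrated elsewhere in this article, so here is a minimal sketch. It assumes the "external program" is simply another Python interpreter (sys.executable) running a one-line script; the helper name run_child is made up for illustration.

```python
import subprocess
import sys


def run_child(text):
    """Send `text` to a child Python process and return what it prints."""
    # The child reads all of stdin and echoes it back in upper case.
    child_code = "import sys; print(sys.stdin.read().upper())"
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        input=text,            # written to the child's stdin
        capture_output=True,   # collect stdout and stderr
        text=True,             # work with str instead of bytes
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    print(run_child("hello from the parent"))
```

Passing the command as a list avoids shell quoting issues, and check=True turns a failing child into an exception instead of a silent error.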

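How to choose between multiprocessing, multithreading, and coroutines

1. What is CPU-bound computation, and what is IO-bound computation?

CPU-bound (also called compute-bound) means that IO completes in a very short time while the CPU does a large amount of computation and processing. Its hallmark is very high CPU utilization.
Examples: compression and decompression, regular expressions, encryption and decryption
IO-bound means that for most of its running time the CPU is waiting on IO (disk or network) reads and writes, so CPU utilization stays low.
Examples: file-processing programs, web crawlers, programs that read and write databases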
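
One rough way to tell the two kinds of work apart is to compare CPU time with wall-clock time. The sketch below is only an illustration: cpu_task and io_task are hypothetical stand-ins (a plain arithmetic loop and a time.sleep call), not real workloads.

```python
import time


def cpu_task(n=2_000_000):
    """CPU-bound stand-in: pure arithmetic, no waiting."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_task(delay=0.2):
    """IO-bound stand-in: time.sleep models waiting on disk or network."""
    time.sleep(delay)


def cpu_fraction(func):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


if __name__ == "__main__":
    print(f"cpu_task fraction: {cpu_fraction(cpu_task):.2f}")
    print(f"io_task fraction:  {cpu_fraction(io_task):.2f}")
```

A fraction near 1 suggests the CPU was busy the whole time; a fraction near 0 suggests the task mostly waited.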

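2. Comparing multiprocessing, multithreading, and coroutines

A process can start N threads, and a thread can start N coroutines, so the three are nested in that order.

Multiprocessing (Process)

Pros: can use multiple CPU cores for parallel computation
Cons: uses the most resources; fewer processes can be started than threads
Suited to CPU-bound computation

Multithreading (Thread)

Pros: lighter weight than processes and uses fewer resources
Cons compared with processes: threads can only run concurrently and cannot use multiple CPUs
Cons compared with coroutines: the number that can be started is limited, they take memory, and there is thread-switching overhead
Suited to IO-bound computation where the number of simultaneous tasks is modest

Coroutines (Coroutine)

Pros: lowest memory overhead; the largest number can be started
Cons: limited library support; more complex to implement
Suited to IO-bound work with a very large number of tasks, when supporting libraries exist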

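3. How to choose a technique for a given task

If the task is CPU-bound, choose multiprocessing.
If it is IO-bound, check three things: Is the number of tasks very large? Is there an existing coroutine library to rely on? Is the implementation complexity acceptable? If all three hold, choose coroutines; otherwise choose multithreading.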
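
The decision rule above can be written down as a small function. This is only a sketch of the rule as stated here; the function name and parameter names are invented for illustration, and the returned strings are the names of the standard-library modules discussed in this article.

```python
def choose_concurrency(cpu_bound, many_tasks=False,
                       has_async_library=False, complexity_ok=False):
    """Return the module name suggested by the rule of thumb in the text."""
    if cpu_bound:
        return "multiprocessing"
    # IO-bound: coroutines only pay off when all three conditions hold.
    if many_tasks and has_async_library and complexity_ok:
        return "asyncio"
    return "threading"


if __name__ == "__main__":
    print(choose_concurrency(cpu_bound=True))
    print(choose_concurrency(cpu_bound=False, many_tasks=True,
                             has_async_library=True, complexity_ok=True))
    print(choose_concurrency(cpu_bound=False, many_tasks=True))
```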

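What is the GIL?

Global Interpreter Lock (GIL)
The GIL is a mechanism that a programming-language interpreter uses to synchronize threads so that only one thread executes at any moment. Even on a multi-core processor, an interpreter that uses a GIL allows only one thread to run at a time. The standard CPython interpreter uses a GIL by default.

How can the limits imposed by the GIL be worked around?
1. Multithreading with threading is still useful for IO-bound work.
During IO a thread releases the GIL, so CPU work and IO overlap. Multithreading can therefore still speed up IO-bound work substantially. For CPU-bound work, however, it brings no speedup, and the switching overhead makes it slower.
2. Use multiprocessing to compute in parallel and take advantage of multi-core CPUs. Python provides multiprocessing to address the GIL problem.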
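
To see the first point in action without touching the network, the sketch below uses time.sleep as a stand-in for blocking IO (an assumption for illustration; sleeping, like real blocking IO, releases the GIL). The function names are hypothetical.

```python
import threading
import time


def fake_io(delay):
    """Stand-in for a blocking IO call; the GIL is released while sleeping."""
    time.sleep(delay)


def timed_threads(count=5, delay=0.2):
    """Run `count` waiting threads at once and return the wall-clock time."""
    threads = [threading.Thread(target=fake_io, args=(delay,))
               for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = timed_threads()
    print(f"5 threads x 0.2 s of waiting took {elapsed:.2f} s")
```

If the waits were serialized, five threads would need about one second in total; because they overlap, the elapsed time should stay close to a single wait.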

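Using multithreading in Python

Basic pattern

1. Prepare a function

def my_func(a, b):
    # do_craw is a placeholder for the real work (for example, fetching a page)
    do_craw(a, b)

2. Create a thread

import threading
t = threading.Thread(target=my_func, args=(100, 200))

3. Start the thread

t.start()

4. Wait for it to finish

t.join()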
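
The four steps can be put together into something runnable. In this sketch my_func just adds its two arguments (a hypothetical stand-in for the placeholder work above), and a shared list plus a Lock are introduced so that the threads can hand results back to the caller.

```python
import threading


def my_func(a, b, results, lock):
    """Hypothetical stand-in for real work: add two numbers."""
    value = a + b
    with lock:                  # guard the shared list
        results.append(value)


def run_workers(pairs):
    """Start one thread per (a, b) pair, wait for all, return sorted results."""
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=my_func, args=(a, b, results, lock))
               for a, b in pairs]
    for t in threads:           # step 3: start
        t.start()
    for t in threads:           # step 4: wait
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_workers([(100, 200), (1, 2), (10, 20)]))
```

Thread.join() does not return the target's return value, which is why the results travel through a shared container here.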

Web crawler example

import threading
import time
import numpy as np
import requests


urls = [f'https://www.cnblogs.com/#p{page}' for page in range(1, 51)]


def craw(url):
    r = requests.get(url)
    print(url, len(r.text))

class Timer:
    def __init__(self):
        self.times = []
        self.start()
    
    def start(self):
        self.tik = time.time()

    def stop(self):
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        return sum(self.times) / len(self.times)

    def sum(self):
        return sum(self.times)

    def cumsum(self):
        return np.array(self.times).cumsum().tolist()  

def single_thread():
    print('single thread begin')
    for url in urls:
        craw(url)
    print('single thread end')

def multi_thread():
    print('multi thread begin')
    threads = list()
    for url in urls:
        threads.append(
            threading.Thread(target=craw, args=(url,))
        )

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()
    print('multi thread end')


if __name__ == '__main__':
    timer = Timer()
    single_thread()
    print(f'single thread cost: {timer.stop():.5f} sec')
    timer.start()
    multi_thread()
    print(f'multi thread cost: {timer.stop():.5f} sec')

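A producer-consumer multithreaded crawler in Python

The basic Pipeline architecture with multiple components

queue.Queue for data communication between threads
queue.Queue provides thread-safe data communication between threads.
1. Import the library

import queue

2. Create a Queue

q = queue.Queue()

3. Add an element

q.put(item)

4. Get an element

item = q.get()

5. Query the state

# Number of elements in the queue
q.qsize()
# Whether the queue is empty
q.empty()
# Whether the queue is full
q.full()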
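
A short single-threaded walk-through of these calls may help. The sketch assumes a bounded queue (maxsize=2) so that full() has something to report; the variable names are chosen only for illustration.

```python
import queue

q = queue.Queue(maxsize=2)   # bounded queue; the default maxsize=0 is unbounded

q.put("a")
q.put("b")
size_when_full = q.qsize()   # two items are waiting
was_full = q.full()          # the queue has reached maxsize

first = q.get()              # items come out in FIFO order
second = q.get()
is_empty = q.empty()

# get() would block on an empty queue; get_nowait() raises queue.Empty instead.
try:
    q.get_nowait()
    raised_empty = False
except queue.Empty:
    raised_empty = True

print(size_when_full, was_full, first, second, is_empty, raised_empty)
```

When several threads use the queue at once, qsize(), empty(), and full() are only snapshots, so code should not rely on them to decide whether a following get() or put() will block.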

Producer-consumer crawler example

import queue
import time
import random
import threading
import requests
from bs4 import BeautifulSoup

urls = [f'https://www.cnblogs.com/#p{page}' for page in range(1, 51)]


def craw(url):
    r = requests.get(url)
    return r.text

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', class_='post-item-title')
    return [(link['href'], link.get_text()) for link in links]


def do_craw(url_queue:queue.Queue, html_queue:queue.Queue):
    while True:
        url = url_queue.get()
        html = craw(url)
        html_queue.put(html)
        print(threading.current_thread().name, f'craw {url}',
              'url_queue.size=', url_queue.qsize())
        time.sleep(random.randint(1, 2))

def do_parse(html_queue:queue.Queue, fout):
    while True:
        html = html_queue.get()
        results = parse(html)
        for result in results:
            fout.write(str(result) + '\n')
        print(threading.current_thread().name, 'results.size=', len(results),
              'html_queue.size=', html_queue.qsize())
        time.sleep(random.randint(1, 2))


if __name__ == '__main__':
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)

    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue), name=f'craw {idx}')
        t.start()

    fout = open('01.data.txt', 'w')
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout), name=f'parse {idx}')
        t.start()
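
The worker loops in this example run `while True` and never exit, so the program keeps running after the queues are drained. One common way to shut such a pipeline down cleanly is a sentinel value. The sketch below shows that technique on its own, under the assumption that fetching is replaced by a network-free stand-in; STOP, fetch, producer, consumer, and run_pipeline are all names invented for this illustration.

```python
import queue
import threading

STOP = object()   # sentinel telling a worker to exit


def fetch(url):
    """Hypothetical stand-in for craw(): returns fake HTML without network."""
    return f"<html>{url}</html>"


def producer(url_queue, html_queue):
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        html_queue.put(fetch(url))


def consumer(html_queue, results, lock):
    while True:
        html = html_queue.get()
        if html is STOP:
            break
        with lock:
            results.append(len(html))


def run_pipeline(urls, n_producers=3, n_consumers=2):
    url_queue, html_queue = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()
    for url in urls:
        url_queue.put(url)
    for _ in range(n_producers):
        url_queue.put(STOP)          # one sentinel per producer

    producers = [threading.Thread(target=producer, args=(url_queue, html_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(html_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        html_queue.put(STOP)         # producers are done; release the consumers
    for t in consumers:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run_pipeline([f"page-{i}" for i in range(10)])))
```

Because the consumer sentinels are enqueued only after every producer has finished, each page is processed before any consumer sees STOP.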

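Thread safety in Python: problems and solutions

1. The concept of thread safety

Thread safety means that a function or library, when called in a multithreaded environment, handles variables shared among threads correctly so that the program still behaves correctly.
Because execution can switch between threads at any moment, unsynchronized access to shared data can produce unpredictable results; such code is not thread-safe.

2. Using Lock to solve thread-safety problems

Usage 1: try-finally

import threading
lock = threading.Lock()
lock.acquire()
try:
    pass  # do something
finally:
    lock.release()

Usage 2: with

import threading
lock = threading.Lock()
with lock:
    pass  # do something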
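Withdrawal code example

Without the lock in this code, both withdrawals can pass the balance check, and the final balance ends up negative.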

import threading
import time

lock = threading.Lock()

class Account(object):
    def __init__(self, balance):
        self.balance = balance

def draw(account, amount):
    with lock:
        if account.balance >= amount:
            time.sleep(0.1)
            print(threading.current_thread().name, "withdrawal succeeded")
            account.balance -= amount
            print(threading.current_thread().name, "balance:", account.balance)
        else:
            print(threading.current_thread().name, "withdrawal failed: insufficient balance")

if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name='ta', target=draw, args=(account, 800))
    tb = threading.Thread(name='tb', target=draw, args=(account, 800))

    ta.start()
    tb.start()
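
For contrast, the following sketch removes the lock to show the failure mode. It reuses the same numbers as the example (a balance of 1000 and two withdrawals of 800); the names draw_unsafe and run_unsafe are made up. The sleep between the check and the update gives the second thread time to pass the same check.

```python
import threading
import time


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw_unsafe(account, amount):
    """Check-then-act without a lock: both threads can pass the check."""
    if account.balance >= amount:
        time.sleep(0.1)            # widen the window between check and update
        account.balance -= amount


def run_unsafe():
    account = Account(1000)
    threads = [threading.Thread(target=draw_unsafe, args=(account, 800))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance


if __name__ == "__main__":
    print("final balance without lock:", run_unsafe())
```

On a typical run both threads see a balance of 1000, both withdraw, and the account is overdrawn. Wrapping the check and the update in a single `with lock:` block closes that window.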

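Thread pool: ThreadPoolExecutor

1. How a thread pool works

Creating a thread requires the system to allocate resources, and destroying one requires the system to reclaim them. If threads can be reused, the cost of creating and destroying them is avoided.

That is the basic idea of a thread pool: reuse threads to reduce the overhead of creating and terminating them.

Benefits of a thread pool

Performance: a large amount of thread creation and termination overhead is eliminated because threads are reused.
Use cases: suited to bursts of many requests, or to work that needs many threads where each task is short.
Protection: prevents the system from being overloaded and slowed down by creating too many threads.
Code: the syntax is more concise than creating and running threads by hand.

ThreadPoolExecutor syntax

from concurrent.futures import ThreadPoolExecutor, as_completed

Usage 1: map. It is simple; note that the results from map come back in the same order as the inputs.

with ThreadPoolExecutor() as pool:
    results = pool.map(craw, urls)
    for result in results:
        print(result)

Usage 2: futures. This is more powerful; note that with as_completed the results arrive in completion order, which is not fixed.

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(craw, url) for url in urls]
    # Option A: iterate in submission order
    for future in futures:
        print(future.result())
    # Option B: iterate in completion order
    for future in as_completed(futures):
        print(future.result())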
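
The ordering difference between the two usages is easier to see with a task whose duration varies. In this sketch slow_square is a hypothetical task in which larger inputs finish sooner, so completion order is likely to differ from submission order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(n):
    """Hypothetical task: larger n sleeps less, so it tends to finish first."""
    time.sleep(0.05 * (4 - n))
    return n * n


def with_map(numbers):
    # map yields results in input order, however long each task takes
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(slow_square, numbers))


def with_as_completed(numbers):
    # as_completed yields futures as they finish; keep a dict to recover the input
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(slow_square, n): n for n in numbers}
        return [(futures[f], f.result()) for f in as_completed(futures)]


if __name__ == "__main__":
    print(with_map([1, 2, 3]))
    print(with_as_completed([1, 2, 3]))
```

Mapping each future back to its input through a dict is the same trick the crawler rewrite in this article uses to pair each result with its URL.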

Rewriting the crawler with a thread pool

import concurrent.futures
import requests
from bs4 import BeautifulSoup

urls = [f'https://www.cnblogs.com/#p{page}' for page in range(1, 51)]


def craw(url):
    r = requests.get(url)
    return r.text

def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', class_='post-item-title')
    return [(link['href'], link.get_text()) for link in links]

# craw
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(craw, urls)
    htmls = list(zip(urls, htmls))
    for url, html in htmls:
        print(url, len(html))

print('craw over')
# parse
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = dict()
    for url, html in htmls:
        future = pool.submit(parse, html)
        futures[future] = url

    # for future, url in futures.items():
    #     print(url, future.result())

    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())

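Using the multiprocessing module

With multithreading available, why use multiprocessing?

The multiprocessing module was introduced to work around the GIL limitation. It runs work in multiple processes so that it executes in parallel across CPU cores.

Overview of multiprocessing

[Figure: multiprocessing overview; image not included in this text]

Comparing prime-number computation code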

import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

PRIMES = [112272535095293] * 100

class Timer:
    def __init__(self):
        self.times = []
        self.start()
    
    def start(self):
        self.tik = time.time()

    def stop(self):
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        return sum(self.times) / len(self.times)

    def sum(self):
        return sum(self.times)

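    def cumsum(self):
        # Running totals in pure Python; this script does not import NumPy
        totals, running = [], 0.0
        for t in self.times:
            running += t
            totals.append(running)
        return totals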

def is_prime(n):
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def single_thread():
    for num in PRIMES:
        is_prime(num)

def multi_thread():
    with ThreadPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

def multi_process():
    with ProcessPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)

if __name__ == '__main__':
    timer = Timer()
    single_thread()
    print(timer.stop())
    timer.start()
    multi_thread()
    print(timer.stop())
    timer.start()
    multi_process()
    print(timer.stop())

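The results are below, in seconds, in the order single-threaded, multithreaded, multiprocess. Multithreading is actually the slowest, and multiprocessing is the fastest.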

29.008396863937378
34.24095010757446
5.421916484832764
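
To put those three timings in relative terms, the arithmetic can be spelled out as speedup = serial time / concurrent time. The snippet only reuses the numbers reported above; the variable names are chosen for illustration.

```python
# Timings in seconds, copied from the run reported above
single_thread = 29.008396863937378
multi_thread = 34.24095010757446
multi_process = 5.421916484832764

# Speedup relative to the single-threaded run:
# a value above 1 means faster than serial, below 1 means slower.
process_speedup = single_thread / multi_process
thread_speedup = single_thread / multi_thread

print(f"multiprocess speedup: {process_speedup:.2f}x")
print(f"multithread speedup:  {thread_speedup:.2f}x")
```

The exact ratios depend on the machine and its core count, so they describe this one run rather than a general rule.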

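Asynchronous IO in Python

Coroutines: concurrency within a single thread

Introducing Python's asynchronous IO library: asyncio

import asyncio

# Define a coroutine
async def myfunc(url):
    # get_url is a placeholder for an awaitable fetch function
    await get_url(url)

async def main():
    # Create the task list; tasks are scheduled on the running event loop
    tasks = [asyncio.create_task(myfunc(url)) for url in urls]
    # Wait for every crawl task to finish
    await asyncio.wait(tasks)

# Start the event loop and run until main() completes
asyncio.run(main())

Note: to use asynchronous IO, the libraries you depend on must support it.
In the crawler case, requests does not support asynchronous IO, so use aiohttp instead.

Example: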

import asyncio
import time

import aiohttp
import blog_spider  # local module, not shown in this article, that exposes `urls`

async def async_craw(url):
    print("craw url:", url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            result = await resp.text()
            print(f"craw url: {url}, {len(result)}")

async def main():
    # Schedule one task per URL, then wait for all of them
    tasks = [asyncio.create_task(async_craw(url)) for url in blog_spider.urls]
    await asyncio.wait(tasks)

start_time = time.time()
asyncio.run(main())
end_time = time.time()
print('use time second:', end_time - start_time)
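
The same pattern can be tried without aiohttp or a network connection. In the sketch below, fake_fetch is a hypothetical stand-in for an HTTP request, with asyncio.sleep modelling the wait; asyncio.gather is used instead of asyncio.wait because it returns results in input order.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an aiohttp request; asyncio.sleep models the wait."""
    await asyncio.sleep(delay)
    return f"{url}: done"


async def crawl_all(urls):
    # gather runs the coroutines concurrently and keeps results in input order
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_crawl(urls):
    """Run the crawl in a fresh event loop and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(crawl_all(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_crawl([f"page-{i}" for i in range(5)])
    print(results)
    print(f"elapsed: {elapsed:.2f} s")
```

All five waits overlap inside one thread, so the total should stay close to a single 0.2-second delay rather than the one-second sum.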