A Guide to Choosing Between Multiprocessing (Multithreading) and Async Functions for High-Concurrency Tasks

_冷眸_

Last modified 2024-07-12 23:51:07

First published 2024-06-29 20:58:04

Article link: https://blog.csdn.net/pydaxing_pdx/article/details/140070460

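In modern software development, handling high-concurrency and network I/O-bound tasks is a common challenge. Python offers several ways to handle concurrency, the most common being multiprocessing (or multithreading) and asynchronous programming. This article examines how the two approaches differ in performance in practice and uses experiments to compare how efficiently they handle large numbers of network requests.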

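1. Multiprocessing (Multithreading) and Async Functions

1.1 Multiprocessing (Multithreading)

Multiprocessing and multithreading let multiple tasks run concurrently within one program. Every thread or process consumes system resources such as CPU time and memory. This approach suits running several independent tasks at the same time, especially on multi-core CPUs. Note that in CPython, threads are subject to the GIL (discussed in section 2.1), so only multiple processes run Python code truly in parallel.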

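Pros

True parallel execution is possible (in CPython, with multiple processes).
On multi-core processors, this can significantly improve execution efficiency.

Cons

Managing threads and processes consumes extra resources.
Synchronization and communication between threads can lead to complex race conditions and deadlocks.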
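
As a small illustration of the synchronization cost mentioned above, the following sketch increments a shared counter from several threads and guards the update with a lock. It is a minimal example under assumptions, not code from the original experiments: run_counter and its parameters are hypothetical names chosen for illustration.

```python
import threading


def run_counter(num_threads: int, increments: int) -> int:
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # The lock makes the read-modify-write step atomic, so no
            # increment is lost when threads interleave.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = run_counter(num_threads=4, increments=10_000)
print(total)  # 40000
```

Without the lock, `counter += 1` is not guaranteed to be atomic, which is the kind of race condition the list above refers to. The price is that every increment now pays for acquiring and releasing the lock.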

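1.2 Async Functions

Asynchronous programming is a single-threaded way of scheduling tasks: an event loop manages their execution. It is well suited to I/O-bound tasks such as network requests and file operations.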

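Pros

Efficient I/O handling that does not block the main thread.
Lower overhead from thread creation and context switching.

Cons

The programming model is relatively complex and requires understanding the event loop and callbacks.
It performs poorly on CPU-bound tasks, because all tasks run in the same thread.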
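
The single-threaded nature of the event loop can be seen in a few lines. The sketch below is only an illustration under assumptions: asyncio.sleep stands in for a real network or file wait, and fake_io and run_all are hypothetical names. Each coroutine records the ID of the thread it runs on.

```python
import asyncio
import threading


async def fake_io(name: str, delay: float) -> tuple[str, int]:
    # asyncio.sleep stands in for a real I/O wait; while this coroutine
    # is suspended, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return name, threading.get_ident()


async def run_all() -> list[tuple[str, int]]:
    # gather schedules the three coroutines concurrently and returns
    # their results in the order they were passed in.
    return await asyncio.gather(
        fake_io("a", 0.05),
        fake_io("b", 0.05),
        fake_io("c", 0.05),
    )


io_results = asyncio.run(run_all())
names = [name for name, _ in io_results]
thread_ids = {tid for _, tid in io_results}
print(names)            # ['a', 'b', 'c']
print(len(thread_ids))  # 1
```

All three coroutines report the same thread ID: the waits overlap, but the Python code itself never runs in parallel. That is why a long computation inside a coroutine would hold up every other task.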

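2. Performance Comparison

Asynchronous programming and multiprocessing (multithreading) are both effective ways to achieve concurrency, but each has its own strengths and use cases. Asynchronous programming is typically used for I/O-bound tasks such as file operations and network requests, while multiprocessing (multithreading) can work on several tasks at the same time, especially on multi-core processors.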

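2.1 Multiprocessing (Multithreading)

When using multiprocessing or multithreading, pay close attention to Python's Global Interpreter Lock (GIL). Let's start with a multithreading example and use an experiment to make the effect concrete.

Machine: Mac-Pro, Apple M2, 10 cores

2.1.1 Threading

Below is a multithreading example implemented with threading.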

import threading
from datetime import datetime

def cpu_bound_task(idx):
    # Run a CPU-bound task
    count = 0
    for i in range(100000000):
        count += i

# Number of tasks
num_tasks = 1

# Create one thread per task (vary num_tasks to reproduce the table below)
threads = []
for i in range(num_tasks):
    thread = threading.Thread(target=cpu_bound_task, args=(i,))
    threads.append(thread)

start_time = datetime.now()

# Start all threads
for thread in threads:
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()

end_time = datetime.now()
print(f"Time: {(end_time-start_time).total_seconds()} seconds")

Elapsed time versus number of tasks:

Tasks	Time (s)
1	2.469811
2	4.875715
3	7.155812
5	11.53091
10	24.403462

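As the table shows, the total time to finish all tasks grows almost linearly with the number of tasks. The threads are supposed to run concurrently, so why does the time multiply?

The reason is the Global Interpreter Lock (GIL): in standard CPython builds, only one thread may execute Python bytecode at any given moment. Even on a machine with multiple CPU cores, threads created with the standard threading library do not run CPU-bound work in parallel. They take turns holding the GIL, so the total time is roughly the same as running the tasks one after another.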
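
To see the trend from the table on a smaller scale, sequential and threaded runs can be timed side by side. This is a minimal sketch under assumptions: the loop count N is far smaller than the article's 100000000 so that it finishes quickly, time_sequential and time_threaded are hypothetical helper names, and the actual numbers will vary by machine and Python build.

```python
import threading
import time


def cpu_bound_task(n: int) -> int:
    # Same shape as the article's task, with a configurable loop count.
    count = 0
    for i in range(n):
        count += i
    return count


def time_sequential(num_tasks: int, n: int) -> float:
    """Run the tasks one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(num_tasks):
        cpu_bound_task(n)
    return time.perf_counter() - start


def time_threaded(num_tasks: int, n: int) -> float:
    """Run one thread per task and return the elapsed seconds."""
    threads = [
        threading.Thread(target=cpu_bound_task, args=(n,))
        for _ in range(num_tasks)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


N = 1_000_000  # assumption: scaled down from the article's 100000000
for num_tasks in (1, 2, 4):
    seq = time_sequential(num_tasks, N)
    thr = time_threaded(num_tasks, N)
    print(f"tasks={num_tasks} sequential={seq:.3f}s threaded={thr:.3f}s")
```

On a GIL-enabled interpreter, the threaded column would be expected to track the sequential column rather than stay flat, which matches the linear growth in the table above.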

2.1.2 Multiprocessing

Let's try a different approach and use Process from multiprocessing.

from multiprocessing import Process
from datetime import datetime

def cpu_bound_task(idx):
    count = 0
    for i in range(100000000):
        count += i

if __name__ == '__main__':
    num_tasks = 1
    processes = []
    for i in range(num_tasks):
        process = Process(target=cpu_bound_task, args=(i,))
        processes.append(process)

    start_time = datetime.now()

    # Start all processes
    for process in processes:
        process.start()

    # Wait for all processes to finish
    for process in processes:
        process.join()

    end_time = datetime.now()
    print(f"Time: {(end_time - start_time).total_seconds()} seconds")

Elapsed time versus number of tasks:

Tasks	Time (s)
1	2.983705
2	3.308298
3	3.138572
5	3.76091
10	5.613843

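As the table shows, regardless of the number of tasks, the elapsed time stays between roughly 3 and 6 seconds. The small variation arises because more processes mean more time spent on scheduling and switching between them, and because other programs running on a personal machine occupy some of the cores. This variation is normal; the tasks were processed in parallel.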

2.1.3 Concurrent.Futures

1. ThreadPoolExecutor in concurrent.futures

Next, let's try ThreadPoolExecutor from concurrent.futures.

import concurrent.futures
from datetime import datetime

def task_function(idx):
    count = 0
    for i in range(100000000):
        count += i

def main():
    num_tasks = 1  # Number of tasks to run
    start_time = datetime.now()  # Start timing

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        # Use executor.map to run the tasks concurrently
        executor.map(task_function, range(num_tasks))

    end_time = datetime.now()  # Stop timing
    total_duration = (end_time - start_time).total_seconds()
    print(f"Total duration: {total_duration} seconds")

if __name__ == "__main__":
    main()

Elapsed time versus number of tasks:

Tasks	num_workers	Time (s)
1	1	2.392556
2	1	5.000087
3	1	7.298594
5	1	12.322661
10	1	24.470803
2	2	4.907088
3	3	7.498068
5	5	12.665615
10	10	24.979614

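The same problem appears again. When num_workers (the max_workers argument) is 1, total time grows in proportion to the number of tasks, which is expected: with a single worker, all tasks run serially. But when the number of workers equals the number of tasks, total time still grows in proportion to the number of tasks. Once again, Python's GIL is the culprit.

2. ProcessPoolExecutor in concurrent.futures

Now let's try ProcessPoolExecutor from concurrent.futures.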

import concurrent.futures
from datetime import datetime

def task_function(idx):
    count = 0
    for i in range(100000000):
        count += i

def main():
    num_tasks = 1  # Number of tasks to run
    start_time = datetime.now()  # Start timing

    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        # Use executor.map to run the tasks concurrently
        results = executor.map(task_function, range(num_tasks))

    end_time = datetime.now()  # Stop timing
    total_duration = (end_time - start_time).total_seconds()
    print(f"Total duration: {total_duration} seconds")

if __name__ == "__main__":
    main()

Elapsed time versus number of tasks:

Tasks	num_workers	Time (s)
1	1	3.018315
2	2	3.198022
3	3	3.391358
5	5	3.84788
10	10	5.737346

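Clearly, when the number of workers equals the number of tasks, the total elapsed time stays roughly the same as the number of tasks grows: the tasks are processed in parallel.

Threading and ThreadPoolExecutor from concurrent.futures are both subject to the GIL and, for CPU-bound work, do not achieve true parallelism. Process from multiprocessing and ProcessPoolExecutor from concurrent.futures sidestep the GIL, because each process has its own interpreter, and so can run in parallel.

Note that the GIL has a pronounced effect on CPU-bound tasks, but its effect on network I/O tasks is basically negligible, because a thread releases the GIL while it waits on I/O. Below is a network I/O example in which the number of tasks has almost no effect on total time.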

import threading
import requests
import time

urls = [
           'https://jsonplaceholder.typicode.com/posts/1',
           'https://jsonplaceholder.typicode.com/posts/2',
           'https://jsonplaceholder.typicode.com/posts/3',
           'https://jsonplaceholder.typicode.com/posts/4',
           'https://jsonplaceholder.typicode.com/posts/5'
       ] * 10

def fetch_data(url):
    response = requests.get(url)

start_time = time.time()

threads = []
for url in urls:
    thread = threading.Thread(target=fetch_data, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

end_time = time.time()
print(f"Time taken with multithreading: {end_time - start_time} seconds")
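
The thread-per-URL example above starts one thread for every request. A bounded pool is one way to keep the thread count fixed while the number of requests grows. The sketch below is a minimal illustration under assumptions, not a measurement from the original experiments: fake_fetch is a hypothetical stand-in that sleeps instead of calling requests.get, the example.invalid URLs are placeholders, and max_workers=10 is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for requests.get(url): sleeping imitates the
    # time spent waiting on the network, during which the GIL is released.
    time.sleep(0.05)
    return url


# Placeholder URLs; nothing is actually requested.
pool_urls = [f"https://example.invalid/item/{i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    # map preserves input order; list() collects every result.
    fetched = list(executor.map(fake_fetch, pool_urls))
elapsed = time.perf_counter() - start

print(f"{len(fetched)} results in {elapsed:.2f}s")
```

With 20 simulated requests of 0.05 s each, a sequential loop would need about 1 s, while ten workers overlap the waits and finish in a fraction of that.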

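2.2 Async Functions

Even though the GIL has little effect on network I/O tasks, asynchronous functions are still recommended when a large number of network I/O tasks must run concurrently. The multithreading and multiprocessing approaches shown above create one thread or process per task, which consumes a lot of machine resources; for tasks that are not CPU-bound, that cost is unnecessary.

Asynchronous programming, by contrast, usually uses a single-threaded event loop to manage many I/O operations, which avoids the overhead of thread context switches. For large numbers of small, frequent I/O operations, it uses system resources more efficiently. Below is an example of asynchronous concurrency.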

import aiohttp
import asyncio
import time

urls = [
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
    'https://jsonplaceholder.typicode.com/posts/3',
    'https://jsonplaceholder.typicode.com/posts/4',
    'https://jsonplaceholder.typicode.com/posts/5'
] * 100

async def fetch_data(session, url):
    async with session.get(url) as response:
        pass

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_data(session, url) for url in urls]
        await asyncio.gather(*tasks)

start_time = time.time()

asyncio.run(main())

end_time = time.time()
print(f"Time taken with async: {end_time - start_time} seconds")
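
The example above starts all 500 requests at once. Where the number of in-flight requests should be capped, for instance to avoid overloading a server, asyncio.Semaphore is one common option. The sketch below is a minimal illustration and not part of the original experiment: asyncio.sleep stands in for session.get, and fetch_limited, run_limited, and the limit of 10 are hypothetical.

```python
import asyncio


async def fetch_limited(sem: asyncio.Semaphore, idx: int, state: dict) -> int:
    # Only `limit` coroutines can be inside this block at the same time.
    async with sem:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for an HTTP request
        state["in_flight"] -= 1
        return idx


async def run_limited(num_requests: int, limit: int) -> tuple[list[int], int]:
    sem = asyncio.Semaphore(limit)
    # Track how many simulated requests overlap, to check the cap holds.
    state = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_limited(sem, i, state) for i in range(num_requests))
    )
    return results, state["peak"]


limited_results, peak = asyncio.run(run_limited(num_requests=50, limit=10))
print(len(limited_results), peak)
```

All 50 coroutines are still created up front, but at most `limit` of them wait on simulated I/O at any moment; the rest are parked on the semaphore.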

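3. Recommendations

3.1 Advantages of Asynchronous Programming

Resource efficiency: asynchronous programming usually uses a single-threaded event loop to manage many I/O operations, which avoids the overhead of thread context switches. For large numbers of small, frequent I/O operations, it uses system resources more efficiently.
A simpler concurrency model: asynchronous programming achieves concurrency through callbacks and async/await. Because tasks only switch at explicit await points, it avoids some of the problems that complicate multithreaded code, such as deadlocks and race conditions.
Scalability: when handling large numbers of concurrent connections (for example, in a high-concurrency network server), asynchronous programming is usually more scalable than multithreading, because it does not need a thread per connection.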