A Brief Look at Multiprocessing and Multithreading in Python

Most recent recommended article published 2024-08-06 17:45:33

kevinstarry

Tags: python, programming languages, numpy

Article link: https://blog.csdn.net/qq_41133428/article/details/127401800

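This article looks at using process pools from Python's multiprocessing module, stresses the problem of global variables being shared between tasks in a pool, and recommends treating globals as constants to avoid data conflicts. Examples show how process pools differ from multithreading, along with best practices for IO-bound and CPU-bound workloads.

Summary generated by CSDN using AI.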

Python multiprocessing

Python offers two common ways to write multiprocess code: the multiprocessing module and concurrent.futures.ProcessPoolExecutor. Either works; this post uses multiprocessing as the example.
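
For comparison, here is a minimal sketch of the second style, concurrent.futures.ProcessPoolExecutor. The task function square and the helper run_squares are hypothetical names made up for this illustration; they are not part of the original post.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def square(n):
    """Hypothetical task: return the worker's pid together with n squared."""
    return os.getpid(), n * n


def run_squares(numbers, max_workers=3):
    # executor.map yields results in input order; leaving the with-block
    # waits for outstanding work and shuts the pool down.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return [value for _pid, value in executor.map(square, numbers)]


if __name__ == "__main__":
    print(run_squares(range(5)))
```

The executor plays the same role as Pool: it owns a fixed set of worker processes and hands tasks to whichever one is free.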

multiprocessing

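For multiprocessing, the focus here is the process pool, Pool. The code is below, though you can skip it: the point is a problem that is easy to run into when managing processes with Pool. My advice: treat global variables as constants and do not modify them, or study multiprocessing in enough depth to find a proper solution.

In one sentence: when a process pool manages the workers, tasks are not fully isolated from one another's global variables, because each worker process is reused for many tasks.

Normally, with multiprocessing, each process works on its own copy of the program's code and data, so we usually treat processes as data-independent.

In detail: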

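When managing processes with a process pool, global variables are best kept constant.
Suppose the global is g_demo_list = [] and the tasks run in the pool call g_demo_list.append(data).
A worker process does not exit when its task finishes; it goes on to run the next task, which means the two tasks share that worker's data.
Say the pool creates process A. After finishing task a, process A does not exit but continues with task b.
If tasks a and b both call append on g_demo_list, the change made in a is kept,
and b sees the data that a left behind.
We usually do not want this, so avoid it where possible: keep global variables constant.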

import os
import time
from multiprocessing import cpu_count, Pool

g_demo_num = 100
g_demo_list = ["demo"]
task_num = 50
max_processes = cpu_count()


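def my_info(num):
    print(f"process id is:{os.getpid()}, current num is:{num}")
    delay = num if num < 2 else 1
    for i in range(num):
        time.sleep(delay)
    # This mutates the worker process's own copy of the global; the change
    # is still there when the same worker picks up its next task.
    g_demo_list.append(num)
    print(f"num = {g_demo_num + 1}, list = {g_demo_list}")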


def print_error(value):
    print("error:", value)


def multi_core():
    pool = Pool(processes=3)
    for i in range(task_num):
        pool.apply_async(my_info, args=(i,), error_callback=print_error)
    pool.close()
    pool.join()


if __name__ == '__main__':
    multi_core()
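
A shorter variation on the pool example above, without the sleeps, may make the effect easier to see. This is a sketch under assumptions: leaky_task and clean_task are hypothetical names, and the two remedies shown (returning results instead of mutating a global, and the Pool parameter maxtasksperchild) are options I am suggesting, not something the original code uses.

```python
import os
from multiprocessing import Pool

g_demo_list = []  # module-level state; every worker process has its own copy


def leaky_task(num):
    """Appends to the worker's global list, so earlier tasks show through."""
    g_demo_list.append(num)
    return os.getpid(), list(g_demo_list)


def clean_task(num):
    """Builds its result locally and returns it; no global is touched."""
    return os.getpid(), [num]


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # A reused worker carries its list from one task to the next.
        for pid, seen in pool.map(leaky_task, range(6)):
            print("leaky", pid, seen)
        for pid, seen in pool.map(clean_task, range(6)):
            print("clean", pid, seen)

    # maxtasksperchild=1 replaces a worker after one task; chunksize=1 makes
    # each item its own task, so every call starts from a fresh global.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        for pid, seen in pool.map(leaky_task, range(6), chunksize=1):
            print("fresh", pid, seen)
```

Returning values keeps the parent process in charge of collecting results, which is usually the simpler habit; maxtasksperchild trades extra process start-up cost for isolation.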

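Python multiprocessing and multithreading demo

The multiprocessing and multithreading examples are written together here to make them easier to show.

A quick note on when to use which: use multithreading for IO-bound tasks and multiprocessing for CPU-bound tasks.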

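IO-bound means work such as frequent disk reads and writes or downloading web pages. The CPU is very fast, while IO is limited by how fast the disk can be read: an ordinary mechanical hard drive manages roughly 80 MB/s, whereas a 2.0 GHz CPU runs 2,000,000,000 clock cycles per second (a count of cycles, not bytes). The gap is enormous. For IO-bound work, multiprocessing does not help: no matter how many cores you use, disk read speed remains the bottleneck.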
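
To make the IO-bound case concrete, here is a small sketch that compares running waits one after another with overlapping them in a thread pool. It assumes time.sleep is an acceptable stand-in for waiting on a disk or a network; fake_download, run_serial, run_threaded and timed are hypothetical helpers written for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(seconds):
    """Hypothetical IO task: the sleep stands in for waiting on disk or network."""
    time.sleep(seconds)
    return seconds


def timed(func):
    """Call func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def run_serial(delays):
    return [fake_download(d) for d in delays]


def run_threaded(delays):
    # While one thread waits, the others are free to start their own wait.
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        return list(pool.map(fake_download, delays))


if __name__ == "__main__":
    delays = [0.2] * 5
    _, serial_time = timed(lambda: run_serial(delays))
    _, threaded_time = timed(lambda: run_threaded(delays))
    print(f"serial: {serial_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The threaded run should finish in roughly the time of the longest single wait, because the waiting overlaps. A real disk that is already saturated will not speed up this way, which is the bottleneck the paragraph above describes.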

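CPU-bound means work such as NumPy matrix operations, for example computing the variance, the mean, or other mathematical results over a 1000000 × 1000000 × 1000000 three-dimensional array: jobs that demand a great deal of computing power. Multiprocessing can make full use of multiple cores, putting every core to work at the same time (most computers today have several) and raising the available compute.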
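
A minimal sketch of the CPU-bound pattern: split the work into chunks, let each process compute one chunk, then combine the partial results. It uses plain Python arithmetic instead of NumPy so that it runs without extra packages; sum_of_squares, make_chunks and parallel_sum_of_squares are hypothetical names for this illustration.

```python
from multiprocessing import Pool, cpu_count


def sum_of_squares(bounds):
    """CPU-bound work on one chunk: sum i*i for start <= i < stop."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def make_chunks(n, parts):
    """Split range(n) into at most `parts` contiguous (start, stop) pairs."""
    step = max(1, -(-n // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]


def parallel_sum_of_squares(n, processes=None):
    processes = processes or cpu_count()
    with Pool(processes=processes) as pool:
        # Each chunk is computed in a separate worker process.
        return sum(pool.map(sum_of_squares, make_chunks(n, processes)))


if __name__ == "__main__":
    n = 2_000_000
    # The parallel result should match the single-process result.
    print(parallel_sum_of_squares(n) == sum_of_squares((0, n)))
```

Whether this is actually faster depends on the chunks being large enough to outweigh the cost of starting processes and passing data between them.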

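A process is like QQ itself, and a thread is like a chat with one friend inside QQ. You can chat with several friends at once: after replying to friend A, while waiting for A's answer, you can switch threads and chat with friend B.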

import threading
import traceback
from concurrent.futures import ThreadPoolExecutor
import random
from functools import wraps
from multiprocessing import Pool, cpu_count
from time import sleep


def error_handler(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(f"function {func.__name__} error")
            print(f"{e}\n{traceback.format_exc()}")
            return "error"

    return wrapper


class MultiThread(object):
    """
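    How to use multithreading:
    1. Create a thread pool with ThreadPoolExecutor
    2. Define an ordinary function as the thread task
    3. Call ThreadPoolExecutor.submit to submit the task; the pool assigns it to an idle thread
    4. Call ThreadPoolExecutor.shutdown to close the pool (no more tasks can be submitted)
    Tips:
    Without shutdown, the main thread may reach its last statement while worker threads are still running
    When threads modify shared state, take a lock to stay thread-safe; the with form is the most concise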
    """

    def __init__(self):
        self.task_nums = 100
        self.task_names = [f"thread_task_{i}" for i in range(self.task_nums)]

        self.max_thread = 10
        self.pool = ThreadPoolExecutor(max_workers=self.max_thread)

        self.use_lock = True
        self.g_numbers = 0

    def deal_with_tasks(self, task_name, thread_lock):
        print(f"deal with task start:{task_name}")
        # wait_time = random.randrange(3)
        # sleep(wait_time)
        if self.use_lock:
            with thread_lock:
                self.g_numbers = self.g_numbers + 1
            """ is equivalent to(so let's use with):
            thread_lock.acquire()
            try:
                self.g_numbers = self.g_numbers + 1
            finally:
                thread_lock.release()
            """
        else:
            self.g_numbers = self.g_numbers + 1
        print(f"deal with task end  :{task_name}")
        return f"res_{task_name}"

    @staticmethod
    def task_done(res):
        print(f"task_done the result:{res.result()}")

    def main(self):
        thread_lock = threading.Lock()
        for my_task in self.task_names:
            sub_thread = self.pool.submit(self.deal_with_tasks, my_task, thread_lock)
            sub_thread.add_done_callback(self.task_done)
        self.pool.shutdown()
        print(f"main func end.")


class MultiProcess(object):
    def __init__(self):
        self.task_nums = 100
        self.task_names = [f"process_task_{i}" for i in range(self.task_nums)]

        self.max_process = cpu_count()  # number of CPU cores
        self.pool = Pool(processes=self.max_process)

        self.enable_error = False

    def __getstate__(self):
        self_dict = self.__dict__.copy()
        del self_dict['pool']
        return self_dict

    def __setstate__(self, state):
        self.__dict__.update(state)

    @error_handler
    def deal_with_tasks(self, task_name):
        print(f"deal with task start:{task_name}")
        sleep(1)
        # do your things
        """
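        A function submitted to the process pool may raise while it runs; ZeroDivisionError is the example here.
        We have to plan for that, and this demo handles it with a decorator.
        Note the two lines: from functools import wraps  @wraps(func)
        multiprocessing pickles the task function, its arguments and its return value to pass them between processes,
        and pickle looks a function up by name. Without functools.wraps the decorated function is named "wrapper"
        and the lookup fails; wraps restores the original name, so the function can be pickled.
        __getstate__ and __setstate__ are also there for pickling: they leave the unpicklable pool out of the instance
        (outside a class, making the decorated function picklable is enough)

        A decorator is not the only option; for example: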
        try:
            real_deal_with_tasks(task_name)
            # in real_deal_with_tasks do your things
        except Exception as e:
            print(e)
        """
        if self.enable_error:
            my_error = 1 / 0
        print(f"deal with task end  :{task_name}")
        ...

    @staticmethod
    def print_error(res):
        print(f"error: {res}")

    def main(self):
        for my_task in self.task_names:
            self.pool.apply_async(self.deal_with_tasks, args=(my_task,), error_callback=self.print_error)
        self.pool.close()
        self.pool.join()
        ...


if __name__ == '__main__':
    demo = MultiThread()
    demo.main()

    demo = MultiProcess()
    demo.main()
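
The docstring in MultiProcess.deal_with_tasks ties functools.wraps to pickling. The sketch below checks that claim with pickle directly, without starting any processes. The decorators with_wraps and without_wraps, the tasks task_ok and task_bad, and the helper can_pickle are all hypothetical names for this illustration, and it assumes the code is run as a script so that pickle can find the functions in the main module.

```python
import pickle
from functools import wraps


def with_wraps(func):
    @wraps(func)  # copies __name__ and __qualname__ from func onto wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@with_wraps
def task_ok(x):
    return x + 1


@without_wraps
def task_bad(x):
    return x + 1


def can_pickle(obj):
    """Return True if pickle.dumps accepts obj, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


if __name__ == "__main__":
    # pickle stores a function as "module + qualified name" and looks it up
    # again on load, so the name has to lead back to the same object.
    print(task_ok.__qualname__, can_pickle(task_ok))
    print(task_bad.__qualname__, can_pickle(task_bad))
```

Both functions behave the same when called; only the one decorated with wraps keeps a name that pickle can resolve, which is what a process pool needs in order to send the task to a worker.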