Multithreading and Multiprocessing in Python

Introduction

Python lets you write programs quickly, but its support for multithreading is poor, so in Python 2 multiprocessing is used more often. Python 3 introduced the concurrent.futures package, which makes multithreaded and multiprocess development easier.

Python GIL

The execution of Python code is controlled by the Python interpreter. Several interpreters exist today; the best known are CPython, PyPy, and Jython. CPython, the earliest implementation, written in C, is the most widely used.

The operating system allows multiple threads to run at the same time. Python, however, was designed around executing code inside the interpreter's main loop, so CPython includes the Global Interpreter Lock (GIL), which governs access to the interpreter: a Python thread must first acquire the GIL before it can execute.
Consequently, on single-core and multi-core CPUs alike, only one thread is executed by the Python interpreter at any given moment, so threads cannot truly run in parallel. This is the root cause of Python multithreading's sometimes poor performance on multi-core CPUs.
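Although the GIL prevents parallelism, it also makes each individual bytecode operation effectively atomic. The Python 3 sketch below (an illustration added here, not from the original post) has four threads appending to a shared list with no lock; because list.append runs as a single operation under the GIL, no updates are lost:

```python
import threading

results = []

def worker(n):
    for i in range(n):
        # list.append executes as one bytecode operation, so the GIL
        # makes it safe to call from several threads without a lock
        results.append(i)

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 40000: no appends were lost
```

Note this only holds for single operations; a compound update such as x += 1 is not protected, as shown later in the threading section.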

High-performance approaches in Python 2

The main solutions for multitasking in Python are:

  • Start multiple processes, each with a single thread, and run the tasks across the processes;
  • Start a single process and run multiple threads inside it;
  • Start multiple processes and run multiple threads inside each one, executing even more tasks at once; this is too complex, performs poorly in practice, and is rarely used.

Using multiprocessing

Multiprocessing lives in the multiprocessing package.

First, a look at the Process class.

from multiprocessing.process import Process, current_process, active_children

class Process(object):
    '''
    Process objects represent activity that is run in a separate process

    The class is analagous to `threading.Thread`
    '''
    _Popen = None

    def __init__(self, group=None, target=None, name=None, args=(), kwargs={}):
        assert group is None, 'group argument must be None for now'
        count = _current_process._counter.next()
        self._identity = _current_process._identity + (count,)
        self._authkey = _current_process._authkey
        self._daemonic = _current_process._daemonic
        self._tempdir = _current_process._tempdir
        self._parent_pid = os.getpid()
        self._popen = None
        self._target = target
        self._args = tuple(args)
        self._kwargs = dict(kwargs)
        self._name = name or type(self).__name__ + '-' + \
                     ':'.join(str(i) for i in self._identity)

A simple example of using Process:

from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()
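Besides Process, the module also exports current_process and active_children, seen in the import above. A small Python 3 sketch, written here as an illustration rather than taken from the post:

```python
from multiprocessing import Process, current_process, active_children

def work():
    # inside the child, current_process() describes the child itself
    print(current_process().name)

if __name__ == '__main__':
    print(current_process().name)  # the parent is named MainProcess
    procs = [Process(target=work, name='worker-%d' % i) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(len(active_children()))  # 0: join() reaps every child
```

current_process() returns the Process object for the caller, and active_children() lists children that have been started but not yet joined.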

Multithreading

Thread handling lives in the threading package.

First, a quick look at the Thread class.

# Main class for threads

class Thread(_Verbose):
    """A class that represents a thread of control.

    This class can be safely subclassed in a limited fashion.

    """
    __initialized = False
    # Need to store a reference to sys.exc_info for printing
    # out exceptions when a thread tries to use a global var. during interp.
    # shutdown and thus raises an exception about trying to perform some
    # operation on/with a NoneType
    __exc_info = _sys.exc_info
    # Keep sys.exc_clear too to clear the exception just before
    # allowing .join() to return.
    __exc_clear = _sys.exc_clear

    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs=None, verbose=None):
        """This constructor should always be called with keyword arguments. Arguments are:

        *group* should be None; reserved for future extension when a ThreadGroup
        class is implemented.

        *target* is the callable object to be invoked by the run()
        method. Defaults to None, meaning nothing is called.

        *name* is the thread name. By default, a unique name is constructed of
        the form "Thread-N" where N is a small decimal number.

        *args* is the argument tuple for the target invocation. Defaults to ().

        *kwargs* is a dictionary of keyword arguments for the target
        invocation. Defaults to {}.

        If a subclass overrides the constructor, it must make sure to invoke
        the base class constructor (Thread.__init__()) before doing anything
        else to the thread.


"""

A simple example

#!/usr/bin/python
from threading import Thread

def count(n):
    print "begin count..."
    while n > 0:
        n-=1
    print "done."

def test_ThreadCount():
    t1 = Thread(target=count,args=(1000000,))
    print("start thread.")
    t1.start()
    print "join thread." 
    t1.join()

if __name__ == '__main__':    
    test_ThreadCount()

Output:

start thread.
begin count...
join thread.

done.
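The counter above is private to one thread. When several threads update shared state, the GIL alone is not enough, because an operation like counter += 1 spans several bytecodes and a thread switch can land in the middle of it. A threading.Lock serializes the update; the Python 3 sketch below is added here for illustration:

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write sequence;
        # holding the lock makes the whole sequence atomic
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000
```

Without the lock, the final count can fall short of 400000 because concurrent read-modify-write sequences overwrite each other.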

Performance comparison: multiprocessing vs. multithreading

The test code is adapted from another blogger and uses timeit, which ships with the standard library.

#!/usr/bin/python
from threading import Thread
from multiprocessing import Process,Manager
from timeit import timeit

def count(n):
    while n > 0:
        n-=1

def test_normal():
    count(1000000)
    count(1000000)

def test_Thread():
    t1 = Thread(target=count,args=(1000000,))
    t2 = Thread(target=count,args=(1000000,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()


def test_Process():
    t1 = Process(target=count,args=(1000000,))
    t2 = Process(target=count,args=(1000000,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

if __name__ == '__main__':
    print "test_normal",timeit('test_normal()','from __main__ import test_normal',number=10)
    print "test_Thread",timeit('test_Thread()','from __main__ import test_Thread',number=10)
    print "test_Process",timeit('test_Process()','from __main__ import test_Process',number=10)

Output after running:

test_normal 1.0291161
test_Thread 7.5084157
test_Process 1.6441867

As the numbers show, calling the function directly is actually fastest, Process comes second, and Thread is slowest. But this benchmark is pure computation; when slow I/O operations are involved, Process or Thread is still worthwhile.
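To see why threads still pay off for I/O, the Python 3 sketch below (illustrative only) stands in for a blocking operation with time.sleep, which releases the GIL just as a socket or disk read would. The two threaded "reads" overlap and finish in roughly half the serial time:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # stands in for a blocking read; sleep releases the GIL

start = time.time()
io_task()
io_task()
serial = time.time() - start      # ~0.4s: the two waits run back to back

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start    # ~0.2s: the two waits overlap

print(threaded < serial)  # True
```

This is the opposite of the CPU-bound benchmark above: while a thread waits on I/O it holds no GIL, so other threads make progress.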

The concurrent.futures package in Python 3

Developers coming from Java or C# will be familiar with futures, which provide support for concurrency.
Python 3.2 added the concurrent.futures package; on Python 2.7, install the futures backport with pip install futures.

The concurrent.futures module gives developers a high-level interface for executing calls asynchronously. It is essentially an abstraction layer built on top of Python's threading and multiprocessing modules: easier to use, though the simplification costs a good deal of flexibility.

The key class here is Executor. Executor itself is abstract; its two concrete implementations are ThreadPoolExecutor and ProcessPoolExecutor, which, as the names suggest, correspond to thread and process pools respectively.
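Because both pools share the Executor interface, code written against one can be switched to the other by changing a single name. A minimal Python 3 sketch using Executor.map, added here for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() runs the calls concurrently but yields results in input order
    results = list(executor.map(square, [1, 2, 3, 4]))
print(results)  # [1, 4, 9, 16]
```

Replacing ThreadPoolExecutor with ProcessPoolExecutor gives the same result using worker processes instead of threads.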

A look at the definition of ProcessPoolExecutor: by default, the maximum number of workers matches the number of CPUs.


class ProcessPoolExecutor(_base.Executor):
    def __init__(self, max_workers=None):
        """Initializes a new ProcessPoolExecutor instance.

        Args:
            max_workers: The maximum number of processes that can be used to
                execute the given calls. If None or not given then as many
                worker processes will be created as the machine has processors.
        """
        _check_system_limits()

        if max_workers is None:
            self._max_workers = multiprocessing.cpu_count()
        else:
            if max_workers <= 0:
                raise ValueError("max_workers must be greater than 0")

            self._max_workers = max_workers

Next, the definition of ThreadPoolExecutor: because thread pools are typically used to overlap I/O rather than CPU work, the default maximum number of workers is five times the CPU count.

class ThreadPoolExecutor(_base.Executor):
    def __init__(self, max_workers=None):
        """Initializes a new ThreadPoolExecutor instance.

        Args:
            max_workers: The maximum number of threads that can be used to
                execute the given calls.
        """
        if max_workers is None:
            # Use this number because ThreadPoolExecutor is often
            # used to overlap I/O instead of CPU work.
            max_workers = (cpu_count() or 1) * 5
        if max_workers <= 0:
            raise ValueError("max_workers must be greater than 0")

        self._max_workers = max_workers
        self._work_queue = queue.Queue()
        self._threads = set()
        self._shutdown = False
        self._shutdown_lock = threading.Lock()

A simple example, adapted from another blogger's program:

#!/usr/bin/python2
import os
import urllib

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import as_completed
from concurrent.futures import ProcessPoolExecutor

def downloader(url):
    req = urllib.urlopen(url)
    if req is not None:
        print "begin down", url
    filename = os.path.basename(url)
    ext = os.path.splitext(url)[1]
    if not ext:
        raise RuntimeError("URL does not contain an extension")

    with open(filename, "wb") as file_handle:
        while True:
            chunk = req.read(1024)
            if not chunk:
                break
            file_handle.write(chunk)
    msg = "Finished downloading {filename}".format(filename=filename)
    return msg

def mainProcess(urls):
    with ProcessPoolExecutor(max_workers = 5) as executor:
        futures = [executor.submit(downloader,url) for url in urls]
        for future in as_completed(futures):
            print(future.result())

def mainThread(urls):
    with ThreadPoolExecutor(max_workers = 5) as executor:
        futures = [executor.submit(downloader,url) for url in urls]
        for future in as_completed(futures):
            print(future.result())

if __name__ == "__main__":
    urls1 = [
        "http://www.irs.gov/pub/irs-pdf/f1040.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf"]
    urls2 = [
        "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
        "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]

    mainProcess(urls1)
    mainThread(urls2)

Running it three times produces the following output (note in run 2 how the "begin down" lines from two threads interleave mid-word):

----1
begin down http://www.irs.gov/pub/irs-pdf/f1040ez.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040a.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040.pdf
Finished downloading f1040ez.pdf
Finished downloading f1040.pdf
Finished downloading f1040a.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040es.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040sb.pdf
Finished downloading f1040sb.pdf
Finished downloading f1040es.pdf

----2
begin down http://www.irs.gov/pub/irs-pdf/f1040.pdfb
egin down http://www.irs.gov/pub/irs-pdf/f1040ez.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040a.pdf
Finished downloading f1040ez.pdf
Finished downloading f1040a.pdf
Finished downloading f1040.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040es.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040sb.pdf
Finished downloading f1040sb.pdf
Finished downloading f1040es.pdf

----3
begin down http://www.irs.gov/pub/irs-pdf/f1040.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040a.pdf
Finished downloading f1040.pdf
Finished downloading f1040a.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040ez.pdf
Finished downloading f1040ez.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040sb.pdf
begin down http://www.irs.gov/pub/irs-pdf/f1040es.pdf
Finished downloading f1040sb.pdf
Finished downloading f1040es.pdf