Python Multiprocessing and Multithreading

Article link: https://blog.csdn.net/Kevin_QQ/article/details/51543772

# Multithreading (threading)

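Threads are one of the best-known ways to achieve concurrency and parallelism, and operating systems generally provide them as a built-in feature. A thread is lighter than a process, and threads within the same process share one memory space, so they can share global variables; separate processes cannot.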
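To make the shared-memory point concrete, here is a minimal sketch in which several threads update one global variable. The names `counter`, `increment` and `run_threads` are made up for illustration, and the lock is there because a shared variable also means the threads can interfere with each other.

```python
import threading

# Hypothetical shared state: a module-level counter visible to every thread.
counter = 0
counter_lock = threading.Lock()


def increment(times):
    """Add 1 to the shared global counter `times` times."""
    global counter
    for _ in range(times):
        # The lock keeps the read-modify-write step from interleaving.
        with counter_lock:
            counter += 1


def run_threads(num_threads=4, times=1000):
    """Reset the counter, run the threads, and return the final value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=increment, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    # Every thread updated the same variable, so the total is num_threads * times.
    print(run_threads())
```

If the same function were run in separate processes, each process would get its own copy of `counter`, and the parent's value would stay unchanged.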

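import logging
import os
from queue import Queue
from threading import Thread
from time import time

# download_link, setup_download_dir and get_links are helper functions
# assumed to be defined elsewhere (they are not shown in this post).
logger = logging.getLogger(__name__)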
 
class DownloadWorker(Thread):
  def __init__(self, queue):
    Thread.__init__(self)
    self.queue = queue
 
  def run(self):
    while True:
      # Get the work from the queue and expand the tuple
      directory, link = self.queue.get()
      try:
        download_link(directory, link)
      finally:
        # Always mark the task as done so queue.join() cannot hang on a failed download
        self.queue.task_done()
 
def main():
  ts = time()
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  # Create a queue to communicate with the worker threads
  queue = Queue()
  # Create 8 worker threads
  for x in range(8):
    worker = DownloadWorker(queue)
    # Setting daemon to True lets the main thread exit even though the workers are blocking:
    # when the main thread ends, all daemon threads are stopped with it.
    # With daemon = False (the default), the interpreter waits for the worker threads at exit,
    # and since these workers loop forever, the program would never terminate.
    worker.daemon = True
    worker.start()
  # Put the tasks into the queue as a tuple
  for link in links:
    logger.info('Queueing {}'.format(link))
    queue.put((download_dir, link))
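  # Causes the main thread to wait for the queue to finish processing all the tasks
  # Without queue.join(), the main thread would run the remaining lines (print, etc.) right away.
  # Non-daemon workers would then be waited for at exit; daemon workers, as here,
  # would be stopped with their downloads unfinished.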
  queue.join()
  print('Took {}'.format(time() - ts))

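Because of the GIL (global interpreter lock), only one thread in this process executes Python code at any given moment, so this code is concurrent but not parallel. It is still faster because the task is IO-bound: the processor does very little work while downloading the images, and most of the time is spent waiting on the network. That is why threads give a large speedup here: the processor can switch to another thread whenever one of them is ready to do some work. For CPU-bound work, however, the threading module in Python (or in any other interpreted language with a GIL) can actually reduce performance. If your code performs a CPU-bound task, such as decompressing gzip files, using the threading module can make it run longer. For CPU-bound tasks and true parallel execution, we can use the multiprocessing module.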
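The IO-bound speedup can be reproduced without a network connection. The sketch below replaces the real download with a hypothetical `fake_download` that only sleeps, which, like waiting on a socket, releases the GIL. It keeps the queue-and-workers pattern from the code above, but the function names and the 0.05 s delay are assumptions made for illustration.

```python
import time
from queue import Queue
from threading import Thread


def fake_download(link, delay=0.05):
    """Stand-in for a network download: sleeping releases the GIL."""
    time.sleep(delay)
    return link.upper()


def run_sequential(links):
    """Process the links one after another in the calling thread."""
    return [fake_download(link) for link in links]


def worker(queue, results):
    """Take (index, link) tasks from the queue forever."""
    while True:
        index, link = queue.get()
        try:
            results[index] = fake_download(link)
        finally:
            queue.task_done()


def run_threaded(links, num_workers=8):
    """Process the links with daemon worker threads fed from a queue."""
    queue = Queue()
    results = [None] * len(links)
    for _ in range(num_workers):
        Thread(target=worker, args=(queue, results), daemon=True).start()
    for item in enumerate(links):
        queue.put(item)
    queue.join()  # wait until every task has been marked done
    return results


def timed(func, *args):
    """Return (result, elapsed seconds) for func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    links = ["img{}.jpg".format(i) for i in range(8)]
    _, t_seq = timed(run_sequential, links)
    _, t_thr = timed(run_threaded, links)
    print("sequential: {:.2f}s, threaded: {:.2f}s".format(t_seq, t_thr))
```

With eight links and eight workers, the waits overlap, so the threaded run should take roughly the time of one download rather than eight.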

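The advantage of threads over processes is that threads within the same process can share data more easily. Under the GIL, once one thread holds the interpreter, the other threads must wait until it releases the lock before they can run. On a single processor this costs nothing, because a single processor cannot run threads in parallel anyway. On a multiprocessor machine, however, it prevents the threads from taking advantage of multiple cores.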
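A rough way to observe this is to time a pure-Python CPU-bound loop run sequentially and then in several threads. This is only a sketch: `count_down` and the loop sizes are arbitrary choices, and the timings depend on the machine and the interpreter build. On a standard CPython build with the GIL, the threaded run is usually no faster than the sequential one.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; returns 0 once it has counted down."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, repeats):
    """Run the loop `repeats` times, one after another."""
    for _ in range(repeats):
        count_down(n)


def run_threaded(n, repeats):
    """Run the loop in `repeats` threads and wait for all of them."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(repeats)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def elapsed(func, *args):
    """Return how many seconds func(*args) took."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


if __name__ == "__main__":
    n, repeats = 2_000_000, 4
    print("sequential: {:.2f}s".format(elapsed(run_sequential, n, repeats)))
    print("threaded:   {:.2f}s".format(elapsed(run_threaded, n, repeats)))
```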

# Multiprocessing (multiprocessing)

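To use multiple processes, we create a process pool (Pool). Using its map method, we pass the list of URLs to the pool, which spawns 8 new processes that download the images in parallel. This is true parallelism, but it comes at a cost: the script's entire memory is copied into each child process. In our example this does not matter, but in a large program it can easily become a serious problem.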

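Sharing resources between processes should be avoided where possible. Shared resources inevitably lead to contention between processes, contention leads to race conditions, and the results may then depend on nondeterministic timing. If sharing is really needed, it can still be done through shared memory or Manager objects.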

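import os
from functools import partial
from multiprocessing.pool import Pool
from time import time

# download_link, setup_download_dir and get_links are helper functions
# assumed to be defined elsewhere (they are not shown in this post).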
 
def main():
  ts = time()
  client_id = os.getenv('IMGUR_CLIENT_ID')
  if not client_id:
    raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
  download_dir = setup_download_dir()
  links = [l for l in get_links(client_id) if l.endswith('.jpg')]
  download = partial(download_link, download_dir)
  with Pool(8) as p:
    p.map(download, links)
  print('Took {}s'.format(time() - ts))
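
For the case where processes really do need shared state, the following is a minimal sketch using shared memory through `multiprocessing.Value`. The helper names `add_many` and `run_processes` are hypothetical. The lock obtained from `get_lock()` guards against the race condition described earlier.

```python
from multiprocessing import Process, Value


def add_many(shared_total, times):
    """Increment the shared integer `times` times under its lock."""
    for _ in range(times):
        # Without the lock, concurrent += updates could be lost.
        with shared_total.get_lock():
            shared_total.value += 1


def run_processes(num_processes=4, times=1000):
    """Let several processes update one shared integer and return it."""
    shared_total = Value("i", 0)  # "i" = C int stored in shared memory
    processes = [Process(target=add_many, args=(shared_total, times))
                 for _ in range(num_processes)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return shared_total.value


if __name__ == "__main__":
    print(run_processes())
```

The `if __name__ == "__main__":` guard matters here: on platforms that start child processes by re-importing the main module, unguarded code would run again in every child.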

# Summary

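If your code is IO-bound, both threads and multiple processes can help. Multiprocessing is easier to drop in than threading but consumes more memory. If your code is CPU-bound, multiprocessing is clearly the better choice, especially on a multi-core or multi-CPU machine. For web applications, and when you need to scale the work across multiple machines, RQ is the better choice.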
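As a closing illustration of the CPU-bound case, here is a small self-contained sketch that follows the same `partial` plus `Pool.map` pattern as the download example. The function `sum_of_powers` and the pool size are made-up stand-ins for real CPU-heavy work.

```python
from functools import partial
from multiprocessing import Pool


def sum_of_powers(exponent, limit):
    """CPU-bound stand-in: sum i**exponent for every i below limit."""
    return sum(i ** exponent for i in range(limit))


def run_pool(limits, exponent=2, processes=4):
    """Compute sum_of_powers for each limit in a pool of worker processes."""
    # Fix the first argument so that map() only has to supply the limit.
    work = partial(sum_of_powers, exponent)
    with Pool(processes) as pool:
        return pool.map(work, limits)


if __name__ == "__main__":
    print(run_pool([10, 100, 1000]))
```

Each worker process has its own interpreter and its own GIL, so the calls can run on separate cores.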