Python Program Run Time and Process Pools Explained

Author: ivalue2333

Article link: https://blog.csdn.net/ivalue/article/details/80112803

python==2.7

elasticsearch==6.2.0

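1: Background. I recently had a requirement to extract URLs from the previous day's logs in an ELK logging system and deduplicate them. I tried two deduplication approaches, cosine similarity and trie matching, and of the two the trie gave much better results.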

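The bottleneck now is that it is a bit slow. The obvious response to slowness is multithreading or multiprocessing, but Python threads do little to speed up CPU-bound work (in CPython the GIL lets only one thread execute Python bytecode at a time), so I looked at multiprocessing.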
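
The remark about threads can be checked with a small experiment. The sketch below is not from the original post: `burn` and `timed_map` are hypothetical helpers and the job sizes are arbitrary. It pushes the same CPU-bound function through `multiprocessing.pool.ThreadPool` and `multiprocessing.Pool`, which expose the same `map` interface. On CPython one would expect the process pool to finish sooner for pure-Python arithmetic, but the actual numbers depend on the machine, so none are claimed here.

```python
from __future__ import print_function

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def burn(n):
    """CPU-bound stand-in task: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(pool_cls, func, items, workers=4):
    """Run func over items with the given pool class; return (results, seconds)."""
    pool = pool_cls(workers)
    start = time.time()
    results = pool.map(func, items)  # blocks until every item is done
    elapsed = time.time() - start
    pool.close()
    pool.join()
    return results, elapsed


if __name__ == "__main__":
    jobs = [300000] * 8  # arbitrary workload, chosen only for illustration
    _, t_threads = timed_map(ThreadPool, burn, jobs)
    _, t_procs = timed_map(Pool, burn, jobs)
    print("thread pool : %.2f s" % t_threads)
    print("process pool: %.2f s" % t_procs)
```

Because both pool classes share one interface, switching between them is a one-word change, which makes this kind of comparison cheap to run on real workloads.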

2: Run time

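    import time

    # Note: on Python 2.7, range() builds the whole list first; xrange() avoids that.
    a = 100
    s1 = time.time()
    # No need to count the zeros: a modest target, 100 million
    for x in range(100000000):
        if a > x:
            pass
        else:
            pass
    e1 = time.time()
    print(e1-s1)
    # e1-s1 is about 17.6 s (CPU: Xeon(R) E5-2650 v4)

    s1 = time.time()
    for x in range(100000000):
        pass
    e1 = time.time()
    print(e1 - s1)
    # e1 - s1 is about 7 s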

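As the numbers show, once the iteration count gets large (and in my log analysis it really did), a single if adds roughly 1.5 times the cost of the bare loop, at least in this example: 17.6 s against 7 s. Put another way, over 100 million iterations the if costs about 10.6 extra seconds.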
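
To repeat this measurement with less noise, the standard-library `timeit` module can replace the manual `time.time()` calls. This is a minimal sketch, assuming a far smaller iteration count than the post's 100 million so that it finishes quickly; `loop_only`, `loop_with_if`, and `measure` are illustrative names, not part of the original code.

```python
from __future__ import print_function

import timeit


def loop_only(n):
    """Bare loop, the baseline."""
    for x in range(n):
        pass


def loop_with_if(n, a=100):
    """Same loop with one comparison per iteration."""
    for x in range(n):
        if a > x:
            pass
        else:
            pass


def measure(func, n, repeat=3):
    """Best wall-clock time in seconds over `repeat` single runs of func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


if __name__ == "__main__":
    n = 1000000  # assumption: scaled down from the post's 100 million
    base = measure(loop_only, n)
    with_if = measure(loop_with_if, n)
    print("loop only: %.4f s" % base)
    print("loop + if: %.4f s" % with_if)
    print("ratio    : %.2f" % (with_if / base))
```

Taking the minimum of several runs reduces the influence of other processes on the machine; the ratio will vary with the interpreter version and hardware.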

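In short, the most time-consuming part of a program is its loops (a doubly nested loop costs n * n), followed by conditionals and I/O. It is worth remembering that many operations are implemented as loops under the hood, for example lookups and updates in many data structures.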

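Why multiprocessing: some of the program's logic is comparatively expensive. If each operation inside a large loop takes, say, 1.5 seconds (this can be simulated with sleep), running serially is a really poor choice, and multithreading or multiprocessing becomes a good option. In Python I chose multiprocessing. Below is a detailed walkthrough of the process pool; after reading it, the equivalent calls on a thread pool should be easy to follow as well.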

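# -*- coding: utf-8 -*-
# Makes print("label", value) behave the same on Python 2.7 and Python 3
from __future__ import print_function

import multiprocessing
import time
from multiprocessing import Pool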


def run(num):
  # the argument is one element of the data iterable
  time.sleep(1)
  print(multiprocessing.current_process().name)

def modify_data(num):
  time.sleep(2)
  return num

def get_data():
  for num in range(20):
    yield num

if __name__ == "__main__":

  print('=== separator 1: map ===')
  pool = Pool(10)
  s1 = time.time()
  # map blocks the main process until the pool has finished every task
  pool.map(run, get_data())
  e1 = time.time()
  pool.close()
  pool.join()
  print("parallel run time 1:", int(e1 - s1))

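  print('=== separator 2: map_async ===')
  pool = Pool(10)
  s1 = time.time()
  pool.map_async(run, get_data())
  # Taking e1 here would give almost the same value as s1, because map_async returns immediately
  # e1 = time.time()
  pool.close()
  # join blocks here until every task in the pool has finished
  pool.join()
  # taken here, e1 differs from s1 by the real run time
  e1 = time.time()
  print("parallel run time 2:", int(e1 - s1))

  # Conclusion: 1 and 2 behave the same and take the same time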

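  print('=== separator 3: apply ===')
  pool = Pool(10)
  s1 = time.time()
  for x in get_data():
    # apply blocks until each call returns, so used this way the tasks do not run in parallel;
    # the Python 3 documentation points to apply_async for parallel work instead
    pool.apply(run, (x,))
  e1 = time.time()
  pool.close()
  pool.join()
  print("serial run time 3:", int(e1 - s1))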

  print('=== separator 4: apply_async ===')
  pool = Pool(10)
  s1 = time.time()
  for x in get_data():
    # pass the arguments as a tuple or a list
    pool.apply_async(run, [x])
  pool.close()
  pool.join()
  print('flag')
  e1 = time.time()
  print("parallel run time 4:", int(e1 - s1))

  # Conclusion: 3 runs serially, while 1, 2 and 4 run in parallel and take about the same time

  print('=== separator 5: serial computation ===')
  ll = []
  s1 = time.time()
  for x in get_data():
    # plain call in the main process, one item at a time
    temp = modify_data(x)
    if temp > 1:
      ll.append(temp)
  print('flag')
  e1 = time.time()
  print("serial run time 5:", int(e1 - s1))
  s1 = time.time()
  for l in ll:
    print(l)
  e1 = time.time()
  print("time to fetch the data:", int(e1 - s1))

  print('Note: apply_async returns an ApplyResult; call its get() method to obtain the data')
  print('=== separator 6: apply_async, modify and collect return values ===')
  ll = []
  pool = Pool(10)
  s1 = time.time()
  # statements inside this loop also affect how concurrent the program is
  for x in get_data():
    # pass the arguments as a tuple or a list
    temp = pool.apply_async(modify_data, [x])
    print(temp)
    ll.append(temp)
  pool.close()
  pool.join()
  print('flag')
  e1 = time.time()
  print("parallel run time 6:", int(e1 - s1))
  s1 = time.time()
  for l in ll:
    print(l.get())
  e1 = time.time()
  print("time to fetch the data:", int(e1 - s1))

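  print('=== separator 7: apply_async, get() inside the loop ===')
  ll = []
  pool = Pool(10)
  s1 = time.time()
  # statements inside this loop also affect how concurrent the program is
  # here get() is called on temp (an ApplyResult) inside the loop, so the program runs serially
  for x in get_data():
    # pass the arguments as a tuple or a list
    temp = pool.apply_async(modify_data, [x])
    print(temp.get())
    ll.append(temp)
  pool.close()
  pool.join()
  print('flag')
  e1 = time.time()
  print("serial run time 7:", int(e1 - s1))
  s1 = time.time()
  for l in ll:
    print(l.get())
  e1 = time.time()
  print("time to fetch the data:", int(e1 - s1))

  # Conclusion from 6 and 7: do not call get() inside the submission loop. get() blocks until
  # that task's result is ready, so each task finishes before the next one is submitted.
  # This is how a blocking get() behaves, not a restriction specific to multiprocessing.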
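
The difference between cases 6 and 7 can be isolated in a few lines. The following is a minimal sketch rather than the author's code: `submit_then_get`, `get_immediately`, and `timed` are hypothetical helpers, the sleep is shortened to 0.2 s, and the helpers accept any pool object that offers `apply_async`, so the same functions run on `Pool` and on `multiprocessing.pool.ThreadPool`.

```python
from __future__ import print_function

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def modify_data(num):
    time.sleep(0.2)  # stand-in for slow work; shorter than the post's 2 s
    return num


def submit_then_get(pool, items):
    """Case 6: queue every task first, then collect the results."""
    handles = [pool.apply_async(modify_data, (x,)) for x in items]
    return [h.get() for h in handles]


def get_immediately(pool, items):
    """Case 7: call get() right after each submission."""
    results = []
    for x in items:
        handle = pool.apply_async(modify_data, (x,))
        results.append(handle.get())  # waits here before the next task is submitted
    return results


def timed(strategy, pool, items):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.time()
    results = strategy(pool, items)
    return results, time.time() - start


if __name__ == "__main__":
    items = list(range(10))
    for pool_cls in (Pool, ThreadPool):
        pool = pool_cls(10)
        fast, t_fast = timed(submit_then_get, pool, items)
        slow, t_slow = timed(get_immediately, pool, items)
        pool.close()
        pool.join()
        print(pool_cls.__name__)
        print("  submit then get: %.1f s" % t_fast)
        print("  get in the loop: %.1f s" % t_slow)
        print("  same results:", fast == slow)
```

Both strategies return the same values in the same order; only the waiting pattern differs. With ten workers and ten tasks, the first strategy should take roughly one task duration and the second roughly ten, though exact timings depend on the system.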
