Python thread pools: usage notes and a problem log

靖淮CAFEBABE

Last modified: 2023-03-14 09:06:17

Category: Python. Tags: python, programming languages

First published: 2023-03-09 13:35:50

Article link: https://blog.csdn.net/weixin_42774786/article/details/129421126

Notes on a problem I ran into while using multithreading

Background

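A recent task at work: for each row of data in a file, call a mid-platform (中台) service and parse the response. Each line of the file is one parameter, which is sent to the service's interface in a POST request.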

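Analysis: the data volume is large (nearly 2 million lines), and the network round trip for each line takes about 0.5 s, so a single thread would need about half a month (roughly 12 days of pure request time). The service's QPS limit is set to 50. As a rough estimate, the network time is far larger than the cost of switching threads, so I decided to use multithreading.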
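
The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the approximate figures from the analysis; the variable names are made up for illustration, and the in-flight figure follows from Little's law (requests in flight = request rate × latency).

```python
# Back-of-the-envelope estimate using the approximate figures from the analysis.
rows = 2_000_000            # lines in the input file (approximate)
seconds_per_request = 0.5   # network round trip per line (approximate)
qps_limit = 50              # QPS cap configured on the service

# One request at a time: total time is rows * latency.
single_thread_days = rows * seconds_per_request / 86_400

# Little's law: requests in flight = request rate * latency.
max_in_flight = qps_limit * seconds_per_request

# Lower bound on the run time if the QPS cap is fully used.
best_case_hours = rows / qps_limit / 3600

print(f"single thread: {single_thread_days:.1f} days")
print(f"requests in flight at the QPS cap: {max_in_flight:.0f}")
print(f"best case at the QPS cap: {best_case_hours:.1f} hours")
```

Under these assumptions, about 25 concurrent requests are enough to reach the QPS cap, so a much larger pool would not finish the job any faster.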

I won't repeat the basics of multithreading here; see: Multithreading and multiprocessing (多线程和多进程)

Thread pool: ThreadPoolExecutor

ThreadPoolExecutor comes from the concurrent.futures module and creates a thread pool. Its commonly used methods are:

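submit(fn, *args, **kwargs): submits fn to the thread pool and returns a Future. *args are the positional arguments passed to fn, and **kwargs are the keyword arguments passed to fn.
map(func, *iterables, timeout=None, chunksize=1): similar to the built-in map(func, *iterables), except that the iterables are collected immediately and the calls to func run asynchronously on the pool's threads. Results are yielded in input order. chunksize has no effect with ThreadPoolExecutor.
shutdown(wait=True): shuts down the thread pool. With wait=True, the call blocks until all pending tasks have finished.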

Usage example:

#!/usr/bin/env python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def test(value1, value2=None):
    print("%s threading is printed %s, %s" % (threading.current_thread().name, value1, value2))
    time.sleep(2)
    return 'finished'


def test_result(future):
    # Callback: runs once the future has completed.
    print(future.result())


if __name__ == "__main__":
    threadPool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="test_")
    futures = []
    for i in range(0, 10):
        future = threadPool.submit(test, i, i + 1)
        # future.add_done_callback(test_result)
        futures.append(future)

    # Read results only after everything is submitted: calling result()
    # inside the submit loop blocks and runs the tasks one at a time.
    for future in futures:
        print(future.result())

    threadPool.shutdown(wait=True)
    print('main finished')
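
The list of methods also mentions map(), which the example does not use. A minimal sketch, assuming a made-up `square` task with a short sleep standing in for a blocking call, could look like this; it also uses the executor as a context manager, which calls shutdown(wait=True) on exit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Hypothetical task; the sleep stands in for a blocking network call.
    time.sleep(0.1)
    return x * x


# Leaving the with-block calls shutdown(wait=True) automatically.
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, even if tasks finish out of order.
    results = list(pool.map(square, range(10)))

print(results)
```

map() is convenient when every task takes the same kind of argument and only the return values matter; submit() is the better fit when each Future needs its own callback or error handling.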

The problem

My code: read the parameters from file1 one by one, call the interface, and write the return value to file2.

import json
import sys
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import requests

lock = threading.Lock()  # serializes appends to file2 across worker threads


def send_request(parameter):
    server_url = ""
    head = {
        xxxx  # headers elided in the original
    }
    req_data = '{...}'
    idx = 0

    # Retry up to 10 times on a non-200 status or an exception.
    while idx < 10:
        try:
            response_data = requests.post(server_url, data=req_data, headers=head, timeout=120)
            if response_data.status_code == 200:
                ret = json.loads(response_data.text)
                ...
                # print(time.time() - s_t)
                # "with lock" releases the lock even if the write raises.
                with lock:
                    with open(file2, 'a') as fw:
                        fw.write()  # arguments elided in the original
                return
            idx += 1
        except Exception as e:
            print("request err:%s\n%s" % (server_url, e), file=sys.stderr)
            idx += 1


if __name__ == '__main__':

    file1 = sys.argv[1]
    file2 = sys.argv[2]

    cnt = 0
    ori_url = 0
    badcase_dict = dict()
    bg_time = time.time()
    thread_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="truncate_process_")
    with open(file1, 'r') as f:
        # Iterate over the file lazily instead of loading it with readlines().
        for line in f:
            parameter = line..  # elided in the original
            obj = thread_pool.submit(send_request, parameter)

            if cnt % 1000 == 0:
                print("=======================progress:%d======================" % cnt)
                print("----time past:%f----" % (time.time() - bg_time))
            cnt += 1
    thread_pool.shutdown(wait=True)
    # time.sleep(10)
    print("-----------------end time: %f---------------" % (time.time() - bg_time))

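**What happened:** I first tested the code with a small file (about 2,000 lines). The requests went through, the return values were written to file2, and the amount of data in file2 matched expectations. But when I ran the baseline file in the background, the result after the python process ended was wrong: file2 contained far fewer lines than file1, and the nohup log showed that the process had not finished normally (the last log line was never printed).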

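My first guess was that there were too many threads and the QPS limit had been exceeded. I tried different maximum thread counts together with input files of different sizes, but the problem remained, so QPS could basically be ruled out. Searching on Baidu turned up one point: **in a multithreaded program, errors are not necessarily reported!** An exception raised inside a task submitted to a thread pool is stored in its Future and only surfaces when result() or exception() is called. So most likely something was going wrong in the program and causing the process to exit before finishing.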
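
The silent-failure behaviour can be reproduced in a few lines. This is a minimal sketch; the `parse` task and its inputs are made up for illustration. Nothing is printed when the second task fails, and the error only becomes visible when the Future is inspected.

```python
from concurrent.futures import ThreadPoolExecutor


def parse(line):
    # Hypothetical task: raises ValueError on non-numeric input.
    return int(line)


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(parse, s) for s in ["1", "oops", "3"]]

# Nothing has been raised or printed so far: the ValueError sits in its Future.
values = []
errors = []
for fut in futures:
    exc = fut.exception()  # returns the stored exception, or None
    if exc is not None:
        errors.append(exc)
    else:
        values.append(fut.result())

print(values, [type(e).__name__ for e in errors])
```

In a long-running job it is worth checking every Future this way (or in a done callback), otherwise a failing task looks the same as a successful one.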

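Later I reran the program and noticed that after it had been running for a while, the terminal became noticeably sluggish and commands could no longer be run normally. That strongly suggests memory exhaustion! Thinking about how a thread pool works, the likely cause was a backlog building up inside the pool. More searching on Baidu showed that ThreadPoolExecutor uses an unbounded queue by default: when tasks are submitted faster than the workers can process them, pending tasks pile up in the queue and memory can eventually run out. With the cause identified, the fix is simple: subclass ThreadPoolExecutor and give it a bounded queue. See: Bounded and unbounded queues in Python thread pools (python线程池有界队列和无界队列)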
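
The backlog is easy to observe on a small scale. The sketch below is illustrative only: `slow_task` and `release` are made-up names, and it reads `_work_queue`, a private attribute of CPython's ThreadPoolExecutor that may change between versions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release = threading.Event()


def slow_task(_):
    # Stand-in for a slow request: blocks until the main thread releases it.
    release.wait()


pool = ThreadPoolExecutor(max_workers=2)
for i in range(1000):
    pool.submit(slow_task, i)  # returns immediately, however many tasks are waiting

# Private attribute, read here only to make the backlog visible.
backlog = pool._work_queue.qsize()
print("tasks waiting in the queue:", backlog)

release.set()
pool.shutdown(wait=True)
```

With only two workers busy, nearly all 1000 tasks sit in the queue. With 2 million lines and a producer that reads the file far faster than requests complete, the same thing happens at a much larger scale.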

The fix:

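import queue
from concurrent.futures import ThreadPoolExecutor

class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    def __init__(self, max_workers=None, thread_name_prefix=''):
        super().__init__(max_workers, thread_name_prefix)
        # Bounded queue: submit() blocks when it is full. _work_queue is a private attribute.
        self._work_queue = queue.Queue(self._max_workers * 2)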
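
An alternative that does not touch the executor's private queue is to limit pending tasks with a semaphore. This is a different technique from the subclass above, shown only as a sketch; `SemaphoreBoundedExecutor` and its `bound` parameter are hypothetical names, not part of concurrent.futures.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SemaphoreBoundedExecutor:
    """Hypothetical wrapper: at most `bound` tasks are queued or running."""

    def __init__(self, max_workers, bound):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(bound)

    def submit(self, fn, *args, **kwargs):
        # Blocks the producer while `bound` tasks are still pending.
        self._slots.acquire()
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()  # submission failed; give the slot back
            raise
        # Free the slot when the task finishes, whether it succeeded or raised.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._pool.shutdown(wait=wait)


executor = SemaphoreBoundedExecutor(max_workers=4, bound=8)
futures = [executor.submit(pow, i, 2) for i in range(100)]
executor.shutdown(wait=True)
results = [f.result() for f in futures]
print(results[:5])
```

Either way the effect is the same: the loop that reads file1 is slowed down to the pace of the workers, so memory use stays roughly constant regardless of the file size.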