I am really not understanding this because this is indirectly the
same as previous as it waits to collect all tasks before launching
them.
No, you're wrong. When a task is created with asyncio.ensure_future, it starts executing the call_api coroutine immediately. Here's how tasks work in asyncio:

import asyncio


async def test(i):
    print(f'{i} started')
    await asyncio.sleep(i)


async def main():
    tasks = [
        asyncio.ensure_future(test(i))
        for i
        in range(3)
    ]

    await asyncio.sleep(0)
    print('At this moment tasks are already started')

    await asyncio.wait(tasks)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Output:

0 started
1 started
2 started
At this moment tasks are already started
The problem with your approach is that process_individual_file is not actually asynchronous: it does a lot of CPU-bound work without returning control to the asyncio event loop. That's a problem: the function blocks the event loop, which makes it impossible for other tasks to execute.
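To see that blocking behavior in isolation, here is a minimal self-contained sketch (ticker and blocking_work are made-up names, and time.sleep just stands in for the CPU-bound work that form_json does on many lines): while the blocking coroutine runs, the already-scheduled task never gets a chance to print.

import asyncio
import time


async def ticker():
    # a background task that wants to print roughly every 0.1 s
    for _ in range(3):
        print('tick')
        await asyncio.sleep(0.1)


async def blocking_work():
    # CPU-bound section with no awaits: the event loop can't switch away from it
    time.sleep(1)  # stands in for heavy per-line work such as form_json
    print('blocking work finished')


async def main():
    task = asyncio.ensure_future(ticker())
    await blocking_work()  # not a single 'tick' is printed until this returns
    await task


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())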
A very simple but effective solution, I think, is to manually return control to the event loop with asyncio.sleep(0) every so often inside process_individual_file, for example, on reading each line:

async def process_individual_file(source, input_file):
    json_array = []
    tasks = []
    limit = 2000

    with open(source + input_file) as sf:
        for line in sf:
            await asyncio.sleep(0)  # return control to the event loop so it can run pending tasks

            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                json_array = []
                limit = 2000

    await asyncio.wait(tasks)
Upd:

there will be more than millions of requests to be done and hence I am
feeling uncomfortable to store future objects for all of them in a
list
That makes sense. Nothing good will come of running millions of network requests in parallel. The usual way to set a limit in this situation is to use a synchronization primitive like asyncio.Semaphore.
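As a tiny illustration of the primitive on its own (fake_request and the numbers here are just for demonstration, not part of your code), a semaphore initialized with 2 lets only two of the six scheduled coroutines run at a time; the rest wait inside acquire until a slot is released:

import asyncio

sem = asyncio.Semaphore(2)  # at most 2 "requests" in flight at once


async def fake_request(i):
    async with sem:  # acquire on enter, release on exit
        print(f'request {i} running')
        await asyncio.sleep(1)


async def main():
    await asyncio.wait([asyncio.ensure_future(fake_request(i)) for i in range(6)])


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())

The requests are printed roughly two at a time, about a second apart.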
I advise you to make a generator that yields json_array batches from the file, acquire the semaphore before adding a new task, and release it when the task is done. You'll get clean code that is protected from too many tasks running in parallel.
It would look something like this:

def get_json_array(input_file):
    json_array = []
    limit = 2000

    with open(input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                yield json_array  # the generator lets us split the file-reading logic from adding tasks
                json_array = []
                limit = 2000


sem = asyncio.Semaphore(50)  # don't allow more than 50 parallel requests


async def process_individual_file(input_file):
    for json_array in get_json_array(input_file):
        await sem.acquire()  # file reading won't resume until there's room for a new task
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())  # when a task is done, free a slot for the next one
        task.add_done_callback(lambda t: print(t.result()))  # print the result of each call_api once it's done
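Finally, a hedged sketch of how the whole thing might be driven (the file names are hypothetical, and the final drain loop is my own addition, not something required by the code above): since process_individual_file returns as soon as the last batch has been scheduled, the driver re-acquires every semaphore slot so it only finishes once all outstanding call_api tasks have completed and released their slots.

async def main():
    await asyncio.wait([
        asyncio.ensure_future(process_individual_file(name))
        for name in ('input_1.json', 'input_2.json')  # hypothetical file names
    ])
    # drain the semaphore: once all 50 slots are held again,
    # every scheduled call_api task has released its slot, i.e. finished
    for _ in range(50):
        await sem.acquire()


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())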