In Twisted you may need to run several tasks concurrently and perform some action once all of them have completed. In the example below there are two hypothetical tasks:
- Task 1: pull URLs from a generator and batch-download them with Twisted's Cooperator.
- Task 2: take each downloaded page source and parse it asynchronously.
We want to wrap all of the fetch and parse work into a single Deferred that fires only after every page has been downloaded and every page source has been parsed.
2. Solutions
Solution 1:
from twisted.internet import defer, task, reactor, threads
from twisted.web.client import getPage

BATCH_SIZE = 5

def main_task():
    result = defer.Deferred()
    state = {'count': 0, 'done': False}

    def on_parse_finish(r):
        state['count'] -= 1
        if state['done'] and state['count'] == 0:
            result.callback(True)

    def process(source):
        # Start a parse but do not return its Deferred: downloads are
        # not throttled by parsing (this is the unbounded-memory risk).
        deferred = parse(source)
        state['count'] += 1
        deferred.addCallback(on_parse_finish)

    def fetch_urls():
        for url in get_urls():
            deferred = getPage(url)
            deferred.addCallback(process)
            yield deferred

    def on_finish(r):
        state['done'] = True
        # If every parse already finished before the fetches did, fire
        # the result here; on_parse_finish will never run again.
        if state['count'] == 0:
            result.callback(True)

    deferreds = []
    coop = task.Cooperator()
    urls = fetch_urls()
    for _ in xrange(BATCH_SIZE):
        deferreds.append(coop.coiterate(urls))
    main_tasks = defer.DeferredList(deferreds)
    main_tasks.addCallback(on_finish)
    return defer.DeferredList([main_tasks, result])

# `main_task` is meant to be used with `blockingCallFromThread`.
# The following should block until all fetch/parse tasks are completed:
# threads.blockingCallFromThread(reactor, main_task)
The drawback of this version is that the number of in-flight parse tasks is unbounded: with a fast network and a slow parser, memory usage can grow without limit.
Solution 2:
from twisted.internet import defer, task
from twisted.web.client import getPage

BATCH_SIZE = 5

def main_task(reactor):
    def fetch_urls():
        for url in get_urls():
            yield getPage(url).addCallback(parse)

    coop = task.Cooperator()
    urls = fetch_urls()
    return (defer.DeferredList([coop.coiterate(urls)
                                for _ in xrange(BATCH_SIZE)])
            .addCallback(task_finished))

task.react(main_task)
This version caps the number of parallel downloads, but because `parse` is chained directly onto each download Deferred, fetching and parsing are serialized within each slot; with a fast network and a slow parser, parsing becomes the bottleneck.
Solution 3:
from twisted.internet import defer, task
from twisted.web.client import getPage

PARALLEL_FETCHES = 5
PARALLEL_PARSES = 10

def main_task(reactor):
    parseSemaphore = defer.DeferredSemaphore(PARALLEL_PARSES)

    def parseWhenReady(r):
        def parallelParse(_):
            # DeferredSemaphore.release() returns None, so release in a
            # plain callback and pass the parse result through unchanged.
            def releaseAndPassThrough(result):
                parseSemaphore.release()
                return result
            parse(r).addBoth(releaseAndPassThrough)
        # Only wait for a semaphore slot here; re-acquiring every token
        # below is what waits for the parses themselves.
        return parseSemaphore.acquire().addCallback(parallelParse)

    def fetch_urls():
        for url in get_urls():
            yield getPage(url).addCallback(parseWhenReady)

    coop = task.Cooperator()
    urls = fetch_urls()
    return (defer.DeferredList([coop.coiterate(urls)
                                for _ in xrange(PARALLEL_FETCHES)])
            .addCallback(lambda done:
                         defer.DeferredList(
                             [parseSemaphore.acquire()
                              for _ in xrange(PARALLEL_PARSES)]))
            .addCallback(task_finished))

task.react(main_task)
This version bounds both the parallel downloads and the parallel parses. With a fast network and a slow parser, parsing no longer causes unbounded memory growth, at the cost of also capping download parallelism.
Pick whichever variant best matches your requirements.