我可以这样做:import grequests
from bs4 import BeautifulSoup
def get_urls_from_response(r):
soup = BeautifulSoup(r.text)
urls = [link.get('href') for link in soup.find_all('a')]
return urls
def print_url(args):
print args['url']
def recursive_urls(urls):
"""
Given a list of starting urls, recursively finds all descendant urls
recursively
"""
if len(urls) == 0:
return
rs = [grequests.get(url, hooks=dict(args=print_url)) for url in urls]
responses = grequests.map(rs)
url_lists = [get_urls_from_response(response) for response in responses]
urls = sum(url_lists, []) # flatten list of lists into a list
recursive_urls(urls)
我还没有测试代码,但总的想法就在那里。在
请注意,我使用^{}而不是{}来提高性能。grequest基本上是gevent+request,根据我的经验,这类任务的速度要快得多,因为你用gevent异步检索链接。在
编辑:以下是不使用递归的相同算法:
^{pr2}$