When crawling a list page, we usually need to paginate. The simplest way to implement pagination is a recursive call; in pseudocode:
def crawl_list(url):
    next_url = crawl(url)  # process the HTML and extract the next-page URL
    if next_url is not None:
        crawl_list(next_url)
This approach has two problems:
1. Once the recursion gets deep enough, Python raises "maximum recursion depth exceeded" (a RuntimeError in Python 2, RecursionError in Python 3).
2. Every pending call keeps its stack frame alive, so memory use grows with the number of pages.
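A minimal, self-contained sketch of the failure mode. Here crawl is a dummy stand-in (not the real fetcher) that always "finds" a next page, so the recursion never bottoms out and hits the interpreter's recursion limit:

```python
def crawl(url):
    # Dummy stand-in for the real fetcher: every page claims to have
    # a next page, so the recursion never terminates.
    return url + "+"

def crawl_list(url):
    next_url = crawl(url)
    if next_url is not None:
        crawl_list(next_url)  # one more stack frame per page

try:
    crawl_list("http://example.com/list?page=1")
except RecursionError:  # surfaced as RuntimeError in Python 2
    print("maximum recursion depth exceeded")
```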
Improved code:
def crawl_list(urls):
    for start_url in urls:
        queue = [start_url]
        while queue:
            url = queue.pop(0)     # dequeue the next page to fetch
            next_url = crawl(url)  # process it and extract the next-page URL
            if next_url is not None:
                queue.append(next_url)
By maintaining a FIFO queue in a list, the recursion becomes a loop, which eliminates both problems above.
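One caveat: list.pop(0) shifts every remaining element, so it costs O(n) per dequeue. For longer queues, collections.deque gives O(1) popleft with the same FIFO semantics. A runnable sketch, where crawl is again a hypothetical stand-in that stops after page 3:

```python
from collections import deque

def crawl(url):
    # Hypothetical stand-in for the real fetcher: read the page number
    # from the URL and return the next-page URL, or None after page 3.
    base, page = url.rsplit("=", 1)
    return base + "=" + str(int(page) + 1) if int(page) < 3 else None

def crawl_list(urls):
    visited = []
    for start_url in urls:
        queue = deque([start_url])  # FIFO queue, O(1) at both ends
        while queue:
            url = queue.popleft()
            visited.append(url)
            next_url = crawl(url)
            if next_url is not None:
                queue.append(next_url)
    return visited

print(crawl_list(["http://example.com/list?page=1"]))
# → pages 1 through 3, visited iteratively with a flat call stack
```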