Concurrent downloading means starting multiple threads or processes so that several downloads run at the same time.
A partial implementation of a multithreaded crawler is shown below. Threads within a process share memory by default, so every thread can pull from the same crawl_queue:
import threading

def process_queue():
    while True:
        try:
            url = crawl_queue.pop()
        except IndexError:
            # crawl queue is empty: exit this thread
            break
        else:
            html = D(url)  # download the page
            ...

threads = []
while threads or crawl_queue:
    # remove threads that have finished
    for thread in threads[:]:
        if not thread.is_alive():
            threads.remove(thread)
    # top up the pool while URLs remain and capacity allows
    while len(threads) < max_threads and crawl_queue:
        thread = threading.Thread(target=process_queue)
        thread.daemon = True  # daemon thread: will not block interpreter exit
        thread.start()
        threads.append(thread)
While there are URLs available to crawl, the loop above keeps creating threads until the pool reaches max_threads. During the crawl, if the queue runs out of URLs, each thread stops early as it hits the empty queue. When new URLs appear in the queue and the thread count is below the maximum, a new download thread is created again.
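Putting the pieces above together, here is a minimal runnable sketch of that thread pool. The in-memory deque standing in for crawl_queue and the stub worker standing in for the download function D are assumptions for illustration only:

```python
import threading
import time
from collections import deque

def crawl(seed_urls, worker, max_threads=5):
    """Thread-pool crawler sketch: threads share crawl_queue's memory.
    `worker` is a stand-in for the real download function D."""
    crawl_queue = deque(seed_urls)
    results = []
    lock = threading.Lock()

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break  # queue drained: this thread exits
            else:
                html = worker(url)  # stand-in for the real download
                with lock:
                    results.append((url, html))

    threads = []
    while threads or crawl_queue:
        # drop threads that have already finished
        threads = [t for t in threads if t.is_alive()]
        # top up the pool while URLs remain and capacity allows
        while len(threads) < max_threads and crawl_queue:
            thread = threading.Thread(target=process_queue)
            thread.daemon = True  # will not block interpreter exit
            thread.start()
            threads.append(thread)
        time.sleep(0.01)  # avoid a busy-wait while threads work
    return results
```

Because CPython's deque.pop() is atomic, the worker threads can share the queue without an extra lock; the lock only protects the shared results list.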
A partial implementation of a multiprocess crawler is shown below:
def threaded_crawler(...):
    ...
    crawl_queue.push(seed_url)
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except KeyError:
                # crawl queue is empty: exit this thread
                break
            else:
                ...
                crawl_queue.complete(url)
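The crawl_queue used here exposes three operations: push() adds a URL, pop() hands out an outstanding URL (raising KeyError when none remain, which matches the loop above), and complete() marks a URL as finished. The text does not specify the backing store; below is a minimal in-memory sketch of that assumed interface. A real implementation would live in an external store (e.g. a database) so that separate processes could share it:

```python
import threading

class CrawlQueue:
    """In-memory sketch of the assumed push/pop/complete queue interface.
    Each URL moves through OUTSTANDING -> PROCESSING -> COMPLETE."""
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self):
        self.status = {}
        self.lock = threading.Lock()

    def push(self, url):
        # register a URL, but never reset one already seen
        with self.lock:
            self.status.setdefault(url, self.OUTSTANDING)

    def pop(self):
        # hand out an outstanding URL and mark it in progress
        with self.lock:
            for url, state in self.status.items():
                if state == self.OUTSTANDING:
                    self.status[url] = self.PROCESSING
                    return url
            raise KeyError('queue is empty')

    def complete(self, url):
        with self.lock:
            self.status[url] = self.COMPLETE
```

Tracking per-URL state rather than just holding a list is what lets pop() distinguish "nothing left to hand out" from "everything is done", and lets a crashed URL be re-queued later if desired.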
import multiprocessing

def process_link_crawler(args, **kwargs):
    # start one crawler per CPU core
    num_cpus = multiprocessing.cpu_count()
    print('Starting {} processes'.format(num_cpus))
    processes = []
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler,
                                    args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for all crawler processes to finish
    for p in processes:
        p.join()
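Unlike threads, processes do not share memory, so work has to be exchanged explicitly. The following self-contained sketch (the names worker and process_crawler are illustrative, not from the text) starts a fixed number of processes and passes URLs and results through multiprocessing.Queue:

```python
import multiprocessing

def worker(task_queue, result_queue):
    # drain tasks until the None sentinel is seen
    while True:
        url = task_queue.get()
        if url is None:
            break
        result_queue.put((url, 'downloaded:' + url))  # stand-in for download

def process_crawler(urls, num_procs=None):
    """Start one worker process per CPU (or num_procs).
    Processes do not share memory, so queues carry the work."""
    if num_procs is None:
        num_procs = multiprocessing.cpu_count()
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    for url in urls:
        task_queue.put(url)
    processes = []
    for _ in range(num_procs):
        p = multiprocessing.Process(target=worker,
                                    args=(task_queue, result_queue))
        p.start()
        processes.append(p)
    for _ in range(num_procs):
        task_queue.put(None)  # one sentinel per worker
    # collect all results before joining, so queue buffers never fill up
    results = {}
    for _ in urls:
        url, html = result_queue.get()
        results[url] = html
    for p in processes:
        p.join()
    return results
```

The sentinels are pushed after the real tasks, so the FIFO queue guarantees every URL is handed out before any worker sees a None and exits.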