Concurrent downloading means starting multiple threads or processes so that several downloads run at the same time.
A partial implementation of a multithreaded crawler is shown below. Threads within a process share memory by default, so every thread can pull from the same crawl_queue:
import threading

def process_queue():
    while True:
        try:
            url = crawl_queue.pop()
        except IndexError:
            # crawl queue is empty: exit this thread
            break
        else:
            html = D(url)  # download the page
            ...

threads = []
while threads or crawl_queue:
    # remove threads that have finished
    for thread in threads[:]:
        if not thread.is_alive():
            threads.remove(thread)
    # top up the pool while URLs remain and capacity allows
    while len(threads) < max_threads and crawl_queue:
        thread = threading.Thread(target=process_queue)
        thread.daemon = True  # daemon thread: will not block interpreter exit
        thread.start()
        threads.append(thread)
While there are URLs available to crawl, the loop above keeps creating threads until the pool reaches max_threads. During the crawl, if the queue runs out of URLs, each thread stops early as it hits the empty queue. When new URLs appear in the queue and the thread count is below the maximum, a new download thread is created again.
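Putting the pieces above together, here is a minimal runnable sketch of that thread pool. The in-memory deque standing in for crawl_queue and the stub worker standing in for the download function D are assumptions for illustration only:

```python
import threading
import time
from collections import deque

def crawl(seed_urls, worker, max_threads=5):
    """Thread-pool crawler sketch: threads share crawl_queue's memory.
    `worker` is a stand-in for the real download function D."""
    crawl_queue = deque(seed_urls)
    results = []
    lock = threading.Lock()

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break  # queue drained: this thread exits
            else:
                html = worker(url)  # stand-in for the real download
                with lock:
                    results.append((url, html))

    threads = []
    while threads or crawl_queue:
        # drop threads that have already finished
        threads = [t for t in threads if t.is_alive()]
        # top up the pool while URLs remain and capacity allows
        while len(threads) < max_threads and crawl_queue:
            thread = threading.Thread(target=process_queue)
            thread.daemon = True  # will not block interpreter exit
            thread.start()
            threads.append(thread)
        time.sleep(0.01)  # avoid a busy-wait while threads work
    return results
```

Because CPython's deque.pop() is atomic, the worker threads can share the queue without an extra lock; the lock only protects the shared results list.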
A partial implementation of a multiprocess crawler is shown below:
def threaded_crawler(...):
    ...
    crawl_queue.push(seed_url)
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except KeyError:
                # crawl queue is empty: exit this thread
                break
            else:
                ...
                crawl_queue.complete(url)
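The crawl_queue used here exposes three operations: push() adds a URL, pop() hands out an outstanding URL (raising KeyError when none remain, which matches the loop above), and complete() marks a URL as finished. The text does not specify the backing store; below is a minimal in-memory sketch of that assumed interface. A real implementation would live in an external store (e.g. a database) so that separate processes could share it:

```python
import threading

class CrawlQueue:
    """In-memory sketch of the assumed push/pop/complete queue interface.
    Each URL moves through OUTSTANDING -> PROCESSING -> COMPLETE."""
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self):
        self.status = {}
        self.lock = threading.Lock()

    def push(self, url):
        # register a URL, but never reset one already seen
        with self.lock:
            self.status.setdefault(url, self.OUTSTANDING)

    def pop(self):
        # hand out an outstanding URL and mark it in progress
        with self.lock:
            for url, state in self.status.items():
                if state == self.OUTSTANDING:
                    self.status[url] = self.PROCESSING
                    return url
            raise KeyError('queue is empty')

    def complete(self, url):
        with self.lock:
            self.status[url] = self.COMPLETE
```

Tracking per-URL state rather than just holding a list is what lets pop() distinguish "nothing left to hand out" from "everything is done", and lets a crashed URL be re-queued later if desired.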
import multiprocessing

def process_link_crawler(args, **kwargs):
    # start one crawler per CPU core
    num_cpus = multiprocessing.cpu_count()
    print('Starting {} processes'.format(num_cpus))
    processes = []
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler,
                                    args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for all crawler processes to finish
    for p in processes:
        p.join()
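Unlike threads, processes do not share memory, so work has to be exchanged explicitly. The following self-contained sketch (the names worker and process_crawler are illustrative, not from the text) starts a fixed number of processes and passes URLs and results through multiprocessing.Queue:

```python
import multiprocessing

def worker(task_queue, result_queue):
    # drain tasks until the None sentinel is seen
    while True:
        url = task_queue.get()
        if url is None:
            break
        result_queue.put((url, 'downloaded:' + url))  # stand-in for download

def process_crawler(urls, num_procs=None):
    """Start one worker process per CPU (or num_procs).
    Processes do not share memory, so queues carry the work."""
    if num_procs is None:
        num_procs = multiprocessing.cpu_count()
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    for url in urls:
        task_queue.put(url)
    processes = []
    for _ in range(num_procs):
        p = multiprocessing.Process(target=worker,
                                    args=(task_queue, result_queue))
        p.start()
        processes.append(p)
    for _ in range(num_procs):
        task_queue.put(None)  # one sentinel per worker
    # collect all results before joining, so queue buffers never fill up
    results = {}
    for _ in urls:
        url, html = result_queue.get()
        results[url] = html
    for p in processes:
        p.join()
    return results
```

The sentinels are pushed after the real tasks, so the FIFO queue guarantees every URL is handed out before any worker sees a None and exits.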