【WebScraping】Parallel Downloading: Multi-threaded & Multi-process Crawlers

Original post: 2016-11-08 14:57:46

While one thread is blocked waiting on a download, the process can switch to another thread instead of wasting CPU time. The idea, then, is to spread the downloads across multiple threads and processes.
【Approach】
For the queue of URLs waiting to be crawled:
(1) If the queue is kept in local memory, only a single process can work on it,
but that process can split the work across multiple threads,
which gives the multi-threaded crawler;
(2) If the queue is stored externally (a MongoDB-backed queue), crawlers on different servers can cooperate on the same crawl, so multiple processes can work on the same queue at once.
Each of those processes can still run multiple threads (each new process starts its own multi-threaded crawler),
which gives the multi-process crawler.
(A preview sketch of both entry points follows below.)
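
As a rough preview (a sketch only; the module and function names match the test scripts further down), the two entry points differ only in where the URL queue lives:

#coding: utf-8
from mongo_cache import MongoCache
from alexa_cb import AlexaCallback
from threaded_crawler import threaded_crawler   # (1) in-memory queue: one process, many threads
from process_crawler import process_crawler     # (2) MongoDB queue: one threaded crawler per CPU

scrape_callback = AlexaCallback()
# single process, multiple threads
threaded_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=MongoCache(), max_threads=10)
# or: multiple processes, each running the multi-threaded crawler
process_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=MongoCache(), max_threads=10)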

【Implementation】
<Cache class>

#coding: utf-8
try:
    import cPickle as pickle
except ImportError:
    import pickle
import zlib
from datetime import datetime, timedelta
from pymongo import MongoClient
from bson.binary import Binary


class MongoCache:
    """
    Wrapper around MongoDB to cache downloads

    >>> cache = MongoCache()
    >>> cache.clear()
    >>> url = 'http://example.webscraping.com'
    >>> result = {'html': '...'}
    >>> cache[url] = result
    >>> cache[url]['html'] == result['html']
    True
    >>> cache = MongoCache(expires=timedelta())
    >>> cache[url] = result
    >>> # the TTL monitor purges expired records roughly every 60 seconds: http://docs.mongodb.org/manual/core/index-ttl/
    >>> import time; time.sleep(60)
    >>> cache[url]
    Traceback (most recent call last):
     ...
    KeyError: 'http://example.webscraping.com does not exist'
    """

    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed,
        # try connecting to MongoDB on the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create a collection to store cached webpages,
        # which is the equivalent of a table in a relational database
        self.db = self.client.cache
        self.db.webpage.create_index('timestamp', expireAfterSeconds=expires.total_seconds())

    def __contains__(self, url):
        try:
            self[url]
        except KeyError:
            return False
        else:
            return True

    def __getitem__(self, url):
        """Load value at this URL
           获得该URL缓存
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            # return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
           保存该URL缓存
        """
        # record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))), 'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)

    def clear(self):
        self.db.webpage.drop()
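
A minimal usage sketch (assumptions: a local mongod on the default port, and the class above saved as mongo_cache.py, matching the imports used later):

#coding: utf-8
from datetime import timedelta
from mongo_cache import MongoCache

cache = MongoCache(expires=timedelta(days=7))
url = 'http://example.webscraping.com'
cache[url] = {'html': '<html>...</html>', 'code': 200}   # pickled, compressed and upserted
if url in cache:                                          # __contains__ goes through __getitem__
    print cache[url]['code']                              # -> 200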

<Downloader class (with caching and throttling)>

#coding: utf-8
import urlparse
import urllib2
import random
import time
from datetime import datetime, timedelta
import socket

DEFAULT_AGENT = 'wswp'
DEFAULT_DELAY = 5
DEFAULT_RETRIES = 1
DEFAULT_TIMEOUT = 60

"""支持缓存的下载器(缓存检查、限速功能)
重构成类: 参数只要在构造方法时设置一次,实现在后续下载中多次复用
"""
class Downloader:
    def __init__(self, delay=DEFAULT_DELAY, user_agent=DEFAULT_AGENT, proxies=None, num_retries=DEFAULT_RETRIES,
                 timeout=DEFAULT_TIMEOUT, opener=None, cache=None):
        socket.setdefaulttimeout(timeout)
        self.throttle = Throttle(delay)  # rate limiter
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.opener = opener
        self.cache = cache

    """实现在下载前检查缓存的功能"""
    def __call__(self, url):
        result = None
        if self.cache:
            """检查缓存是否已经定义"""
            try:
                """加载cache内缓存数据"""
                result = self.cache[url]
            except KeyError:
                # url is not available in cache
                pass
            else:
                """若该url已被缓存,检查之前的下载中是否遇到了服务端错误
                   如果没有发生过服务器端错误,则可以继续使用该缓存"""
                if self.num_retries > 0 and 500 <= result['code'] < 600:
                    # server error so ignore result from cache and re-download
                    result = None
        if result is None:
            """缓存结果不可用,正常下载该url,结果添加到缓存中"""
            # result was not loaded from cache so still need to download
            self.throttle.wait(url)
            proxy = random.choice(self.proxies) if self.proxies else None
            headers = {'User-agent': self.user_agent}
            result = self.download(url, headers, proxy=proxy, num_retries=self.num_retries)
            if self.cache:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxy, num_retries, data=None):
        print 'Downloading:', url
        request = urllib2.Request(url, data, headers or {})
        opener = self.opener or urllib2.build_opener()
        if proxy:
            proxy_params = {urlparse.urlparse(url).scheme: proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
        try:
            response = opener.open(request)
            html = response.read()
            code = response.code
        except Exception as e:
            print 'Download error:', str(e)
            html = ''
            if hasattr(e, 'code'):
                code = e.code
                if num_retries > 0 and 500 <= code < 600:
                    # retry 5XX HTTP errors
                    return self.download(url, headers, proxy, num_retries - 1, data)
            else:
                code = None
        return {'html': html, 'code': code}

# rate limiter
class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """

    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if have accessed this domain recently
        """
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
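
A quick sketch of how the downloader and cache fit together (assuming the code above is saved as downloader.py and mongo_cache.py, as the crawler imports below expect):

#coding: utf-8
from downloader import Downloader
from mongo_cache import MongoCache

D = Downloader(delay=5, num_retries=1, cache=MongoCache())
html = D('http://example.webscraping.com')   # throttled download, result stored in the cache
html = D('http://example.webscraping.com')   # served from MongoDB, no HTTP request this time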

<Site-specific callback: AlexaCallback>

import csv
from zipfile import ZipFile
from StringIO import StringIO
from mongo_cache import MongoCache


class AlexaCallback:
    def __init__(self, max_urls=1000):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        if url == self.seed_url:
            urls = []
            cache = MongoCache()
            with ZipFile(StringIO(html)) as zf:
                csv_filename = zf.namelist()[0]
                for _, website in csv.reader(zf.open(csv_filename)):
                    if 'http://' + website not in cache:
                        urls.append('http://' + website)
                        if len(urls) == self.max_urls:
                            break
            return urls
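
To make the callback concrete, here is a sketch of driving it by hand (normally threaded_crawler does this; it assumes a local mongod, since AlexaCallback opens a MongoCache internally, and the printed sites are only illustrative):

#coding: utf-8
from downloader import Downloader
from alexa_cb import AlexaCallback

scrape_callback = AlexaCallback(max_urls=10)
D = Downloader()
zip_data = D(scrape_callback.seed_url)                      # raw bytes of top-1m.csv.zip
urls = scrape_callback(scrape_callback.seed_url, zip_data)  # up to 10 not-yet-cached URLs
print urls[:3]   # e.g. ['http://google.com', 'http://youtube.com', 'http://facebook.com']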

<Multi-threaded crawler>

#coding: utf-8
import time
import threading
import urlparse
from downloader import Downloader

SLEEP_TIME = 1



def threaded_crawler(seed_url, delay=5, cache=None, scrape_callback=None, user_agent='wswp', proxies=None, num_retries=1, max_threads=10, timeout=60):
    """Crawl this website in multiple threads
    """
    # the queue of URL's that still need to be crawled
    #crawl_queue = Queue.deque([seed_url])
    crawl_queue = [seed_url]
    # the URL's that have been seen
    seen = set([seed_url])
    D = Downloader(cache=cache, delay=delay, user_agent=user_agent, proxies=proxies, num_retries=num_retries, timeout=timeout)

    """多个线程中启动process_queue,并等待其完成
       其中使用了定制的回调函数,实现对不同网页的处理
    """
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                # crawl queue is empty
                break
            else:
                html = D(url)
                if scrape_callback:
                    """scrape_callback针对不同网站进行定制
                    传入回调可实现该爬虫对不同网站的处理
                    """
                    try:
                        links = scrape_callback(url, html) or []
                    except Exception as e:
                        print 'Error in callback for: {}: {}'.format(url, e)
                    else:
                        for link in links:
                            # normalize the link relative to the seed URL
                            link = normalize(seed_url, link)
                            # check whether already crawled this link
                            if link not in seen:
                                seen.add(link)
                                # add this new link to queue
                                crawl_queue.append(link)


    """等待所有下载线程完成
    """
    # wait for all download threads to finish
    threads = []
    while threads or crawl_queue:
        # the crawl is still active
        for thread in threads:
            if not thread.is_alive():
                # remove the stopped threads
                threads.remove(thread)
        """当线程池中线程未达到最大值 且尚有URL可爬取,不断创建新线程
        """
        while len(threads) < max_threads and crawl_queue:
            # can start some more threads
            thread = threading.Thread(target=process_queue)
            thread.setDaemon(True) # set daemon so main thread can exit when receives ctrl-c
            thread.start()
            threads.append(thread)
        # all threads have been processed
        # sleep temporarily so CPU can focus execution on other threads
        time.sleep(SLEEP_TIME)


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    通过删除哈希和添加域规范这个网址
    """
    link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)

<Multi-threaded crawler test>

# -*- coding: utf-8 -*-

import sys
from threaded_crawler import threaded_crawler
from mongo_cache import MongoCache
from alexa_cb import AlexaCallback


def main(max_threads):
    scrape_callback = AlexaCallback()
    cache = MongoCache()
    #cache.clear()
    threaded_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=cache, max_threads=max_threads, timeout=10)


if __name__ == '__main__':
    #max_threads = int(sys.argv[1])
    max_threads = 3
    main(max_threads)

<Crawl queue for the multi-process crawler>

#coding: utf-8
from datetime import datetime, timedelta
from pymongo import MongoClient, errors


class MongoQueue:
    """
    >>> timeout = 1
    >>> url = 'http://example.webscraping.com'
    >>> q = MongoQueue(timeout=timeout)
    >>> q.clear() # ensure empty queue
    >>> q.push(url) # add test URL
    >>> q.peek() == q.pop() == url # pop back this URL
    True
    >>> q.repair() # an immediate repair will do nothing
    >>> q.pop() # another pop should be empty
    >>> q.peek()
    >>> import time; time.sleep(timeout) # wait for timeout
    >>> q.repair() # now repair will release URL
    Released: http://example.webscraping.com
    >>> q.pop() == url # pop URL again
    True
    >>> bool(q) # queue is still active while outstanding
    True
    >>> q.complete(url) # complete this URL
    >>> bool(q) # nothing left to process, so the queue reports inactive
    False
    """

    # possible states of a download
    """定义队列的三种状态"""
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self, client=None, timeout=300):
        """
        client: mongo database client (defaults to a local connection)
        timeout: the number of seconds after which an in-progress job is considered stalled
        """
        self.client = MongoClient() if client is None else client
        self.db = self.client.cache
        self.timeout = timeout  # jobs running longer than this are assumed to have failed and are reclaimed by repair()

    def __nonzero__(self):
        """Returns True if there are more jobs to process
        """
        record = self.db.crawl_queue.find_one(
            {'status': {'$ne': self.COMPLETE}}
        )
        return True if record else False

    def push(self, url):
        """Add new URL to queue if does not exist
        """
        try:
            """添加一个新的URL时,状态为OUTSTANDING"""
            self.db.crawl_queue.insert({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError as e:
            pass # this is already in the queue

    def pop(self):
        """Get an outstanding URL from the queue and set its status to processing.
        If the queue is empty a KeyError exception is raised.
        """
        """从队列中取出一个URL,状态切换为PROCESSING"""
        record = self.db.crawl_queue.find_and_modify(
            query={'status': self.OUTSTANDING},
            update={'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}}
        )
        if record:
            return record['_id']
        else:
            self.repair()
            raise KeyError()

    def peek(self):
        record = self.db.crawl_queue.find_one({'status': self.OUTSTANDING})
        if record:
            return record['_id']

    """下载结束,队列状态COMPLETE"""
    def complete(self, url):
        self.db.crawl_queue.update({'_id': url}, {'$set': {'status': self.COMPLETE}})

    """取出的URL无法正常完成时,处理URL进程被终止的情况"""
    def repair(self):
        """Release stalled jobs
        """
        record = self.db.crawl_queue.find_and_modify(
            query={
                'timestamp': {'$lt': datetime.now() - timedelta(seconds=self.timeout)},
                'status': {'$ne': self.COMPLETE}
            },
            update={'$set': {'status': self.OUTSTANDING}}
        )
        if record:
            print 'Released:', record['_id']

    def clear(self):
        self.db.crawl_queue.drop()
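
A sketch of the queue lifecycle that each crawl thread relies on (assuming a local mongod and the class above saved as mongo_queue.py; the short timeout just makes repair() easy to observe):

#coding: utf-8
from mongo_queue import MongoQueue

q = MongoQueue(timeout=5)
q.clear()
q.push('http://example.webscraping.com')   # state: OUTSTANDING
url = q.pop()                              # state: PROCESSING, with a timestamp
# ... download and parse url here ...
q.complete(url)                            # state: COMPLETE
# If a worker dies before calling complete(), any PROCESSING record older than
# `timeout` seconds is reset to OUTSTANDING the next time repair() runs
# (pop() calls repair() automatically when it finds nothing to do).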

<Multi-process crawler>

#coding: utf-8
import time
import urlparse
import threading
import multiprocessing
from mongo_cache import MongoCache
from mongo_queue import MongoQueue
from downloader import Downloader

SLEEP_TIME = 1


def threaded_crawler(seed_url, delay=5, cache=None, scrape_callback=None, user_agent='wswp', proxies=None, num_retries=1, max_threads=10, timeout=60):
    """Crawl using multiple threads
    """
    # the queue of URL's that still need to be crawled
    crawl_queue = MongoQueue()  # MongoDB-backed crawl queue
    crawl_queue.clear()
    crawl_queue.push(seed_url)  # the queue itself already handles duplicate URLs
    D = Downloader(cache=cache, delay=delay, user_agent=user_agent, proxies=proxies, num_retries=num_retries, timeout=timeout)

    def process_queue():
        while True:
            # pop a URL and mark it as being processed
            try:
                url = crawl_queue.pop()
            except KeyError:
                # currently no urls to process
                break
            else:
                html = D(url)
                if scrape_callback:
                    try:
                        links = scrape_callback(url, html) or []
                    except Exception as e:
                        print 'Error in callback for: {}: {}'.format(url, e)
                    else:
                        for link in links:
                            # add this new link to queue
                            crawl_queue.push(normalize(seed_url, link))
                """最后调用complete()方法,用于记录该URL已经被成功解析"""
                crawl_queue.complete(url)


    # wait for all download threads to finish
    threads = []
    while threads or crawl_queue:
        for thread in threads:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < max_threads and crawl_queue.peek():
            # can start some more threads
            thread = threading.Thread(target=process_queue)
            thread.setDaemon(True) # set daemon so main thread can exit when receives ctrl-c
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)


def process_crawler(args, **kwargs):
    # number of available CPUs
    num_cpus = multiprocessing.cpu_count()  
    #pool = multiprocessing.Pool(processes=num_cpus)
    print 'Starting {} processes'.format(num_cpus)
    processes = []
    """在每个新进程中启动多线程爬虫"""
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler, args=[args], kwargs=kwargs)
        #parsed = pool.apply_async(threaded_link_crawler, args, kwargs)
        p.start()
        processes.append(p)
    # wait for all processes to complete
    for p in processes:
        p.join()

"""规范化URL"""
def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    """
    link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)

<Multi-process crawler test>

import sys
from process_crawler import process_crawler
from mongo_cache import MongoCache
from alexa_cb import AlexaCallback


def main(max_threads):
    scrape_callback = AlexaCallback()
    cache = MongoCache()
    cache.clear()
    process_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=cache, max_threads=max_threads, timeout=10)


if __name__ == '__main__':
    max_threads = int(sys.argv[1])
    main(max_threads)
Copyright: original work by ShirleyPaul; do not repost without the author's permission.
