【WebScraping】Parallel Downloads: Multithreaded & Multiprocess Crawlers

Original post, 2016-11-08 14:57:46

While one thread is waiting on a download, the process can switch to another thread instead of wasting CPU time. The idea, then, is to spread the downloads across multiple processes and threads.
【Approach】
Everything hinges on where the queue of URLs waiting to be crawled is kept:
(1) If the queue lives in local memory, only a single process can work on it, but that process can run several threads that each take URLs from the queue;
this is the multithreaded crawler.
(2) If the queue is stored externally (a MongoDB-backed queue), crawlers on different servers can cooperate on the same crawl, so multiple processes consume the same queue at once;
each of those processes still runs the multithreaded crawler internally (every new process starts its own threads);
this is the multiprocess crawler.
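
A minimal sketch of the two entry points just described (the modules threaded_crawler, process_crawler and mongo_cache are the code sections below; the seed URL is only a placeholder, and without a scrape callback the crawl would stop after the seed page):

#coding: utf-8
# Option 1: a single process running many threads over an in-memory queue
from threaded_crawler import threaded_crawler
from mongo_cache import MongoCache
threaded_crawler('http://example.webscraping.com', cache=MongoCache(), max_threads=5)

# Option 2: one process per CPU core, each running the multithreaded crawler
# against a shared MongoDB-backed queue
from process_crawler import process_crawler
process_crawler('http://example.webscraping.com', cache=MongoCache(), max_threads=5)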

【Implementation】
<Cache class>

#coding: utf-8
try:
    import cPickle as pickle
except ImportError:
    import pickle
import zlib
from datetime import datetime, timedelta
from pymongo import MongoClient
from bson.binary import Binary


class MongoCache:
    """
    Wrapper around MongoDB to cache downloads

    >>> cache = MongoCache()
    >>> cache.clear()
    >>> url = 'http://example.webscraping.com'
    >>> result = {'html': '...'}
    >>> cache[url] = result
    >>> cache[url]['html'] == result['html']
    True
    >>> cache = MongoCache(expires=timedelta())
    >>> cache[url] = result
    >>> # expired records are purged by the TTL monitor, which runs every 60 seconds: http://docs.mongodb.org/manual/core/index-ttl/
    >>> import time; time.sleep(60)
    >>> cache[url]
    Traceback (most recent call last):
     ...
    KeyError: 'http://example.webscraping.com does not exist'
    """

    def __init__(self, client=None, expires=timedelta(days=30)):
        """
        client: mongo database client
        expires: timedelta of amount of time before a cache entry is considered expired
        """
        # if a client object is not passed,
        # connect to MongoDB at the default localhost port
        self.client = MongoClient('localhost', 27017) if client is None else client
        # create a collection to store cached webpages
        # (the equivalent of a table in a relational database)
        self.db = self.client.cache
        self.db.webpage.create_index('timestamp', expireAfterSeconds=expires.total_seconds())

    def __contains__(self, url):
        try:
            self[url]
        except KeyError:
            return False
        else:
            return True

    def __getitem__(self, url):
        """Load value at this URL
           获得该URL缓存
        """
        record = self.db.webpage.find_one({'_id': url})
        if record:
            # return record['result']
            return pickle.loads(zlib.decompress(record['result']))
        else:
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save value for this URL
           保存该URL缓存
        """
        # record = {'result': result, 'timestamp': datetime.utcnow()}
        record = {'result': Binary(zlib.compress(pickle.dumps(result))), 'timestamp': datetime.utcnow()}
        self.db.webpage.update({'_id': url}, {'$set': record}, upsert=True)

    def clear(self):
        self.db.webpage.drop()
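
A quick interactive check of the cache, as a sketch that assumes a MongoDB instance is listening on the default localhost:27017 (the {'html', 'code'} record shape matches what the downloader below stores):

#coding: utf-8
from mongo_cache import MongoCache

cache = MongoCache()                     # connects to MongoDB on localhost:27017
url = 'http://example.webscraping.com'
cache[url] = {'html': '<html>...</html>', 'code': 200}  # pickled, zlib-compressed, upserted
print url in cache                       # True: __contains__ delegates to __getitem__
print cache[url]['code']                 # 200
cache.clear()                            # drops the whole webpage collection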

<Downloader class (cache-aware, with rate limiting)>

#coding: utf-8
import urlparse
import urllib2
import random
import time
from datetime import datetime, timedelta
import socket

DEFAULT_AGENT = 'wswp'
DEFAULT_DELAY = 5
DEFAULT_RETRIES = 1
DEFAULT_TIMEOUT = 60

"""支持缓存的下载器(缓存检查、限速功能)
重构成类: 参数只要在构造方法时设置一次,实现在后续下载中多次复用
"""
class Downloader:
    def __init__(self, delay=DEFAULT_DELAY, user_agent=DEFAULT_AGENT, proxies=None, num_retries=DEFAULT_RETRIES,
                 timeout=DEFAULT_TIMEOUT, opener=None, cache=None):
        socket.setdefaulttimeout(timeout)
        self.throttle = Throttle(delay)  # throttle requests to the same domain
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.opener = opener
        self.cache = cache

    """实现在下载前检查缓存的功能"""
    def __call__(self, url):
        result = None
        if self.cache:
            """检查缓存是否已经定义"""
            try:
                """加载cache内缓存数据"""
                result = self.cache[url]
            except KeyError:
                # url is not available in cache
                pass
            else:
                """若该url已被缓存,检查之前的下载中是否遇到了服务端错误
                   如果没有发生过服务器端错误,则可以继续使用该缓存"""
                if self.num_retries > 0 and 500 <= result['code'] < 600:
                    # server error so ignore result from cache and re-download
                    result = None
        if result is None:
            """缓存结果不可用,正常下载该url,结果添加到缓存中"""
            # result was not loaded from cache so still need to download
            self.throttle.wait(url)
            proxy = random.choice(self.proxies) if self.proxies else None
            headers = {'User-agent': self.user_agent}
            result = self.download(url, headers, proxy=proxy, num_retries=self.num_retries)
            if self.cache:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxy, num_retries, data=None):
        print 'Downloading:', url
        request = urllib2.Request(url, data, headers or {})
        opener = self.opener or urllib2.build_opener()
        if proxy:
            proxy_params = {urlparse.urlparse(url).scheme: proxy}
            opener.add_handler(urllib2.ProxyHandler(proxy_params))
        try:
            response = opener.open(request)
            html = response.read()
            code = response.code
        except Exception as e:
            print 'Download error:', str(e)
            html = ''
            if hasattr(e, 'code'):
                code = e.code
                if num_retries > 0 and 500 <= code < 600:
                    # retry 5XX HTTP errors
                    return self.download(url, headers, proxy, num_retries - 1, data)
            else:
                code = None
        return {'html': html, 'code': code}

# rate limiter
class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """

    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        """Delay if have accessed this domain recently
        """
        domain = urlparse.urlsplit(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()
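
Wiring the downloader to the cache looks roughly like this (a sketch; it assumes the modules are saved as downloader.py and mongo_cache.py, which is how the crawlers below import them):

#coding: utf-8
from downloader import Downloader
from mongo_cache import MongoCache

D = Downloader(delay=5, num_retries=1, cache=MongoCache())
html = D('http://example.webscraping.com')  # throttled network download, result written to the cache
html = D('http://example.webscraping.com')  # served from the cache, unless the cached code was a 5XX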

<Custom callback for a specific site> AlexaCallback

import csv
from zipfile import ZipFile
from StringIO import StringIO
from mongo_cache import MongoCache


class AlexaCallback:
    def __init__(self, max_urls=1000):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        if url == self.seed_url:
            urls = []
            cache = MongoCache()
            with ZipFile(StringIO(html)) as zf:
                csv_filename = zf.namelist()[0]
                for _, website in csv.reader(zf.open(csv_filename)):
                    if 'http://' + website not in cache:
                        urls.append('http://' + website)
                        if len(urls) == self.max_urls:
                            break
            return urls
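
Calling the callback by hand looks roughly like the sketch below; it assumes the Alexa snapshot on S3 is still downloadable and that MongoDB is running, since the callback skips URLs already present in MongoCache:

#coding: utf-8
from downloader import Downloader
from alexa_cb import AlexaCallback

alexa = AlexaCallback(max_urls=10)
D = Downloader()
zipped_csv = D(alexa.seed_url)            # raw bytes of top-1m.csv.zip
urls = alexa(alexa.seed_url, zipped_csv)  # up to 10 sites not yet in the cache
print len(urls), urls[:3]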

<Multithreaded crawler>

#coding: utf-8
import time
import threading
import urlparse
from downloader import Downloader

SLEEP_TIME = 1



def threaded_crawler(seed_url, delay=5, cache=None, scrape_callback=None, user_agent='wswp', proxies=None, num_retries=1, max_threads=10, timeout=60):
    """Crawl this website in multiple threads
    """
    # the queue of URLs that still need to be crawled
    #crawl_queue = Queue.deque([seed_url])
    crawl_queue = [seed_url]
    # the URLs that have been seen
    seen = set([seed_url])
    D = Downloader(cache=cache, delay=delay, user_agent=user_agent, proxies=proxies, num_retries=num_retries, timeout=timeout)

    """多个线程中启动process_queue,并等待其完成
       其中使用了定制的回调函数,实现对不同网页的处理
    """
    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                # crawl queue is empty
                break
            else:
                html = D(url)
                if scrape_callback:
                    """scrape_callback针对不同网站进行定制
                    传入回调可实现该爬虫对不同网站的处理
                    """
                    try:
                        links = scrape_callback(url, html) or []
                    except Exception as e:
                        print 'Error in callback for: {}: {}'.format(url, e)
                    else:
                        for link in links:
                            # normalize the link
                            link = normalize(seed_url, link)
                            # check whether already crawled this link
                            if link not in seen:
                                seen.add(link)
                                # add this new link to queue
                                crawl_queue.append(link)


    """等待所有下载线程完成
    """
    # wait for all download threads to finish
    threads = []
    while threads or crawl_queue:
        # the crawl is still active
        for thread in threads:
            if not thread.is_alive():
                # remove the stopped threads
                threads.remove(thread)
        """当线程池中线程未达到最大值 且尚有URL可爬取,不断创建新线程
        """
        while len(threads) < max_threads and crawl_queue:
            # can start some more threads
            thread = threading.Thread(target=process_queue)
            thread.setDaemon(True) # set daemon so main thread can exit when receives ctrl-c
            thread.start()
            threads.append(thread)
        # all threads have been processed
        # sleep temporarily so CPU can focus execution on other threads
        time.sleep(SLEEP_TIME)


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    通过删除哈希和添加域规范这个网址
    """
    link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)

<Multithreaded crawler test>

# -*- coding: utf-8 -*-

import sys
from threaded_crawler import threaded_crawler
from mongo_cache import MongoCache
from alexa_cb import AlexaCallback


def main(max_threads):
    scrape_callback = AlexaCallback()
    cache = MongoCache()
    #cache.clear()
    threaded_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=cache, max_threads=max_threads, timeout=10)


if __name__ == '__main__':
    #max_threads = int(sys.argv[1])
    max_threads = 3
    main(max_threads)

<Crawl queue for the multiprocess crawler>

#coding: utf-8
from datetime import datetime, timedelta
from pymongo import MongoClient, errors


class MongoQueue:
    """
    >>> timeout = 1
    >>> url = 'http://example.webscraping.com'
    >>> q = MongoQueue(timeout=timeout)
    >>> q.clear() # ensure empty queue
    >>> q.push(url) # add test URL
    >>> q.peek() == q.pop() == url # pop back this URL
    True
    >>> q.repair() # immediate repair will do nothing
    >>> q.pop() # another pop should be empty
    >>> q.peek()
    >>> import time; time.sleep(timeout) # wait for timeout
    >>> q.repair() # now repair will release URL
    Released: http://example.webscraping.com
    >>> q.pop() == url # pop URL again
    True
    >>> bool(q) # queue is still active while outstanding
    True
    >>> q.complete(url) # complete this URL
    >>> bool(q) # queue is now complete
    False
    """

    # possible states of a download
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self, client=None, timeout=300):
        """
        host: the host to connect to MongoDB
        port: the port to connect to MongoDB
        timeout: the number of seconds to allow for a timeout
        """
        self.client = MongoClient() if client is None else client
        self.db = self.client.cache
        self.timeout = timeout  # jobs processing for longer than this are assumed to have failed and are released by repair()

    def __nonzero__(self):
        """Returns True if there are more jobs to process
        """
        record = self.db.crawl_queue.find_one(
            {'status': {'$ne': self.COMPLETE}}
        )
        return True if record else False

    def push(self, url):
        """Add new URL to queue if does not exist
        """
        try:
            """添加一个新的URL时,状态为OUTSTANDING"""
            self.db.crawl_queue.insert({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError as e:
            pass # this is already in the queue

    def pop(self):
        """Get an outstanding URL from the queue and set its status to processing.
        If the queue is empty a KeyError exception is raised.
        """
        """从队列中取出一个URL,状态切换为PROCESSING"""
        record = self.db.crawl_queue.find_and_modify(
            query={'status': self.OUTSTANDING},
            update={'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}}
        )
        if record:
            return record['_id']
        else:
            self.repair()
            raise KeyError()

    def peek(self):
        record = self.db.crawl_queue.find_one({'status': self.OUTSTANDING})
        if record:
            return record['_id']

    """下载结束,队列状态COMPLETE"""
    def complete(self, url):
        self.db.crawl_queue.update({'_id': url}, {'$set': {'status': self.COMPLETE}})

    """取出的URL无法正常完成时,处理URL进程被终止的情况"""
    def repair(self):
        """Release stalled jobs
        """
        record = self.db.crawl_queue.find_and_modify(
            query={
                'timestamp': {'$lt': datetime.now() - timedelta(seconds=self.timeout)},
                'status': {'$ne': self.COMPLETE}
            },
            update={'$set': {'status': self.OUTSTANDING}}
        )
        if record:
            print 'Released:', record['_id']

    def clear(self):
        self.db.crawl_queue.drop()
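
Because every state change goes through an atomic find_and_modify in MongoDB, several worker processes, even on different machines pointed at the same database, can share one queue safely. Each worker's loop boils down to roughly this sketch:

#coding: utf-8
from mongo_queue import MongoQueue

queue = MongoQueue(timeout=300)
queue.push('http://example.webscraping.com')  # duplicate pushes are silently ignored

while queue:                  # __nonzero__: is any job not yet COMPLETE?
    try:
        url = queue.pop()     # atomically flips OUTSTANDING -> PROCESSING
    except KeyError:
        break                 # nothing OUTSTANDING right now
    # ... download and parse url here ...
    queue.complete(url)       # PROCESSING -> COMPLETE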

<Multiprocess crawler>

#coding: utf-8
import time
import urlparse
import threading
import multiprocessing
from mongo_cache import MongoCache
from mongo_queue import MongoQueue
from downloader import Downloader

SLEEP_TIME = 1


def threaded_crawler(seed_url, delay=5, cache=None, scrape_callback=None, user_agent='wswp', proxies=None, num_retries=1, max_threads=10, timeout=60):
    """Crawl using multiple threads
    """
    # the queue of URLs that still need to be crawled
    crawl_queue = MongoQueue()  # MongoDB-backed crawl queue
    crawl_queue.clear()
    crawl_queue.push(seed_url)  # the queue itself already filters out duplicate URLs
    D = Downloader(cache=cache, delay=delay, user_agent=user_agent, proxies=proxies, num_retries=num_retries, timeout=timeout)

    def process_queue():
        while True:
            # take the next URL that is waiting to be processed
            try:
                url = crawl_queue.pop()
            except KeyError:
                # currently no urls to process
                break
            else:
                html = D(url)
                if scrape_callback:
                    try:
                        links = scrape_callback(url, html) or []
                    except Exception as e:
                        print 'Error in callback for: {}: {}'.format(url, e)
                    else:
                        for link in links:
                            # add this new link to queue
                            crawl_queue.push(normalize(seed_url, link))
                """最后调用complete()方法,用于记录该URL已经被成功解析"""
                crawl_queue.complete(url)


    # wait for all download threads to finish
    threads = []
    while threads or crawl_queue:
        for thread in threads:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < max_threads and crawl_queue.peek():
            # can start some more threads
            thread = threading.Thread(target=process_queue)
            thread.setDaemon(True) # set daemon so main thread can exit when receives ctrl-c
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)


def process_crawler(args, **kwargs):
    # number of available CPU cores
    num_cpus = multiprocessing.cpu_count()  
    #pool = multiprocessing.Pool(processes=num_cpus)
    print 'Starting {} processes'.format(num_cpus)
    processes = []
    """在每个新进程中启动多线程爬虫"""
    for i in range(num_cpus):
        p = multiprocessing.Process(target=threaded_crawler, args=[args], kwargs=kwargs)
        #parsed = pool.apply_async(threaded_link_crawler, args, kwargs)
        p.start()
        processes.append(p)
    # wait for all processes to finish executing
    for p in processes:
        p.join()

"""规范化URL"""
def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    """
    link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)

<Multiprocess crawler test>

import sys
from process_crawler import process_crawler
from mongo_cache import MongoCache
from alexa_cb import AlexaCallback


def main(max_threads):
    scrape_callback = AlexaCallback()
    cache = MongoCache()
    cache.clear()
    process_crawler(scrape_callback.seed_url, scrape_callback=scrape_callback, cache=cache, max_threads=max_threads, timeout=10)


if __name__ == '__main__':
    max_threads = int(sys.argv[1])
    main(max_threads)

Copyright: original work by ShirleyPaul; reproduction without the author's permission is not allowed.
