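So far we have implemented a link crawler, a data-scraping crawler, and caching. All of those crawlers download pages serially: a new download starts only after the previous one has finished. Serial downloading is adequate for small websites, but it becomes very inefficient on large ones.
We now build, step by step, concurrent crawlers that download with multiple threads and with multiple processes.
First, get the list of the one million most popular websites from Alexa (it can be downloaded directly as a compressed file from http://s3.amazonaws.com/alexa-static/top-1m.csv.zip).
Then extract the contents of the compressed file: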
# alexaCB.py
import csv
from zipfile import ZipFile
from io import BytesIO, TextIOWrapper
from .mongoCache import MongoCache


class AlexaCallback:
    def __init__(self, maxUrls=1000):
        self.maxUrls = maxUrls
        self.seedUrl = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        # always return a list, even for urls other than the seed
        urls = []
        if url == self.seedUrl:
            cache = MongoCache()
            # a zip archive is binary data, so this assumes the downloader
            # returns the response body as bytes
            with ZipFile(BytesIO(html)) as zf:
                csvFilename = zf.namelist()[0]
                with zf.open(csvFilename) as f:
                    # zf.open() yields bytes, while csv.reader expects text
                    for _, website in csv.reader(TextIOWrapper(f, encoding='utf-8')):
                        if 'http://' + website not in cache:
                            urls.append('http://' + website)
                            if len(urls) == self.maxUrls:
                                break
        return urls
To use it with the crawler developed earlier, pass an instance of this class as the scrapeCallback argument.
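The zip-handling step can be tried without downloading the real archive or running MongoDB. The following is a minimal sketch under stated assumptions: `build_sample_archive` and `extract_urls` are hypothetical helper names, and the three domains are made-up sample data that only mimic the rank,domain layout of the CSV file.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def build_sample_archive():
    """Build a tiny in-memory zip with rank,domain rows (made-up sample data)."""
    buffer = BytesIO()
    with ZipFile(buffer, 'w') as zf:
        zf.writestr('top-1m.csv', '1,example.com\n2,example.org\n3,example.net\n')
    return buffer.getvalue()


def extract_urls(zip_bytes, max_urls=1000):
    """Return up to max_urls URLs read from the first CSV file in the archive."""
    urls = []
    with ZipFile(BytesIO(zip_bytes)) as zf:
        csv_filename = zf.namelist()[0]
        with zf.open(csv_filename) as raw:
            # zf.open() yields bytes, while csv.reader expects text
            for _, website in csv.reader(TextIOWrapper(raw, encoding='utf-8')):
                urls.append('http://' + website)
                if len(urls) == max_urls:
                    break
    return urls


sample_urls = extract_urls(build_sample_archive(), max_urls=2)
print(sample_urls)
```

The sketch leaves out the cache lookup that AlexaCallback performs, so it shows only the decompress, parse, and truncate steps.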
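Multithreaded crawler
Multithreaded programming is relatively simple in Python. We can keep a queue structure similar to the one in the link crawler developed earlier and simply start the crawl loop in several threads, so that the links are downloaded in parallel. The code is as follows: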
# threadCrawler.py
import time
import threading
import urllib.parse
from downloader import Downloader

SLEEP_TIME = 3


def threadCrawler(seedUrl, delay=5, cache=None, scrapeCallback=None, userAgent='wswp', proxies=None, numRetries=1, maxThreads=10, timeout=60):
    """
    Crawl this website in multiple threads
    """
    # the queue of URLs that still need to be crawled
    crawlQueue = [seedUrl]
    # the URLs that have been seen
    seen = set([seedUrl])
    downloader = Downloader(cache=cache, delay=delay, userAgent=userAgent, proxies=proxies, numRetries=numRetries, timeout=timeout)

    def processQueue():
        while True:
            try:
                url = crawlQueue.pop()
            except IndexError:
                # the crawl queue is empty
                break
            else:
                html = downloader(url)
                if scrapeCallback:
                    try:
                        links = scrapeCallback(url, html) or []
                    except Exception as e:
                        print('Error in callback for: {}: {}'.format(url, e))
                    else:
                        for link in links:
                            link = normalize(seedUrl, link)
                            if link not in seen:
                                seen.add(link)
                                crawlQueue.append(link)

    # wait for all download threads to finish
    threads = []
    while threads or crawlQueue:
        # the crawl is still active
        # iterate over a copy so stopped threads can be removed safely
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < maxThreads and crawlQueue:
            # can start some more threads
            thread = threading.Thread(target=processQueue)
            # daemon threads let the main thread exit when it receives ctrl-c
            thread.daemon = True
            thread.start()
            threads.append(thread)
        # sleep temporarily so the CPU can focus execution on other threads
        time.sleep(SLEEP_TIME)


def normalize(seedUrl, link):
    """
    Normalize this url by removing hash and adding domain
    """
    link, _ = urllib.parse.urldefrag(link)
    return urllib.parse.urljoin(seedUrl, link)
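While there are URLs left to crawl, the loop in the multithreaded crawler above keeps creating threads until the pool reaches its maximum size. During the crawl, a thread stops early if the queue has no more URLs to crawl.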
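The worker pattern can be observed in isolation with a stand-in for the downloader. This is a rough sketch, not the crawler above: `crawl_with_threads` and `fake_fetch` are hypothetical names, no network access happens, and the main thread simply joins the workers instead of polling with a sleep.

```python
import threading


def crawl_with_threads(urls, fetch, max_threads=4):
    """Drain a shared list of urls with several worker threads."""
    queue = list(urls)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = queue.pop()
            except IndexError:
                # the queue is empty, so this worker stops
                break
            body = fetch(url)
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(max_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    """Pretend download that just echoes the url (no network access)."""
    return 'content of ' + url


pages = crawl_with_threads(['http://example.com/%d' % i for i in range(20)], fake_fetch)
print(len(pages))
```

Each URL is handled by exactly one worker because `list.pop()` removes the item before the worker starts on it.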
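Multiprocess crawler
To improve performance further, we extend the multithreaded crawler to support multiple processes. So far the crawl queue has been held in local memory, so no other process can contribute to the same crawl. To solve this, the queue has to move to storage that other processes can access. Storing the queue separately also means that crawlers running on different servers can cooperate on the same crawl. For a more robust queue, consider a dedicated message-passing tool such as Celery. Here we reuse MongoDB to store the queue separately. The MongoDB-backed queue is implemented as follows: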
# mongoQueue.py
from datetime import datetime, timedelta
from pymongo import MongoClient, errors


class MongoQueue:
    # possible states of a download
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self, client=None, timeout=300):
        """
        :param client: MongoClient instance; connects to the default server if None
        :param timeout: seconds after which an unfinished job counts as stalled
        """
        self.client = MongoClient() if client is None else client
        self.db = self.client.cache
        self.timeout = timeout

    def __bool__(self):
        """
        Returns True if there are more jobs to process
        (Python 3 uses __bool__; __nonzero__ is the Python 2 name)
        """
        record = self.db.crawlQueue.find_one(
            {'status': {'$ne': self.COMPLETE}}
        )
        return record is not None

    def push(self, url):
        """
        Add new url to queue if it does not exist
        """
        try:
            self.db.crawlQueue.insert_one({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            # this url is already in the queue
            pass

    def pop(self):
        """
        Get an outstanding url from the queue and set its status to processing.
        If the queue is empty a KeyError exception is raised.
        """
        record = self.db.crawlQueue.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}}
        )
        if record:
            return record['_id']
        else:
            self.repair()
            raise KeyError

    def peek(self):
        record = self.db.crawlQueue.find_one({'status': self.OUTSTANDING})
        if record:
            return record['_id']

    def complete(self, url):
        self.db.crawlQueue.update_one({'_id': url}, {'$set': {'status': self.COMPLETE}})

    def repair(self):
        """
        Release stalled jobs
        """
        record = self.db.crawlQueue.find_one_and_update(
            {
                'timestamp': {'$lt': datetime.now() - timedelta(seconds=self.timeout)},
                'status': {'$ne': self.COMPLETE}
            },
            {'$set': {'status': self.OUTSTANDING}}
        )
        if record:
            print('Released:', record['_id'])

    def clear(self):
        self.db.crawlQueue.drop()
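The code above defines three states for a URL: OUTSTANDING, PROCESSING, and COMPLETE. A newly added URL is OUTSTANDING; when it is popped from the queue for download it becomes PROCESSING; when the download finishes it becomes COMPLETE. Much of the code deals with URLs that were popped but never completed, for example because the process handling them was terminated. To recover from that case the class takes a timeout parameter, 300 seconds by default. In the repair method, a URL that has been in processing for longer than timeout is assumed to have failed, and its status is reset to OUTSTANDING so that it can be processed again.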
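The state transitions and the timeout-based repair can be traced without a MongoDB server. The sketch below is an assumption-laden stand-in, not the MongoQueue class itself: `InMemoryQueue` is a hypothetical class that keeps its records in a dict, takes an explicit `now` argument so the timeout can be simulated, and releases every stalled record in one pass rather than one per call.

```python
from datetime import datetime, timedelta

OUTSTANDING, PROCESSING, COMPLETE = range(3)


class InMemoryQueue:
    """Hypothetical stand-in for MongoQueue that keeps records in a dict."""

    def __init__(self, timeout=300):
        self.records = {}  # url -> {'status': ..., 'timestamp': ...}
        self.timeout = timeout

    def push(self, url):
        # a url that is already known is ignored, like a duplicate-key error
        self.records.setdefault(url, {'status': OUTSTANDING, 'timestamp': None})

    def pop(self, now=None):
        now = now or datetime.now()
        for url, record in self.records.items():
            if record['status'] == OUTSTANDING:
                record['status'] = PROCESSING
                record['timestamp'] = now
                return url
        self.repair(now)
        raise KeyError('no outstanding urls')

    def complete(self, url):
        self.records[url]['status'] = COMPLETE

    def repair(self, now=None):
        # reset jobs that have been in processing for longer than the timeout
        now = now or datetime.now()
        deadline = now - timedelta(seconds=self.timeout)
        for record in self.records.values():
            if record['status'] == PROCESSING and record['timestamp'] < deadline:
                record['status'] = OUTSTANDING


queue = InMemoryQueue(timeout=300)
queue.push('http://example.com')
queue.push('http://example.com')          # duplicate, ignored
start = datetime(2024, 1, 1, 12, 0, 0)
url = queue.pop(now=start)                # now PROCESSING
# suppose the worker dies here; ten minutes later the job is released
queue.repair(now=start + timedelta(minutes=10))
status_after_repair = queue.records[url]['status']
retried = queue.pop(now=start + timedelta(minutes=10))
queue.complete(retried)
```

Unlike the MongoDB version, this dict-based queue is neither shared between processes nor safe for concurrent access, so it only illustrates the lifecycle of a single URL.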
The multiprocess crawler is implemented as follows:
# processCrawler.py
import time
import urllib.parse
import threading
import multiprocessing
from mongoCache import MongoCache
from mongoQueue import MongoQueue
from downloader import Downloader

SLEEP_TIME = 1


def threadedCrawler(seedUrl, delay=5, cache=None, scrapeCallback=None, userAgent='wswp', proxies=None, numRetries=1, maxThreads=10, timeout=60):
    """
    Crawl using multiple threads, sharing the queue through MongoDB
    """
    # the queue of URLs that still need to be crawled
    crawlQueue = MongoQueue()
    # note: every process runs this, so a process that starts late
    # also clears URLs that other processes have already queued
    crawlQueue.clear()
    crawlQueue.push(seedUrl)
    downloader = Downloader(cache=cache, delay=delay, userAgent=userAgent, proxies=proxies, numRetries=numRetries, timeout=timeout)

    def processQueue():
        while True:
            # pop() marks the url as being processed
            try:
                url = crawlQueue.pop()
            except KeyError:
                # currently no urls to process
                break
            else:
                html = downloader(url)
                if scrapeCallback:
                    try:
                        links = scrapeCallback(url, html) or []
                    except Exception as e:
                        print('Error in callback for: {}: {}'.format(url, e))
                    else:
                        for link in links:
                            # add this new link to the queue
                            crawlQueue.push(normalize(seedUrl, link))
                crawlQueue.complete(url)

    # wait for all download threads to finish
    threads = []
    while threads or crawlQueue:
        # iterate over a copy so stopped threads can be removed safely
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)
        while len(threads) < maxThreads and crawlQueue.peek():
            # can start some more threads
            thread = threading.Thread(target=processQueue)
            thread.daemon = True
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)


def processCrawler(args, **kwargs):
    numCpus = multiprocessing.cpu_count()
    print('Starting {} processes'.format(numCpus))
    processes = []
    for i in range(numCpus):
        p = multiprocessing.Process(target=threadedCrawler, args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for processes to complete
    for p in processes:
        p.join()


def normalize(seedUrl, link):
    """
    Normalize this url by removing hash and adding domain
    """
    link, _ = urllib.parse.urldefrag(link)
    return urllib.parse.urljoin(seedUrl, link)
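The multiprocess crawler replaces Python's built-in list with the new MongoDB-backed queue, which handles duplicate URLs internally. Finally, complete() is called once a URL has been processed, to record that it was handled successfully.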
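Duplicate detection by URL only works because every link is normalized before it is queued. As a small illustration, assuming a made-up seed URL and made-up links, the following shows how different spellings of the same page collapse to one entry:

```python
from urllib.parse import urldefrag, urljoin


def normalize(seed_url, link):
    """Strip the fragment and resolve the link against the seed url."""
    link, _ = urldefrag(link)
    return urljoin(seed_url, link)


seed = 'http://example.com/index.html'
links = ['/about', 'about#team', 'http://example.com/about', '#top']

unique = []
for link in links:
    url = normalize(seed, link)
    # a plain list stands in for the queue's duplicate check here
    if url not in unique:
        unique.append(url)

print(unique)
# ['http://example.com/about', 'http://example.com/index.html']
```

The first three links all resolve to the same page, and the bare fragment resolves back to the seed page itself.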