To avoid overloading the server, we can add a delay between successive downloads, which slows the crawler down.
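import datetime
import time
try:
    import urlparse  # Python 2
except ImportError:
    from urllib import parse as urlparse  # Python 3

class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        self.delay = delay  # minimum number of seconds between requests to one domain
        self.domains = {}  # maps each domain to the time it was last accessed
    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)  # get() returns None if the domain has not been seen
        if self.delay > 0 and last_accessed is not None:
            # total_seconds() covers the whole interval; .seconds would drop days and fractions
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.datetime.now()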
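Throttle records the time of the last access to each domain. If the time elapsed since the last access to a domain is shorter than the specified delay (for example, 5 seconds), the program sleeps for the remaining time before downloading.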
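To make the timing rule concrete, here is a small sketch of the arithmetic that `wait` performs for a single domain. The helper name `remaining_delay` and the sample timestamps are assumptions made up for illustration; they are not part of the original code. The sketch uses `total_seconds()`, because `timedelta.seconds` holds only the whole-seconds component of the interval and ignores days and fractions of a second.

```python
import datetime

def remaining_delay(delay, last_accessed, now):
    """Seconds still to wait before requesting the same domain again (hypothetical helper)."""
    elapsed = (now - last_accessed).total_seconds()
    return max(0.0, delay - elapsed)

# Sample timestamps, chosen only for illustration.
last = datetime.datetime(2024, 1, 1, 12, 0, 0)
print(remaining_delay(5, last, last + datetime.timedelta(seconds=2)))    # 3.0
print(remaining_delay(5, last, last + datetime.timedelta(seconds=7.5)))  # 0.0
```

With a delay of 5 seconds, a request made 2 seconds after the previous one has to wait 3 more seconds, while a request made 7.5 seconds later goes ahead immediately.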
# Create the throttle once, then call wait() before every download
throttle = Throttle(delay)
throttle.wait(url)
result = download(url, headers, proxy=proxy)
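The usage above can also be tried out end to end without a real server. The following is a minimal Python 3 sketch under stated assumptions: `MonotonicThrottle`, its `clock` and `sleep` parameters, and `fake_download` are hypothetical names introduced here, and `time.monotonic` is swapped in for `datetime.datetime.now()` so that system clock adjustments cannot affect the delay. It is one possible variant, not the original implementation.

```python
import time
from urllib.parse import urlparse

class MonotonicThrottle:
    """Per-domain delay based on a monotonic clock (hypothetical variant)."""
    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock          # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.last_accessed = {}     # domain -> clock reading at the last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last = self.last_accessed.get(domain)
        if self.delay > 0 and last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_accessed[domain] = self.clock()

def fake_download(url):
    """Stand-in for a real download function (assumption for this demo)."""
    return "<html>%s</html>" % url

throttle = MonotonicThrottle(delay=0.2)
start = time.monotonic()
for url in ["http://example.com/a", "http://example.com/b", "http://example.org/a"]:
    throttle.wait(url)      # only the second example.com request is delayed
    fake_download(url)
elapsed = time.monotonic() - start
print("elapsed: %.2f s" % elapsed)
```

Because the throttle keys its timestamps by domain, the request to example.org is not held back by the earlier requests to example.com. Passing `clock` and `sleep` as parameters is a design choice that keeps the timing logic testable without actually sleeping.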