python-aiohttp: limiting requests per second per domain

I'm writing a web crawler that runs parallel fetches for many different domains. I want to limit the number of requests per second to each individual domain, but I don't care about the total number of open connections or the total requests per second across all domains. I want to maximize the number of open connections and requests per second overall, while limiting the request rate to any single domain.

All of the examples I can currently find either (1) limit the number of open connections or (2) limit the total number of requests per second made in the fetch loop. Examples include:

Neither of them does what I'm asking for, which is to limit requests per second on a per-domain basis. The first question only answers how to limit requests per second overall. The second one doesn't even have answers to the actual question (the OP asks about requests per second, while all of the answers talk about limiting the number of connections).
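If I remember right, the approach in those answers boils down to a single shared limiter along the lines of the sketch below (my own paraphrase, not code quoted from either question). It caps everything together, so a busy or slow domain throttles requests to every other domain as well, which is exactly what I want to avoid:

import asyncio
import aiohttp

# One global limiter shared by every request: this caps total
# concurrency/rate across ALL domains rather than per domain.
GLOBAL_CONCURRENCY = 10
global_limiter = asyncio.Semaphore(GLOBAL_CONCURRENCY)

async def fetch_globally_limited(session, url):
    async with global_limiter:          # every domain competes for the same slots
        async with session.get(url) as response:
            return await response.text()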

Here is the code I tried, using a simple rate limiter I originally made for a synchronous version. It does not work when the DomainTimer code is run inside the async event loop:

from collections import defaultdict
from datetime import datetime, timedelta
import asyncio
import async_timeout
import aiohttp
from urllib.parse import urlparse
from queue import Queue, Empty

from HTMLProcessing import processHTML
import URLFilters

SEED_URLS = ['http://www.bbc.co.uk', 'http://www.news.google.com']

url_queue = Queue()
for u in SEED_URLS:
    url_queue.put(u)

# number of pages to download per run of crawlConcurrent()
BATCH_SIZE = 100
DELAY = timedelta(seconds=1.0)  # delay between requests to a single domain, in seconds

HTTP_HEADERS = {'Referer': 'http://www.google.com',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'}


class DomainTimer():
    def __init__(self):
        self.timer = None

    def resetTimer(self):
        self.timer = datetime.now()

    def delayExceeded(self, delay):
        if not self.timer:  # we haven't fetched from this domain before
            return True
        if (datetime.now() - self.timer) >= delay:
            return True
        else:
            return False


crawl_history = defaultdict(dict)  # given a URL, when was it last crawled?
domain_timers = defaultdict(DomainTimer)


async def fetch(session, url):
    domain = urlparse(url).netloc
    print('here fetching ' + url + "\n")
    dt = domain_timers[domain]

    if dt.delayExceeded(DELAY) or not dt:
        with async_timeout.timeout(10):
            try:
                dt.resetTimer()  # reset domain timer
                async with session.get(url, headers=HTTP_HEADERS) as response:
                    if response.status == 200:
                        crawl_history[url] = datetime.now()
                        html = await response.text()
                        return {'url': url, 'html': html}
                    else:
                        # log HTTP response, put into crawl_history so
                        # we don't attempt to fetch again
                        print(url + " failed with response: " + str(response.status) + "\n")
                        return {'url': url, 'http_status': response.status}
            except aiohttp.ClientConnectionError as e:
                print("Connection failed " + str(e))
            except aiohttp.ClientPayloadError as e:
                print("Received bad data from server @ " + url + "\n")
    else:  # delay hasn't passed yet: skip for now & put at end of queue
        url_queue.put(url)
        return None

async def fetch_all(urls):
    """Launch requests for all web pages."""
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            task = asyncio.ensure_future(fetch(session, url))
            tasks.append(task)  # create list of tasks
        return await asyncio.gather(*tasks)  # gather task responses


def batch_crawl():
    """Launch requests for all web pages."""
    start_time = datetime.now()

    # Here we build the list of URLs to crawl for this batch
    urls = []
    for i in range(BATCH_SIZE):
        try:
            next_url = url_queue.get_nowait()  # get next URL from queue
            urls.append(next_url)
        except Empty:
            print("Processed all items in URL queue.\n")
            break

    loop = asyncio.get_event_loop()
    asyncio.set_event_loop(loop)
    pages = loop.run_until_complete(fetch_all(urls))

    crawl_time = (datetime.now() - start_time).seconds
    print("Crawl completed. Fetched " + str(len(pages)) + " pages in " + str(crawl_time) + " seconds.\n")
    return pages


def parse_html(pages):
    """Parse the HTML for each page downloaded in this batch."""
    start_time = datetime.now()
    results = {}

    for p in pages:
        if not p or not p.get('html'):
            print("Received empty page")
            continue
        else:
            url, html = p['url'], p['html']
            results[url] = processHTML(html)

    processing_time = (datetime.now() - start_time).seconds
    print("HTML processing finished. Processed " + str(len(results)) + " pages in " + str(processing_time) + " seconds.\n")
    return results


def extract_new_links(results):
    """Extract new links from the parsed results."""
    # later we could track where links were from here, anchor text, etc,
    # and weight queue priority based on that
    links = []
    for k in results.keys():
        new_urls = [l['href'] for l in results[k]['links']]
        for u in new_urls:
            if u not in crawl_history.keys():
                links.append(u)
    return links


def filterURLs(urls):
    urls = URLFilters.filterDuplicates(urls)
    urls = URLFilters.filterBlacklistedDomains(urls)
    return urls


def run_batch():
    pages = batch_crawl()
    results = parse_html(pages)
    links = extract_new_links(results)
    for l in filterURLs(links):
        url_queue.put(l)
    return results
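For context, I drive the batches with a simple loop along these lines (simplified, shown only to illustrate how run_batch() is meant to be called repeatedly):

if __name__ == '__main__':
    # Keep crawling batch after batch until the URL queue is drained.
    while not url_queue.empty():
        run_batch()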

No errors or exceptions are ever raised, and the rate-limiting code works fine for synchronous fetches, but the DomainTimer has no discernible effect when it runs inside the async event loop: the one-request-per-second-per-domain delay is not enforced…
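One direction I've been considering (but have not gotten working) is to replace the check-and-requeue logic in fetch() with a delay that is actually awaited, so that the event loop itself spaces out same-domain requests while coroutines for other domains keep running. Roughly the sketch below, where wait_for_domain(), domain_locks and next_allowed_time are hypothetical names of mine, not part of the code above:

import asyncio
import time
from collections import defaultdict

domain_locks = defaultdict(asyncio.Lock)   # one lock per domain, created on demand
next_allowed_time = defaultdict(float)     # monotonic time when a domain may be hit again

async def wait_for_domain(domain, delay_seconds=1.0):
    """Suspend this coroutine (and only this one) until `domain` may be requested again."""
    async with domain_locks[domain]:       # serialize waiting within a single domain
        wait = next_allowed_time[domain] - time.monotonic()
        if wait > 0:
            await asyncio.sleep(wait)      # coroutines for other domains keep running
        next_allowed_time[domain] = time.monotonic() + delay_seconds

# Inside fetch(), instead of requeueing the URL when the delay hasn't passed:
#     await wait_for_domain(domain, DELAY.total_seconds())
#     async with session.get(url, headers=HTTP_HEADERS) as response:
#         ...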

How would I modify this synchronous rate-limiting code so that it works within the async event loop? Thanks!
