aio-scrapy
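An asynchronous crawler framework built on asyncio and the aio family of libraries, following the workflow and conventions of the Scrapy framework.
GitHub repository
Overview
- aio-scrapy is based on the open-source projects Scrapy and scrapy_redis; it can be thought of as an asyncio version of scrapy-redis.
- aio-scrapy supports scrapyd.
- aio-scrapy implements Redis and RabbitMQ request queues.
- aio-scrapy is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages.
- aio-scrapy supports distributed crawling.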
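To make the asyncio-based design concrete, here is a minimal sketch of a queue-driven crawl loop with bounded concurrency. It uses only the standard library and is not aio-scrapy's implementation: `FAKE_SITE`, `fetch`, `worker`, and `crawl` are hypothetical names, the download is simulated with a sleep, and the in-memory `asyncio.Queue` stands in for the Redis or RabbitMQ queue a distributed setup would share between processes.

```python
import asyncio

# Hypothetical in-memory "site": URL -> (items on that page, link to the next page).
FAKE_SITE = {
    "/page/1": (["a", "b"], "/page/2"),
    "/page/2": (["c"], "/page/3"),
    "/page/3": (["d", "e"], None),
}


async def fetch(url):
    """Stand-in for an HTTP download; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.01)
    return FAKE_SITE[url]


async def worker(queue, seen, items):
    """Pull URLs from the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            page_items, next_url = await fetch(url)
            items.extend(page_items)
            # Schedule each follow-up URL only once, as a duplicate filter would.
            if next_url is not None and next_url not in seen:
                seen.add(next_url)
                queue.put_nowait(next_url)
        finally:
            queue.task_done()


async def crawl(start_url, concurrency=2):
    """Crawl from start_url with a fixed number of concurrent workers."""
    queue = asyncio.Queue()
    seen = {start_url}
    items = []
    queue.put_nowait(start_url)
    workers = [
        asyncio.create_task(worker(queue, seen, items))
        for _ in range(concurrency)
    ]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return items


if __name__ == "__main__":
    print(asyncio.run(crawl("/page/1")))
```

The follow-up URL is queued before `task_done()` is called, so `queue.join()` cannot return while a page still has an unprocessed successor.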
Requirements
- Python 3.9+
- Works on Linux, Windows, macOS, BSD
Installation
Quick install:
pip install aio-scrapy -U
Usage
Create a project and a spider:
aioscrapy startproject project_quotes
cd project_quotes
aioscrapy genspider quotes
quotes.py:
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesMemorySpider.start()
Run the spider:
aioscrapy crawl quotes
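The `parse` method in quotes.py passes the `href` of the `li.next` link straight to `response.follow`. On quotes.toscrape.com that value is a relative path, so it has to be resolved against the URL of the current page before it can be requested. The sketch below shows that resolution step with the standard library only. It assumes `response.follow` behaves like its Scrapy counterpart here, which this README does not confirm; `resolve_next_page` is a hypothetical helper, not part of aio-scrapy.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Resolve an href extracted from a page against that page's URL."""
    return urljoin(page_url, href)


# A root-relative href replaces the whole path of the current page URL.
print(resolve_next_page("https://quotes.toscrape.com", "/page/2/"))
# → https://quotes.toscrape.com/page/2/
print(resolve_next_page("https://quotes.toscrape.com/page/2/", "/page/3/"))
# → https://quotes.toscrape.com/page/3/
```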
Create a single-file spider script:
aioscrapy singlespider single_quotes
single_quotes.py:
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }
    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        return request

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()
Run the spider:
aioscrapy runspider single_quotes.py
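The single-file spider defines three middleware-style hooks: `process_request`, `process_response`, and `process_exception`. The sketch below illustrates one plausible way such hooks wrap a download: the request hook runs first, a failed download is routed to the exception hook, and a successful one passes through the response hook. The call order and return-value handling are assumptions borrowed from Scrapy's downloader middleware model, not something this README documents for aio-scrapy. `LoggingHooks`, `fake_download`, and `download_with_hooks` are hypothetical names, and plain dicts stand in for request and response objects.

```python
import asyncio


class LoggingHooks:
    """Hypothetical hook holder mirroring the three static methods in single_quotes.py."""

    calls = []  # records which hooks ran, in order

    @staticmethod
    async def process_request(request, spider):
        LoggingHooks.calls.append(("request", request["url"]))
        return request

    @staticmethod
    async def process_response(request, response, spider):
        LoggingHooks.calls.append(("response", response["status"]))
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        LoggingHooks.calls.append(("exception", type(exception).__name__))


async def fake_download(request):
    """Stand-in downloader: fails for URLs containing 'bad', succeeds otherwise."""
    await asyncio.sleep(0)
    if "bad" in request["url"]:
        raise ConnectionError("simulated network failure")
    return {"url": request["url"], "status": 200}


async def download_with_hooks(request, hooks, spider=None):
    """Run request hook, then download, then response hook; errors go to the exception hook."""
    request = await hooks.process_request(request, spider)
    try:
        response = await fake_download(request)
    except Exception as exc:
        await hooks.process_exception(request, exc, spider)
        return None
    return await hooks.process_response(request, response, spider)


if __name__ == "__main__":
    asyncio.run(download_with_hooks({"url": "https://quotes.toscrape.com"}, LoggingHooks))
    asyncio.run(download_with_hooks({"url": "https://bad.example"}, LoggingHooks))
    print(LoggingHooks.calls)
```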
More commands:
aioscrapy -h