1. Install:
pip install Scrapy
2. Create a new project
scrapy startproject myspider
3. Test
Create author_spider.py in the spiders directory:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse_author)
        # follow pagination links
        for href in response.css('li.next a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse)

    def parse_author(self, response):
        # Not in the original notes: a minimal callback so the spider runs.
        # The selector is an assumption about the author page markup.
        yield {
            'name': response.css('h3.author-title::text').extract_first(),
            'url': response.url,
        }
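
The spider passes every extracted href through response.urljoin before requesting it, because the links on the page are relative. As a rough illustration of that step outside Scrapy, the sketch below uses the standard library's urllib.parse.urljoin, which should behave much the same for simple cases. The helper name absolutize and the two sample hrefs are illustrative only, not taken from the notes above.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each (possibly relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# Illustrative inputs: the start URL from the spider and two sample hrefs.
base = "http://quotes.toscrape.com/"
links = absolutize(base, ["/author/Albert-Einstein", "/page/2/"])

for link in links:
    print(link)
```

An href that is already absolute passes through unchanged, so the same call is safe for both kinds of link.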
4. Export the results to JSON
scrapy crawl author -o author.json
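
With a .json output file, the feed export should be a single JSON array holding one object per yielded item, so it can be read back with the standard json module. The sketch below is one way to check the export. The helper name load_items is hypothetical, and the demo writes a stand-in file with a placeholder record instead of reading real crawl output.

```python
import json
import tempfile
from pathlib import Path


def load_items(path):
    """Return the list of items stored in a JSON feed export."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demo on a stand-in file; the record below is a placeholder, not crawl output.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "author.json"
    sample.write_text(json.dumps([{"name": "placeholder"}]), encoding="utf-8")
    items = load_items(sample)
    item_count = len(items)

print(item_count)
```

To inspect a real crawl, call load_items('author.json') from the directory where the scrapy crawl command was run.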