- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Exporting the scraped data using the command line
- Changing spider to recursively follow links
- Using spider arguments
1. Create a Scrapy project
scrapy startproject tutorial
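Running the command above generates a project skeleton roughly like the following (exact files vary slightly by Scrapy version):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py
```

Spiders such as quotes_spider.py go into the spiders/ directory.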
2. Write a spider to crawl the site and extract data
quotes_spider.py
3. Export the scraped data from the command line (-o writes the yielded items to a feed file such as quotes.json)
scrapy crawl quotes -o quotes.json
4. Change the spider to recursively follow links: parse() extracts the "Next" page's href and yields response.follow(next_page, self.parse), so the crawl continues page by page until there is no next link.
5. Use spider arguments: pass an argument with -a on the command line; the spider reads it via getattr(self, 'tag', None) and crawls only that tag's quotes:
scrapy crawl quotes -a tag=humor
To try out selectors interactively before writing them into the spider, open a Scrapy shell on a page:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Spider code:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # Read the optional -a tag=... spider argument (step 5).
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Recursively follow the "Next" pagination link (step 4).
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
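To see what the CSS selectors in parse() are pulling out without running a crawl, here is a standard-library-only sketch that mimics the text/author extraction with html.parser on a hypothetical HTML snippet (Scrapy itself uses parsel CSS/XPath selectors, not this parser):

```python
from html.parser import HTMLParser

# Hypothetical markup mimicking two div.quote blocks on quotes.toscrape.com.
HTML = '''
<div class="quote">
  <span class="text">"Quote one."</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">"Quote two."</span>
  <small class="author">Author B</small>
</div>
'''


class QuoteExtractor(HTMLParser):
    """Collects {'text', 'author'} dicts, like the items parse() yields."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._field = None  # which key the next text node belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get('class', '')
        if tag == 'div' and cls == 'quote':
            self.quotes.append({})      # start a new item, like div.quote
        elif tag == 'span' and cls == 'text':
            self._field = 'text'        # like span.text::text
        elif tag == 'small' and cls == 'author':
            self._field = 'author'      # like small.author::text

    def handle_data(self, data):
        if self._field and self.quotes:
            self.quotes[-1][self._field] = data
            self._field = None


parser = QuoteExtractor()
parser.feed(HTML)
print(parser.quotes)
```

This only illustrates the shape of the extracted items; the real spider also handles pagination and runs inside Scrapy's asynchronous engine.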