Scrapy Notes (Part 1)

Creating a project:
Enter "scrapy startproject tutorial" in your console.
This will create a directory named tutorial that looks like the following:
tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
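As an aside, items.py is where structured item classes can be declared. A minimal sketch, assuming we want a container for the quote data scraped later in this note (QuoteItem is a hypothetical name; the spiders below yield plain dicts instead):

import scrapy

class QuoteItem(scrapy.Item):
    # Declares the fields a scraped quote carries; these mirror the
    # data extracted later in this note.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()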
Create a file named quotes_spider.py under the tutorial/spiders directory. The code looks like the following:
 
 
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # the spider's name; you use it when running the spider

    def start_requests(self):  # must return an iterable of Requests
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # parses the response and extracts the scraped data
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Our class subclasses scrapy.Spider and defines some attributes and methods.
To run our spider, type the following command into your console and run it:
scrapy crawl quotes  # quotes is the spider's name we defined above
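You can also run the spider from a plain Python script instead of the command line. A minimal sketch using Scrapy's CrawlerProcess (the import path assumes the project layout shown above):

from scrapy.crawler import CrawlerProcess
from tutorial.spiders.quotes_spider import QuotesSpider

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()  # the script blocks here until the crawl finishes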
Notice:
Instead of implementing a start_requests() method, you can just define a start_urls class attribute with a list of URLs, like this:
 
 
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
The parse() method will be called to handle each of the requests for these URLs.
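For completeness, the whole shortened spider then looks like this; parse() keeps the same body as before:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):  # called as the default callback for start_urls
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)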
If you run this spider as-is, the scraped data is only displayed in the console output; it is not stored anywhere. To save the scraped data, two steps are needed.
First:
	Modify the parse() method to yield the scraped data, using Python's yield keyword, like the following code:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }
Second:
	Type the following command in your console and run it:
	scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing the scraped items in JSON format.
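Each entry in quotes.json is one dict yielded by parse(). The file looks roughly like this (values are illustrative; they depend on the site's current content):

[
    {
        "text": "“The world as we have created it is a process of our thinking. ...”",
        "author": "Albert Einstein",
        "tags": ["change", "deep-thoughts", "thinking", "world"]
    },
    ...
]

Note that -o appends to an existing file, so remove quotes.json before re-running the command; otherwise the result will not be valid JSON.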