Scrapy Notes (Part 1)

Creating a project:
Enter "scrapy startproject tutorial" in your console.
This will create a directory named tutorial that looks like the following:
tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
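As an aside, items.py is where structured item classes can be declared. A minimal sketch, assuming we want a container for the quote data scraped later in this note (QuoteItem is a hypothetical name; the spiders below yield plain dicts instead):

import scrapy

class QuoteItem(scrapy.Item):
    # Declares the fields a scraped quote carries; these mirror the
    # data extracted later in this note.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()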
Create a file named quotes_spider.py under the tutorial/spiders directory. The code looks like the following:
 
 
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # the spider's name; you use it when running the spider

    def start_requests(self):  # must return an iterable of Requests
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # parses the response and extracts the scraped data
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Our class subclasses scrapy.Spider and defines some attributes and methods.
To run our spider, type the following command into your console and run it:
scrapy crawl quotes  # quotes is the spider's name we defined above
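You can also run the spider from a plain Python script instead of the command line. A minimal sketch using Scrapy's CrawlerProcess (the import path assumes the project layout shown above):

from scrapy.crawler import CrawlerProcess
from tutorial.spiders.quotes_spider import QuotesSpider

process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()  # the script blocks here until the crawl finishes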
Notice:
Instead of implementing a start_requests() method, you can just define a start_urls class attribute with a list of URLs, like this:
 
 
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
The parse() method will be called to handle each of the requests for these URLs.
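For completeness, the whole shortened spider then looks like this; parse() keeps the same body as before:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):  # called as the default callback for start_urls
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)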
If you run this spider as-is, the scraped data is only displayed in the console output; it is not stored anywhere. To save the scraped data, two steps are needed.
First:
	Modify the parse() method to yield the scraped data, using Python's yield keyword, like the following code:
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }
Second:
	Type the following command in your console and run it:
	scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing the scraped items in JSON format.
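Each entry in quotes.json is one dict yielded by parse(). The file looks roughly like this (values are illustrative; they depend on the site's current content):

[
    {
        "text": "“The world as we have created it is a process of our thinking. ...”",
        "author": "Albert Einstein",
        "tags": ["change", "deep-thoughts", "thinking", "world"]
    },
    ...
]

Note that -o appends to an existing file, so remove quotes.json before re-running the command; otherwise the result will not be valid JSON.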