Scrapy Notes

Installing Scrapy with pip

First install the system dependency packages:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

If you are working with Python 3, also install:

sudo apt-get install python3 python3-dev

Then install with pip:

sudo pip install Scrapy

To keep Scrapy from conflicting with your other Python programs, it is best to install it inside a virtual environment (this is what the official docs recommend).
Install virtualenv like this:

sudo pip install virtualenv

The usage guide is here:

virtualenv usage
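
A minimal sketch of that workflow (the environment name scrapy-env is just an example):

virtualenv scrapy-env              # create an isolated environment in ./scrapy-env
source scrapy-env/bin/activate     # activate it (Linux/macOS)
pip install Scrapy                 # installs into the environment, no sudo needed
deactivate                         # leave the environment when you are done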

There are also a few Python libraries you need to install:

sudo pip install lxml parsel w3lib twisted cryptography pyOpenSSL

lxml is an efficient XML and HTML parser for Python.

parsel is an HTML/XML data extraction library built on top of lxml.

w3lib is a multi-purpose helper for dealing with URLs and web page encodings.

The rest you can work out from their names.

Sometimes the versions on pip are not what you want; you can also install from the deb package repositories, like this:

This installs lxml for Python 2.x:

sudo apt-get install python-lxml

This installs lxml for Python 3.x:

sudo apt-get install python3-lxml

You can treat all of the above as an aside.

One library you do need is the OpenSSL dev package:

sudo apt-get install libssl-dev

Without it you simply cannot install the encryption-related modules mentioned above.

Actually, running pip install Scrapy directly also works: pip resolves the dependencies and installs them for you. If the overseas PyPI source is unusable for you (you know why), try Tsinghua's TUNA PyPI mirror, like this:

sudo pip install -i https://pypi.tuna.tsinghua.edu.cn/simple Scrapy
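
If you want pip to use the TUNA mirror every time instead of passing -i by hand, one option (assuming the per-user pip config path on Linux, ~/.pip/pip.conf) is:

mkdir -p ~/.pip
cat > ~/.pip/pip.conf <<'EOF'
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
EOF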

Next, actually using it.

(Author's note: the English passages are quoted from the official Scrapy documentation.)

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

That is, enter the directory where you want to keep your code, open a terminal there, and run the following command to initialize the Scrapy project:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module
                          # you'll import your code from here

        __init__.py       # makes this directory a Python package

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later
                          # put your spiders

            __init__.py   # likewise, makes spiders/ a package

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

import scrapy


class QuotesSpider(scrapy.Spider): # define a class that inherits from scrapy.Spider
    name = "quotes"

    def start_requests(self):
        urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
        ] # this is a list, not a tuple

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse) # yield one Request per URL; together they form an iterable

    def parse(self, response): # handle the response downloaded for each request
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Explanation:

  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

  • The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

Running the Spider

Go back to the top-level directory of our project and run:

scrapy crawl quotes

This runs the spider named quotes, the one we just put under tutorial/spiders, where we set its name attribute to quotes:

name = "quotes" #截取自刚刚那段代码

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as argument.

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
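
A tiny illustration of that default: inside a spider, the two requests below behave identically (the URL is just an arbitrary example):

yield scrapy.Request('http://quotes.toscrape.com/page/3/')                       # no callback given: parse() is used
yield scrapy.Request('http://quotes.toscrape.com/page/3/', callback=self.parse)  # same thing, spelled out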

Extracting data

The best way to learn how to extract data with Scrapy is trying selectors using the shell Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

NOTE: remember to always enclose the URL in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside element. If we don’t specify ::text, we’d get the full title element, including its tags:

Adding ::text means we only want the text inside the element; if you leave it off you get the full element, including its tags, as shown below:

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The result of running response.css(‘title’) is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css('title::text').extract()
['Quotes to Scrape']

The other thing is that the result of calling .extract() is a list, because we’re dealing with an instance of SelectorList. When you know you just want the first result, as in this case, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

Alternatively, you can write it like this:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

The two forms above do the same thing.

However, using .extract_first() avoids an IndexError and returns None when it doesn’t find any element matching the selection.

In other words, .extract_first() spares you the IndexError (a list-index-out-of-range error) and simply returns None when nothing matches.
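
For example (span.nonexistent is a made-up selector that matches nothing on this page), .extract_first() also accepts a default value to return instead of None:

>>> response.css('span.nonexistent::text').extract_first() is None
True
>>> response.css('span.nonexistent::text').extract_first(default='not-found')
'not-found'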

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

Roughly speaking: even when something you expect is not found on a page, the crawl can keep going and you still get at least partial data.

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:

This one is easy enough to follow:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

re here means regular expressions, of course.

In order to find the proper CSS selectors to use, you might find useful opening the response page from the shell in your web browser using view(response). You can use your browser developer tools or extensions like Firebug (see sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool to quickly find CSS selector for visually selected elements, which works in many browsers.

Selector Gadget

XPath: a brief introduction

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under-the-hood. You can see that if you read closely the text representation of the selector objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because besides navigating the structure, it can also look at the content. Using XPath, you’re able to select things like: select the link that contains the text “Next Page”. This makes XPath very fitting to the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors, it will make scraping much easier.
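
A small sketch of that "link containing the text" idea, using XPath's contains() function (assuming the page-1 response from the shell session above, whose pager markup appears later in these notes):

>>> response.xpath("//a[contains(text(), 'Next')]/@href").extract_first()
'/page/2/'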

using XPath with Scrapy Selectors here

this tutorial to learn XPath through examples

this tutorial to learn “how to think in XPath”

Extracting quotes and authors

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

通过上面的查询返回的每个选择器允许我们对它们的子元素进行进一步的查询。 让我们将第一个选择器分配给一个变量,以便我们可以直接对特定的引用运行我们的CSS选择器:

>>> quote = response.css("div.quote")[0]

Now, let’s extract title, author and the tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'
<div class="quote"> # div class = "quote"
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span> # span class = "text"
    <span>
        by <small class="author">Albert Einstein</small> # small class = "author"
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags"> # div class = "tags"
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        # a class = "tag"

        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quotes elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

Extracting data in our Spider

Let’s get back to our spider. Until now, it doesn’t extract any data in particular, just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it will output the extracted data in the log, for example:

2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

That will generate a quotes.json file containing all scraped items, serialized in JSON.

For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you’ll end up with a broken JSON file.

In other words: if you run the command a second time without deleting quotes.json first, you end up with a broken JSON file.
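
Roughly what goes wrong (a simplified illustration, not actual file contents): each run appends a complete JSON array, so after two runs the file holds two arrays back to back, which no JSON parser will accept:

[{"text": "...", "author": "...", "tags": ["..."]}]
[{"text": "...", "author": "...", "tags": ["..."]}]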

You can also use other formats, like JSON Lines:

scrapy crawl quotes -o quotes.jl

JSON Lines

JSON Lines looks roughly like this:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

It bills itself as "Better than CSV".

The official JSON Lines site also says:

JSON allows encoding Unicode strings with only ASCII escape sequences, however those escapes will be hard to read when viewed in a text editor. The author of the JSON Lines file may choose to escape characters to work with plain ASCII files.

Encodings other than UTF-8 are very unlikely to be valid when decoded as UTF-8 so the chance of accidentally misinterpreting characters in JSON Lines files is low.

Each Line is a Valid JSON Value

The line separator is '\n', so on Windows everything may end up displayed on a single line, while on Linux it looks fine.

Nice.

The JSON Lines format is useful because it’s stream-like, you can easily append new records to it. It doesn’t have the same problem of JSON when you run twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory, there are tools like JQ to help doing that at the command-line.
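
For instance, a few lines of plain Python are enough to stream quotes.jl (the file produced above) one record at a time:

import json

with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)                  # one scraped item per line
        print(item['author'], len(item['tags']))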

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don’t need to implement any item pipelines if you just want to store the scraped items.

(In other words, you can use an item pipeline to hand the scraped items on to other programs or processing steps.)
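
A minimal pipeline sketch for tutorial/pipelines.py, just to show the shape (the drop-items-without-text rule is my own example, not part of the official tutorial):

from scrapy.exceptions import DropItem


class QuotesPipeline(object):
    def process_item(self, item, spider):
        # Scrapy calls this for every item yielded by the spider
        if not item.get('text'):
            raise DropItem('missing text in %r' % item)
        return item

To switch it on, list it in ITEM_PIPELINES in settings.py, e.g. ITEM_PIPELINES = {'tutorial.pipelines.QuotesPipeline': 300}.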

Following links (that is, finding the links inside a page and crawling those too; this is important)

Let’s say, instead of just scraping the stuff from the first two pages from http://quotes.toscrape.com, you want quotes from all the pages in the website.

So far we have only scraped two pages; now we want quotes from every page on the site.

Now that you know how to extract data from pages, let’s see how to follow links from them.

(Self-explanatory.)

First thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

That is, we parse out the link we found in the page:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true"></span></a>'

This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Let’s see now our spider modified to recursively follow the link to the next page, extracting data from it:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

That is, parse() finds the link to the next page, builds a full absolute URL with urljoin() (the links can be relative), and yields a new Request for that next page.
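
For example, in a shell session where response came from page 1:

>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'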

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

That is, you can build complex crawlers of your own by following links according to the rules you define.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with pagination.

In our example this creates a small loop that keeps following the next-page link until there is no next page left.

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider will start from the main page, it will follow all the links to the authors pages calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.

Note: it follows all the links to the author pages.

The parse_author callback defines a helper function to extract and cleanup the data from a CSS query and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.
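
A minimal CrawlSpider sketch, just to show the shape of that rules engine (the restrict_css value is an assumption based on this site's pager markup):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        # follow every pagination link, call parse_page on the pages it leads to,
        # and keep following links from those pages as well
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # a CrawlSpider must not override parse(), so the callback gets another name
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}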

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks:

trick to pass additional data to the callbacks
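
A hedged sketch of that trick using Request.meta (the spider name and the quotes_on_page key are made up for illustration):

import scrapy


class AuthorWithCountSpider(scrapy.Spider):
    name = 'author_with_count'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes_on_page = len(response.css('div.quote'))
        for href in response.css('.author + a::attr(href)').extract():
            request = scrapy.Request(response.urljoin(href),
                                     callback=self.parse_author)
            # stash data from this page so the next callback can read it
            request.meta['quotes_on_page'] = quotes_on_page
            yield request

    def parse_author(self, response):
        yield {
            'name': response.css('h3.author-title::text').extract_first(),
            'quotes_on_source_page': response.meta['quotes_on_page'],
        }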

Using spider arguments

You can provide command line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

These arguments are passed to the Spider’s init method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small a::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)

If you pass the tag=humor argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.

You can learn more about handling spider arguments here.

Next steps

This tutorial covered only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the section Basic concepts to know more about the command-line tool, spiders, selectors and other things the tutorial hasn’t covered like modeling the scraped data. If you prefer to play with an example project, check the Examples section.
