Scrapy in Practice: A Powerful Tool for Web Crawling, Chapter 2: Your First Scrapy Spider

la_vie_est_belle

This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/La_vie_est_belle/article/details/108635634

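This article explains how to create and run a first spider with the Scrapy framework: understanding the generated code, analyzing the structure of the target site, writing the spider logic, running the spider, and exporting the data. The example crawls quotes and author information from quotes.toscrape.com, extracts the data with CSS selectors, and follows pagination. It also covers key points such as spider naming rules, domain restrictions, and data export formats.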

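Your First Scrapy Spider

In Chapter 1 we used the genspider command to generate a spider named quote in the spiders folder. In this chapter the author walks through the contents of quote.py and writes a first Scrapy spider that does something useful.

1. Explaining the initial code

Here is the initial code of quote.py: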

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

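1. The name attribute is the spider's name. Within a single project, the genspider command refuses to generate a second spider with the same name, as the screenshot below shows:

Here the author tried to generate another spider named quote in the myscrapy project, but the message reported that quote already exists. Spider names must therefore be unique within a project.

2. The allowed_domains attribute lists the domains the spider is allowed to crawl. Requests generated while the spider runs are filtered out if their URLs fall outside these domains. To widen the crawl, add the target domain to allowed_domains: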

allowed_domains = ['quotes.toscrape.com', 'example.com']
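The effect of this filter can be illustrated with a few lines of plain Python. This is only a simplified sketch of the idea, not Scrapy's actual implementation; the helper name is_allowed and the sample URLs are assumptions made for illustration. A URL passes if its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed_domains = ['quotes.toscrape.com', 'example.com']

# A link on the target site passes the check.
print(is_allowed('http://quotes.toscrape.com/page/2/', allowed_domains))
# A subdomain of an allowed domain also passes.
print(is_allowed('https://sub.example.com/', allowed_domains))
# A link to an unrelated site would be dropped.
print(is_allowed('https://www.python.org/', allowed_domains))
```

The three calls print True, True and False respectively.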

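3. The start_urls attribute holds the initial request URLs issued when the spider starts. Note that these URLs are not restricted by allowed_domains: an initial URL is not filtered out even if it lies outside the allowed domains.

For example, the author now changes the URL in start_urls to https://www.python.org/: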

start_urls = ['https://www.python.org/']

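After running the spider with the crawl command, we can see that the initial URL was indeed crawled.

4. Once a request for a URL in start_urls completes, the response is passed to the parse method as the response argument. In this method we parse the response, extract the data we want, or generate new requests. The example that follows shows the parse method in more detail.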

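2. Analyzing the target site

First visit quotes.toscrape.com. The page looks like this:

Before writing any spider code, we need to decide exactly which data to scrape and analyze the page structure of the target site.

Here the author will scrape every quote together with its author.

Select a target element, right-click it and choose Inspect. The screenshot is shown below:

We find that:

every quote and its author information sit in a div element whose class attribute is quote, i.e. div.quote;
the quote text is in the div span.text element;
the author information is in the div span small.author element;

Note: tutorials on XPath and CSS selector syntax are plentiful online, so this tutorial does not cover them again.

Scroll down to the "next page" control and inspect its element:

The link to the next page is in the href attribute of an a element; the CSS selector is li.next a::attr(href)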
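The structure just described can be tried out without running a spider. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors; the PAGE markup is a simplified, well-formed approximation with placeholder text (an assumption, not the site's real HTML), and parse_page is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one page of the site. Placeholder text,
# not real site content, and far less markup than the real page.
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">Quote one.</span>
    <span>by <small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Quote two.</span>
    <span>by <small class="author">Author B</small></span>
  </div>
  <ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
</body></html>
"""


def parse_page(page):
    """Hypothetical helper: return (list of quote dicts, href of the next page or None)."""
    root = ET.fromstring(page.strip())
    items = []
    # Counterpart of div.quote
    for quote in root.findall(".//div[@class='quote']"):
        # Counterparts of span.text and span small.author, relative to the div
        text = quote.find("span[@class='text']").text
        author = quote.find("span/small[@class='author']").text
        items.append({'text': text, 'author': author})
    # Counterpart of li.next a::attr(href)
    link = root.find(".//li[@class='next']/a")
    next_href = link.get('href') if link is not None else None
    return items, next_href


items, next_href = parse_page(PAGE)
print(items)
print(next_href)
```

ElementTree only handles well-formed XML and matches the class attribute exactly, so it is not a replacement for Scrapy's CSS selectors on real pages; it only mirrors the nesting of the three selectors.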

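3. Writing the spider

Now that we know which elements hold the target data, we can start writing code. The logic is as follows:

1. Select the div.quote elements on each page. This returns a list containing every quote and its author information on the current page.
2. Loop over that list and extract the quote text (div span.text) and the author (div span small.author) from each entry.
3. When the page has been processed, request the next page and repeat steps 1 and 2 until there is no next page.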
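Before looking at the real spider, the control flow of these three steps can be sketched with a hypothetical in-memory "site". The PAGES dictionary and its placeholder data are assumptions for illustration only, and the loop is sequential, whereas Scrapy schedules requests asynchronously.

```python
# Hypothetical in-memory "site": each page holds its quotes and the link to
# the next page (None on the last page). Placeholder data only.
PAGES = {
    '/': {
        'quotes': [('Quote 1', 'Author A'), ('Quote 2', 'Author B')],
        'next': '/page/2/',
    },
    '/page/2/': {
        'quotes': [('Quote 3', 'Author C')],
        'next': None,
    },
}


def crawl(start):
    """Yield one dict per quote, following 'next' links until there are none."""
    url = start
    while url is not None:
        page = PAGES[url]
        # Steps 1 and 2: take every quote on the page and emit its fields.
        for text, author in page['quotes']:
            yield {'text': text, 'author': author}
        # Step 3: move on to the next page, if any.
        url = page['next']


items = list(crawl('/'))
print(len(items))
```

In the spider itself the while loop disappears: parse yields a new request for the next page and names itself as the callback, which has the same effect.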

The code is as follows:

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # 1
        quotes = response.css('div.quote')
        print(quotes)
        
        # 2
        for quote in quotes:
            text = quote.css('span.text::text').extract_first()
            author = quote.css('span small.author::text').extract_first()

            yield {
                'text': text,
                'author': author
            }

        # 3
        next_url = response.css('li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

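1. First call the response's css method with a CSS selector to get every quote and its author information on the current page. Readers who prefer XPath can call xpath() instead: response.xpath('//div[@class="quote"]')

The resulting quotes is a list. Note that its elements are not yet the final data but Selector objects, each of which contains the data we want together with the element structure around it.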

[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]

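2. Loop over the Selector objects and extract the quote text and the author from each one. Again we call css() or xpath(), but the selector no longer needs to name the enclosing div, because we are already inside that element. Finally, we call extract_first to pull the target data out of the Selector.

Besides extract_first, Scrapy also provides extract. When css() or xpath() returns a list of Selector objects, the former returns only the data of the first element, while the latter returns a list with the data of every element.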
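The difference is easiest to see when a selector matches several elements or none at all. The class below is a hypothetical pure-Python stand-in, not a Scrapy class; it only mimics the behaviour of the two methods on a list of already matched values, using placeholder data.

```python
class MatchList(list):
    """Hypothetical stand-in for a list of matched values (not part of Scrapy)."""

    def extract(self):
        # Every matched value as a plain list; empty if nothing matched.
        return list(self)

    def extract_first(self, default=None):
        # Only the first matched value, or `default` if nothing matched.
        return self[0] if self else default


authors = MatchList(['Author A', 'Author B'])
nothing = MatchList()

print(authors.extract_first())  # Author A
print(authors.extract())        # ['Author A', 'Author B']
print(nothing.extract_first())  # None
print(nothing.extract())        # []
```

Because extract_first returns None instead of raising when nothing matches, the spider can safely test the result with a plain if, as it does for the next-page link. Newer Scrapy documentation uses the names get() and getall() for the same two operations, while extract_first() and extract() remain supported.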

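Once the target data has been extracted, we yield it as a dictionary.

3. Get the link to the next page. If it exists, call urljoin() to build a complete URL, then yield a scrapy.Request. The callback argument is self.parse, which means the next page's data is again handed to the parse method for parsing and extraction.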
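The joining step matters because the href on the page is typically a relative path. As a rough illustration, the standard library's urllib.parse.urljoin resolves a relative link against the current page URL in the same spirit as response.urljoin; the href value '/page/2/' and the helper name absolute are assumptions made for this sketch.

```python
from urllib.parse import urljoin


def absolute(page_url, href):
    """Hypothetical helper: resolve a (possibly relative) href against the current page URL."""
    return urljoin(page_url, href)


# From the first page to the second.
print(absolute('http://quotes.toscrape.com/', '/page/2/'))
# From the second page to the third: the leading slash makes the path
# replace the current one rather than being appended to it.
print(absolute('http://quotes.toscrape.com/page/2/', '/page/3/'))
```

These print http://quotes.toscrape.com/page/2/ and http://quotes.toscrape.com/page/3/.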