The Scrapy Crawler Framework

阿无，

Categories: crawler, python

Original link: https://www.osgeo.cn/scrapy/intro/overview.html#walk-through-of-an-example-spider

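Background

Most crawlers are written to scrape web pages. My job at the company is to crawl information from TV set-top boxes, so topics that are irrelevant to that work, such as CSS selectors or XPath selectors, are not covered in detail in this article.

Versions

At the time of writing, the latest official release is 2.4.1. There is no official Chinese documentation.

Version used here: 1.8.0

# Check the Scrapy version
scrapy version -v

Python version: 3.6.1

The Chinese documentation I referred to covers version 2.3.0.
The official documentation is at https://docs.scrapy.org/en/latest/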

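Introduction

Scrapy is written in pure Python and is built on the Twisted asynchronous networking framework, which lets it crawl efficiently. It is one of the most capable and widely used crawler frameworks for Python.

It depends on a few key Python packages (among others):

lxml, an efficient XML and HTML parser
parsel, an HTML/XML data extraction library written on top of lxml
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
twisted, an asynchronous networking framework
cryptography and pyOpenSSL, to deal with various network-level security needs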

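Architecture

[Figure: Scrapy architecture]

Data flow in Scrapy is controlled by the execution engine, as follows:

The Spider passes Request objects (which may carry headers, POST data, proxy settings and so on) through the Engine to the Scheduler, which queues the Requests sent by the Engine.
The Scheduler hands Request objects back to the Engine, which sends them to the Downloader; the resulting Response objects are returned through the Engine to the Spider.
The Spider extracts the useful data and passes it to the Item Pipelines. If there are new URLs to crawl, the Spider passes new Request objects through the Engine to the Scheduler, and the cycle repeats.
On the way between the Engine and the Downloader or the Spider, Request and Response objects pass through the Downloader Middlewares and Spider Middlewares, which can modify or filter them.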
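
The data flow described above can be sketched in plain Python without Scrapy. This is a synchronous toy model, not how Scrapy is implemented (Scrapy is asynchronous): FAKE_SITE, download, parse and crawl are all hypothetical names made up for the illustration, and plain strings stand in for Request objects.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page data, links found on that page).
FAKE_SITE = {
    "/page/1/": ("quotes from page 1", ["/page/2/"]),
    "/page/2/": ("quotes from page 2", ["/page/3/", "/page/1/"]),
    "/page/3/": ("quotes from page 3", []),
}


def download(url):
    """Stand-in for the Downloader: returns a fake response."""
    data, links = FAKE_SITE[url]
    return {"url": url, "data": data, "links": links}


def parse(response):
    """Stand-in for a spider callback: yields one item, then follow-up URLs."""
    yield {"item": response["data"]}
    for link in response["links"]:
        yield link  # a plain string plays the role of a new Request


def crawl(start_urls):
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # duplicate filter
    items = []                      # what the Item Pipeline would receive
    while scheduler:                # Engine loop
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if isinstance(result, dict):
                items.append(result)
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)
    return items


items = crawl(["/page/1/"])
print(items)
```

The link back to /page/1/ on the second page is skipped by the duplicate filter, so the loop ends once the queue is empty.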

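Components

[Figure: Scrapy components]

Engine
Handles communication among the Spider, Item Pipeline, Downloader and Scheduler, including signals and data transfer.
Scheduler
Receives the Requests sent by the Engine, arranges and enqueues them in some order, and hands them back to the Engine when the Engine asks for them.
Downloader
Downloads all Requests sent by the Engine and returns the resulting Responses to the Engine, which passes them on to the Spider for processing.
Spider
Processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs to follow up to the Engine, from where they enter the Scheduler again.
Item Pipeline
Processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage and so on).
Downloader Middlewares
Components for customizing and extending the download step, for example to set a proxy.
Spider Middlewares
Components for customizing and extending the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it); they can be used to customize requests and filter responses.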

How multithreaded crawlers work

[Figure: how a multithreaded crawler works]

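Commonly used functions

'''
url: the URL to request
callback=None: the callback invoked after the request succeeds
method='GET': the HTTP method; defaults to GET
headers=None: the request headers, as a dict
cookies=None: cookies, e.g. to simulate a logged-in user, as a dict
meta=None: data to pass along with the request, as a dict
encoding='utf-8': the encoding
dont_filter=False: whether to skip duplicate filtering; the default False means duplicates are filtered out
errback=None: the callback invoked when the request fails
'''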
yield scrapy.Request(first_url,callback=self.parse_tags_page)

Getting-started example

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

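Go to the spiders directory and run the command scrapy runspider quotes.py
Appending -o quotes.jl to the command writes the scraped data to a .jl file in JSON Lines format, with each record on its own line.
Running it again appends the new records to the existing file.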
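
Because every line of a .jl file is a complete JSON document, the output can be read back one record at a time. The following is a minimal sketch using only the standard library; the two sample lines and their shortened text values are made up for the illustration, and in practice io.StringIO(...) would be replaced by open("quotes.jl", encoding="utf-8").

```python
import io
import json

# Two sample records in the JSON Lines layout (illustrative data).
sample = io.StringIO(
    '{"author": "Jane Austen", "text": "first quote"}\n'
    '{"author": "Steve Martin", "text": "second quote"}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in sample if line.strip()]
authors = [record["author"] for record in records]
print(authors)
```

This line-by-line layout is also why appending to an existing .jl file keeps it valid, which is not true of a single JSON array.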

Output

{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}
{'author': 'Steve Martin', 'text': '“A day without sunshine is like, you know, night.”'}
{'author': 'Garrison Keillor', 'text': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”'}…

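How it runs

When you enter the run command, Scrapy looks for the spider definition inside the file and runs it through the engine.

The crawl starts by sending requests to the URLs defined in start_urls and calls the default callback parse(), passing the response object of each request as the argument. The callback looks for the link to the next page and schedules another request with the same callback, and so on.

An advantage of Scrapy: requests are asynchronous. This means Scrapy does not need to wait for a request to finish and be processed; it can send another request or do other things in the meantime. It also means that other requests can keep going even if some request fails or an error happens while handling it.

Installing Scrapy is skipped here.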

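Creating a project

# Create a Scrapy project named tutorial
scrapy startproject tutorial

# Generate a spider for the domain of the site to crawl
scrapy genspider <spider name, e.g. taobaoSpider> <domain of the target site>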

Directory structure

tutorial/
- scrapy.cfg-------------deploy configuration file
- tutorial/
  - __init__.py
  - items.py-----------item definitions (entity classes)
  - middlewares.py--------------project middlewares
  - pipelines.py-------------pipelines file; processes the data returned by spiders, e.g. further processing or saving of the extracted items
  - settings.py---------project settings file
  - spiders/------------directory where the spiders live
    - __init__.py

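Sharing the root directory between projects

A project root directory, the one that contains scrapy.cfg, may be shared by multiple Scrapy projects, each with its own settings module.

In that case, you must define one or more aliases for those settings modules: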

[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings

By default, the scrapy command-line tool uses the default settings. Use the SCRAPY_PROJECT environment variable to specify a different project for scrapy to use:

$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
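
The alias lookup can be approximated with the standard library. This is only a sketch of the idea and not Scrapy's own code: settings_module is a hypothetical helper, and the environment is passed in as a plain dict so the example does not depend on the real os.environ.

```python
import configparser

# Same [settings] section as in the scrapy.cfg shown above.
CFG_TEXT = """
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
"""


def settings_module(cfg_text, environ):
    """Return the settings module path for the selected project alias."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the "default" alias when SCRAPY_PROJECT is not set.
    alias = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", alias)


default_module = settings_module(CFG_TEXT, {})
project2_module = settings_module(CFG_TEXT, {"SCRAPY_PROJECT": "project2"})
print(default_module, project2_module)
```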

First spider code

import scrapy


class QuotesSpider(scrapy.Spider):
    # Must be unique within a project
    name = "quotes"

    # Must return an iterable of Requests (a list or a generator);
    # the spider starts crawling from these initial requests.
    # This method can also be shortened to:
    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/',
    # ]
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # The default callback.
    # Here it saves the raw page body to a local HTML file.
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

To run the spider, go to the project's top-level directory:
scrapy crawl quotes

Two new files have now been created: quotes-1.html and quotes-2.html (two HTML files that have not been processed in any way).

Extracting data with scrapy shell

Use the shell command to fetch a page and extract specific data

# The result is a list-like object
response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

# Extract the text of the title above, without the tags
response.css('title::text').getall()
['Quotes to Scrape']
# Extract the title above, including the tags
response.css('title').getall()
['<title>Quotes to Scrape</title>']

# .getall() returns a list that may hold several results; if you only want the first one
response.css('title::text').get()
'Quotes to Scrape'
# Or
response.css('title::text')[0].get()
'Quotes to Scrape'

# Besides getall() and get(), you can also extract results with re()
response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
response.css('title::text').re(r'Q\w+')
['Quotes']
response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
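
The three re() results above can be reproduced with the standard re module, which helps when testing a pattern outside the shell. This is a sketch of the observable behaviour only (all matches returned as one flat list of strings); re_all is a hypothetical helper and not part of Scrapy or parsel.

```python
import re

title_text = "Quotes to Scrape"


def re_all(pattern, text):
    """Return all matches as a flat list of strings (groups are flattened)."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            flat.extend(match)   # several groups -> one entry per group
        else:
            flat.append(match)
    return flat


whole = re_all(r"Quotes.*", title_text)
first_word = re_all(r"Q\w+", title_text)
groups = re_all(r"(\w+) to (\w+)", title_text)
print(whole, first_word, groups)
```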

Crawling all pages of a paginated site with urljoin()

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

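        # e.g. /page/2/
        next_page = response.css('li.next a::attr(href)').get()
        print('==================')
        print(next_page)
        if next_page is not None:
          # urljoin() builds an absolute URL, since the link can be relative, e.g.
          # http://quotes.toscrape.com/page/2/
          # http://quotes.toscrape.com/page/3/
          next_page = response.urljoin(next_page)
          print("----------------")
          print(next_page)
          # After extracting the data, parse() looks for the link to the next page,
          # builds a full absolute URL with urljoin(),
          # and yields a new request to the next page, registering itself as the callback
          # to handle the data extraction for that page and keep the crawl going through all pages.
          # In this example that creates a kind of loop, following the links to the next page
          # until none is found, which is handy for crawling blogs, forums and other paginated sites.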
          yield scrapy.Request(next_page, callback=self.parse)
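
How a relative link is resolved can be checked with the standard library's urllib.parse.urljoin, which follows the same URL resolution rules. The sketch below assumes the response URL is the first quotes page; the "../2/" form is an extra illustrative case that does not appear on the site.

```python
from urllib.parse import urljoin

# Assumed URL of the page currently being parsed.
page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "next" link of the quotes site.
absolute_from_root = urljoin(page_url, "/page/2/")

# A path-relative href is resolved against the current page's path.
absolute_from_relative = urljoin(page_url, "../2/")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "http://quotes.toscrape.com/page/3/")

print(absolute_from_root, absolute_from_relative, already_absolute)
```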

A shortcut for scrapy.Request()

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

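Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin.

You can also pass a selector to response.follow instead of a string; the selector should extract the necessary attribute: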

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

Using response.follow_all

follow_all is only available from Scrapy 2.0 onward; earlier versions do not have it, so it raised an error when I tested it on 1.8.0.

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

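Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.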
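
The idea behind duplicate filtering can be sketched with a set of request fingerprints. This is a simplified illustration and not Scrapy's actual filter: the fingerprint here hashes only the method and the URL, SeenFilter is a hypothetical class, and the author URLs are assumed examples.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: hash of the HTTP method and the URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers fingerprints and reports whether a request was seen before."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SeenFilter()
author_links = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/Jane-Austen",
    "http://quotes.toscrape.com/author/Albert-Einstein",  # same author quoted twice
]
to_download = [url for url in author_links if not dupefilter.request_seen("GET", url)]
print(to_download)
```

Passing dont_filter=True to a Request is the way to bypass this check for a single request.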

Building URLs from spider arguments

Pass arguments at run time with -a

scrapy crawl quotes -O quotes-humor.json -a tag=humor

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass tag=humor to this spider, you'll notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor

Second approach

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [f'http://www.example.com/categories/{category}']
        # ...

The default __init__ method takes any spider arguments and copies them to the spider as attributes. The example above could also be written as follows:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(f'http://www.example.com/categories/{self.category}')
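
The way arguments become attributes can be imitated with a plain class. SketchSpider below is a hypothetical stand-in, not a scrapy.Spider subclass; it only mirrors the behaviour described above, where extra keyword arguments are copied onto the instance. Keep in mind that values passed with -a arrive as strings.

```python
class SketchSpider:
    """Plain-Python stand-in for a spider; not a scrapy.Spider subclass."""

    def __init__(self, name=None, **kwargs):
        self.name = name
        # Copy every extra keyword argument onto the instance as an attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no tag argument was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


with_tag = SketchSpider(name="quotes", tag="humor").start_url()
without_tag = SketchSpider(name="quotes").start_url()
print(with_tag, without_tag)
```

Using getattr with a default, as the first approach does, avoids an AttributeError when the argument is omitted; the self.category form fails in that case.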

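Program entry point

from scrapy import cmdline


# The run file serves as the entry point of the program
# split() turns the command string into a list
# The first element is scrapy, the second is crawl, the third is the spider name example.com
# Use -a to add arguments and -o *.json to export
# For JSON output, set the export encoding in settings.py,
# otherwise non-ASCII text is not written as readable characters:
# FEED_EXPORT_ENCODING='utf-8'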

cmdline.execute('scrapy crawl example.com'.split())

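Spiders

A spider is a class that defines how a particular site (or group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from the pages (i.e. scrape items). In other words, spiders are where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

The spider cycle

First, the initial Requests to crawl the first URLs are generated, together with a callback function to be called with the response downloaded for each of those requests.

The first requests to perform are obtained by calling start_requests(), which (by default) generates a Request for each URL in start_urls, with the parse method as the callback.
In the callback, you parse the response (web page) and return item objects, Request objects, or an iterable of these objects. Those Requests also carry a callback (possibly the same one); Scrapy downloads them and their responses are handled by the specified callback.
In the callback, you parse the page contents, typically using selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer), and generate items with the parsed data.
Finally, the items returned from the spider are typically persisted to a database (in some Item Pipeline) or written to a file using feed exports (JSON, XML and so on).

Although this cycle applies (more or less) to any kind of spider, Scrapy bundles different kinds of default spiders for different purposes. These types are discussed here.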

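Types of spiders

scrapy.Spider

This is the simplest spider, and the one every other spider must inherit from. It provides no special functionality; it only supplies a default start_requests() implementation that sends requests for the URLs in the start_urls list attribute and calls the spider's parse method for each resulting response.

name
The spider's name; unique and required.

A spider is commonly named after the domain it crawls; for example, a spider that crawls mywebsite.com is often called mywebsite.
allowed_domains
The domains the spider is allowed to crawl. If the target URL is https://www.example.com/1.html, add 'example.com'.

If this attribute is not set, the crawl scope is not restricted.
start_urls
The list of URLs to start crawling from.
custom_settings
A dict of settings that override the project-wide settings for this spider. It must be defined as a class attribute, because the settings are updated before the spider is instantiated.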

name = 'myspider'

custom_settings = {
    'SOME_SETTING': 'some value',
}

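crawler
This attribute is set by the from_crawler() class method after the class is initialized, and links to the Crawler object to which this spider instance is bound.

Crawlers encapsulate a lot of components in the project for single-entry access (such as extensions, middlewares, signal managers and so on). See the Crawler API to learn more.

TODO: still to be studied
logger
A Python logger created with the spider's name. You can use it to send log messages.
log(message)
A wrapper that sends a log message through the spider's logger, kept for backward compatibility.
parse(response)
This is the default callback Scrapy uses to process downloaded responses when their requests do not specify a callback.

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow.
from_crawler(crawler, *args, **kwargs)
You probably won't need to override this directly, because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code.

Parameters
crawler (Crawler instance) – the crawler to which the spider will be bound

args (list) – arguments passed to the __init__() method

kwargs (dict) – keyword arguments passed to the __init__() method
start_requests()
This is the method to override if you want to change the requests used to start scraping a domain. For example, if you need to start by logging in with a POST request, you could do: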

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        pass

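closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.

There are many more attributes; I'll note them down when I need them.

Example: printing logs

ROBOTSTXT_OBEY = False needs to be set in the settings file.

When ROBOTSTXT_OBEY is enabled, Scrapy first requests the robots.txt file from the root of the target server before requesting the URLs we configured. This file states which parts of the site crawlers are allowed to fetch. The settings.py generated by scrapy startproject enables this setting, so such projects obey robots.txt unless it is turned off.

At first glance, self.logger and print() don't seem very different.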

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['itheima.com']
    start_urls = [
        'http://www.itheima.com/special/brandzly1/index.html?jingjiapphm2-heima-pinpaici-pc-heimachengxuyuanpeixun&bd_vid=10886117877466962985',
    ]

    def parse(self, response):
        print("========================")
        self.logger.info('A response from %s just arrived!', response.url)

Example: multiple requests

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)


# Or
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

Item classes

In the items.py file

import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

class scrapy.spiders.CrawlSpider

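This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best fit for your particular website or project, but it is generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (which you must specify), this class supports a new attribute:

rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one is used, according to the order in which they are defined in this attribute.

This spider also exposes an overridable method:

parse_start_url(response, **kwargs)
This method is called for each response produced for the URLs in the spider's start_urls attribute. It allows parsing the initial responses and must return an item object, a Request object, or an iterable containing any of them.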

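Rule parameters

class scrapy.spiders.Rule(link_extractor=None, callback=None,
cb_kwargs=None, follow=None,
process_links=None, process_request=None, errback=None)

link_extractor: defines the rules for extracting URLs
A Link Extractor object that defines how links are extracted from each crawled page. Each produced link is used to generate a Request object, which contains the link's text in its meta dictionary (under the link_text key). If omitted, a default link extractor created with no arguments is used, resulting in all links being extracted.
callback: the callback for the requests generated
A callable or a string (in which case a method of the spider object with that name is used) to be called for each link extracted with the specified link extractor. This callback receives a Response as its first argument and must return item objects and/or Request objects (or any subclass of them). As mentioned above, the received Response object contains the text of the link that produced the Request in its meta dictionary (under the link_text key).
cb_kwargs: arguments to pass to the callback
A dict containing the keyword arguments to be passed to the callback function.
errback: the callback for errors raised while processing the request
A callable or a string (in which case a method of the spider object with that name is used) to be called if any exception is raised while processing a request generated by the rule. It receives a Twisted Failure instance as its first argument.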

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        url = response.xpath('//td[@id="additional_data"]/@href').get()
        return response.follow(url, self.parse_additional_page, cb_kwargs=dict(item=item))

    def parse_additional_page(self, response, item):
        item['additional_data'] = response.xpath('//p[@id="additional_data"]/text()').get()
        return item

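This spider starts crawling example.com's home page, collecting category links and item links, and parses the latter with the parse_item method. For each item response, some data is extracted from the HTML using XPath and an Item is filled with it.

I haven't actually used CrawlSpider. There are other spiders I haven't used either, for example:

XMLFeedSpider: designed for parsing XML feeds
CSVFeedSpider: handles .csv files
SitemapSpider: crawls a site through its sitemaps; different URLs can have different callbacks:
sitemap_urls = ['http://www.example.com/sitemap.xml']
sitemap_rules = [
    ('/product/', 'parse_product'),
    ('/category/', 'parse_category'),
]

TODO: I'll come back to selectors when I have time. Most people probably spend most of their effort parsing HTML, but I don't need that at all for crawling TV set-top boxes.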

Item Pipeline

Defining item classes

from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

# Or
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

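After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each pipeline component is a Python class that implements a simple method. It receives an item and performs an action on it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database

Writing your own pipeline

Each item pipeline component is a Python class that must implement process_item; the other methods below are optional:

process_item(self, item, spider)
item: the scraped item
spider: the spider that scraped the item
open_spider(self, spider)
Called when the spider is opened.
close_spider(self, spider)
Called when the spider is closed.
from_crawler(cls, crawler)
A class method that creates the pipeline instance from a Crawler. It is typically used to read values from the settings (there is an example below), though there seems to be more to it than that. TODO: explore further.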

Pipeline example: validating prices and dropping items without a price

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class PricePipeline:

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")

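Writing items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, containing one item per line serialized in JSON format: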

import json

from itemadapter import ItemAdapter

class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

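    # item is an instance of the item class you defined; spider is the spider you defined.
    # Items must be returned from the spider with return or yield; yield seems to be the more common choice in Python.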
    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item

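The purpose of JsonWriterPipeline is just to show how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use Feed exports instead.

Writing to MongoDB

In this example we write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection name is taken from the collection_name class attribute.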

import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            # Falls back to 'items' if MONGO_DATABASE is not defined in the settings
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item

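TODO: taking a screenshot of an item

Not needed for now; to be revisited when there is time.

Activating an item pipeline

In the settings file:

# The integer values assigned to the classes in this setting determine the order in which they run:
# items go through from lower-valued to higher-valued classes. It is customary to define these numbers in the 0-1000 range.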

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
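
How the order values are applied can be sketched without Scrapy. Everything below is an illustrative stand-in: DropItem is a local exception rather than scrapy.exceptions.DropItem, UppercaseNamePipeline is a hypothetical second stage, and process_item omits the spider argument that real pipelines receive.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PricePipeline:
    def process_item(self, item):
        # Reject items that have no price.
        if not item.get("price"):
            raise DropItem(f"Missing price in {item}")
        return item


class UppercaseNamePipeline:  # hypothetical second stage
    def process_item(self, item):
        return {**item, "name": item["name"].upper()}


# Same shape as ITEM_PIPELINES: component -> order value.
PIPELINES = {UppercaseNamePipeline: 800, PricePipeline: 300}


def run_pipelines(item):
    """Pass the item through the stages from the lowest to the highest number."""
    for cls in sorted(PIPELINES, key=PIPELINES.get):
        item = cls().process_item(item)
    return item


kept = run_pipelines({"name": "tv box", "price": 10})
try:
    run_pipelines({"name": "no price"})
    dropped = False
except DropItem:
    dropped = True
print(kept, dropped)
```

Because the price check has the lower number, an item without a price is dropped before it ever reaches the later stage.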

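Requests and Responses

Request objects

class scrapy.http.Request(*args, **kwargs)

A Request object represents an HTTP request, which is usually generated in the spider and executed by the Downloader, thus generating a Response.

Parameters

url (str) – the URL of this request. If the URL is invalid, a ValueError exception is raised.
callback (collections.abc.Callable) – the function called with the response of this request once it has been downloaded. If a request does not specify a callback, the spider's parse() method is used. If an exception is raised during processing, errback is called instead.
method (str) – the HTTP method of this request. Defaults to 'GET'; use uppercase.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in is shallow copied. It can be used to pass data to the callback, which reads it from the response.meta attribute.
body (bytes or str) – the request body. If a string is passed, it is encoded using the given encoding (default utf-8). If body is not given, an empty bytes object is stored. Regardless of the type of this argument, the final value stored is a bytes object (never a str or None).
headers (dict) – the headers of this request. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers). If None is passed as a value, that HTTP header is not sent at all.
cookies (dict or list) –

The request cookies. These can be sent in two forms.

Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency', 'value': 'USD',
                                         'domain': 'example.com', 'path': '/currency'}])

The latter form allows customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When some site returns cookies (in a response), they are stored in the cookies for that domain and sent again in future requests. That is the typical behaviour of any regular web browser.

To create a request that does not send stored cookies and does not store received cookies, set the dont_merge_cookies key to True in request.meta.

Example of a request that sends manually defined cookies and ignores cookie storage: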

Request(
    url="http://www.example.com",
    cookies={'currency': 'USD', 'country': 'UY'},
    meta={'dont_merge_cookies': True},
)

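encoding (str) – the encoding of this request (defaults to 'utf-8'). This encoding is used to percent-encode the URL and to convert the body to bytes (if given as a string).
priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to define the order used to process requests. Requests with a higher priority value execute earlier. Negative values are allowed to indicate relatively low priority.
dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
errback (collections.abc.Callable) – a function that will be called if any exception is raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as its first argument.
flags (list) – flags sent to the request; can be used for logging or similar purposes.
cb_kwargs (dict) – a dict with arbitrary data that is passed as keyword arguments to the request's callback. The callback can also read it through response.cb_kwargs. If the request fails, it can be read in the errback through failure.request.cb_kwargs.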

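Methods

copy()
Returns a new Request which is a copy of this Request.
replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])
Returns a Request object with the same members, except for those given new values by the specified keyword arguments. The Request.cb_kwargs and Request.meta attributes are shallow copied by default (unless new values are given as arguments).
classmethod from_curl(curl_command, ignore_unknown_options=True, **kwargs)
Creates a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the Request class, which take precedence and override the values of the same arguments contained in the cURL command.

Unrecognized options are ignored by default. To raise an error when finding unknown options, call this method passing ignore_unknown_options=False.
Warning

Using from_curl() from Request subclasses, such as JSONRequest or XmlRpcRequest, as well as having downloader middlewares and spider middlewares enabled, such as DefaultHeadersMiddleware, UserAgentMiddleware or HttpCompressionMiddleware, may modify the Request object.

To translate a cURL command into a Scrapy request, you may use curl2scrapy.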

Passing additional data to callback functions

Callback for a successful response

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Callback for a failed request

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             errback=self.errback_page2,
                             cb_kwargs=dict(main_url=response.url))
    yield request

def parse_page2(self, response, main_url):
    pass

def errback_page2(self, failure):
    yield dict(
        main_url=failure.request.cb_kwargs['main_url'],
    )

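Note
Request.cb_kwargs was introduced in version 1.7. Before that, using Request.meta was the recommended way to pass information to callbacks. Since 1.7, Request.cb_kwargs is the preferred way to handle user information.

Using errbacks to catch exceptions in request processing

The errback of a request is a function that is called when an exception is raised while processing it.

It receives a Failure as its first argument and can be used to track connection establishment timeouts, DNS errors and so on.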

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

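The FormRequest subclass

class scrapy.http.FormRequest(url[, formdata, ...])

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None,
formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example, see "Using FormRequest.from_response() to simulate a user login".

The policy is to automatically simulate a click, by default, on any form control that looks clickable, like an <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems that are hard to debug. For example, when working with forms that are filled and/or submitted using JavaScript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour, set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it), you can use the clickdata argument.

Using FormRequest.from_response() to simulate a user login

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here is an example spider that uses it: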


import scrapy


def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

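Response objects

class scrapy.http.Response(*args, **kwargs)

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the spiders for processing.

Parameters

url (str) – the URL of this response
status (int) – the HTTP status of the response. Defaults to 200.
headers (dict) – the headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).
Values can be accessed using get() to return the first header value with the specified name, or getlist() to return all header values with the specified name. For example, this call gives you all cookies in the headers:

response.headers.getlist('Set-Cookie')

body (bytes) – the response body. To access the decoded text as a string, use response.text from an encoding-aware Response subclass, such as TextResponse.
flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list is shallow copied.
request (scrapy.http.Request) – the initial value of the Response.request attribute. This represents the Request that generated this response.

HTTP redirections cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).

response.request.url doesn't always equal response.url
certificate (twisted.internet.ssl.Certificate) – an object representing the server's SSL certificate.
ip_address (ipaddress.IPv4Address or ipaddress.IPv6Address) – the IP address of the server from which the response originated.

ip_address is only available from version 2.1.0.

TODO: Response subclasses

class scrapy.http.TextResponse(
    url[, encoding[, ...]])

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.

I'll look at this when I have time.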

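The settings file

Settings can be populated using different mechanisms, each with a different precedence. Here is the list in decreasing order of precedence:

Command-line options (highest precedence)

scrapy crawl myspider -s LOG_FILE=scrapy.log

Settings per spider

These settings take precedence over, and override, the project settings. They are defined through the custom_settings attribute: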

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

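Project settings module

The project settings module is the standard configuration file for your Scrapy project; it is where most of your custom settings are populated. For a standard Scrapy project, this means adding or changing settings in the settings.py file created for your project.

Default settings per command
Default global settings (lowest precedence)

The population of these settings sources is taken care of internally, but manual handling is possible using API calls.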
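
The precedence order can be sketched as a small class in which every value remembers the priority of the source that set it. LayeredSettings is a hypothetical illustration, not Scrapy's Settings class, and the priority numbers are an assumption meant to mirror the ordering listed above (a higher number wins).

```python
# Assumed priority numbers, ordered like the list above (higher wins).
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class LayeredSettings:
    def __init__(self):
        self._values = {}  # name -> (value, priority number)

    def set(self, name, value, source):
        priority = PRIORITIES[source]
        current = self._values.get(name)
        # Only overwrite when the new source is at least as important.
        if current is None or priority >= current[1]:
            self._values[name] = (value, priority)

    def get(self, name, default=None):
        value = self._values.get(name)
        return default if value is None else value[0]


settings = LayeredSettings()
settings.set("LOG_FILE", None, "default")
settings.set("LOG_FILE", "from_cmdline.log", "cmdline")
settings.set("LOG_FILE", "from_project.log", "project")  # ignored: lower priority
settings.set("DOWNLOAD_DELAY", 0.25, "project")
settings.set("DOWNLOAD_DELAY", 1.0, "spider")            # custom_settings wins

log_file = settings.get("LOG_FILE")
delay = settings.get("DOWNLOAD_DELAY")
print(log_file, delay)
```

The order in which the sources are applied does not matter in this model; only the priority of each source does.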

How to access settings

In a spider, the settings are available through self.settings:

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print(f"Existing settings: {self.settings.attributes.keys()}")

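Note
The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before initialization (e.g. in your spider's __init__() method), you'll need to override the from_crawler() method.

In extensions, middlewares and item pipelines, settings are accessed through the settings attribute of the Crawler that is passed to from_crawler: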

class MyExtension:
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))

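Built-in settings

BOT_NAME
Default: 'scrapybot'

The name of the bot implemented by this Scrapy project (also known as the project name). This name is also used for logging.

CONCURRENT_ITEMS
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the item pipelines.
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (i.e. simultaneous) requests performed by the Scrapy downloader.
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (i.e. simultaneous) requests performed to any single domain.
CONCURRENT_REQUESTS_PER_IP
Default: 0

The maximum number of concurrent (i.e. simultaneous) requests performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored and this one is used instead. In other words, concurrency limits are applied per IP, not per domain.

This setting also affects DOWNLOAD_DELAY and the AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, the download delay is enforced per IP, not per domain.

DEFAULT_ITEM_CLASS
Default: 'scrapy.item.Item'
The default class used for instantiating items in the Scrapy shell.
DEFAULT_REQUEST_HEADERS
Default: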

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

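The default headers used for Scrapy HTTP requests.

DEPTH_LIMIT
Default: 0
Scope: scrapy.spidermiddlewares.depth.DepthMiddleware
The maximum depth allowed to crawl for any site. If zero, no limit is imposed.
DOWNLOAD_DELAY
Default: 0

The amount of time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example: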

DOWNLOAD_DELAY = 0.25    # 250 ms of delay

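DOWNLOAD_TIMEOUT
Default: 180
The amount of time (in seconds) that the downloader waits before timing out.
LOG_ENABLED
Default: True

Whether to enable logging.

LOG_ENCODING
Default: 'utf-8'

The encoding to use for logging.

LOG_FILE
Default: None

File name to use for logging output. If None, standard error is used.

LOG_LEVEL
Default: 'DEBUG'

Minimum level to log. Available levels are: CRITICAL, ERROR, WARNING, INFO, DEBUG.

ROBOTSTXT_OBEY
Default: False

Scope: scrapy.downloadermiddlewares.robotstxt

If enabled, Scrapy respects robots.txt policies. Note that the settings.py generated by scrapy startproject sets it to True.

RETRY_ENABLED
Retrying failed HTTP requests can slow down the crawl substantially, especially when sites are very slow to respond (or fail to respond), causing timeout errors that get retried many times unnecessarily and preventing crawler capacity from being reused for other domains. To disable retries:
RETRY_ENABLED = False
RETRY_TIMES

# Maximum number of retries; default 2
RETRY_TIMES = 5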
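
Put together, several of the settings above might appear in a project's settings.py as in the fragment below. The setting names are the ones described in this section plus FEED_EXPORT_ENCODING from the entry-point notes; the values are illustrative assumptions for a gentle crawl, not recommendations.

```python
# settings.py (fragment) -- illustrative values only

CONCURRENT_REQUESTS = 16             # overall cap on simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # lower than the default of 8
DOWNLOAD_DELAY = 0.25                # 250 ms between requests to the same site
DOWNLOAD_TIMEOUT = 60                # give up on a download after 60 seconds

RETRY_ENABLED = True
RETRY_TIMES = 2                      # the default number of retries

LOG_LEVEL = "INFO"                   # hide DEBUG messages
ROBOTSTXT_OBEY = True                # respect robots.txt
FEED_EXPORT_ENCODING = "utf-8"       # readable non-ASCII text in exported feeds
```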

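Exceptions

Closing the spider

CloseSpider
exception scrapy.exceptions.CloseSpider(reason='cancelled')
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

Parameters
reason (str) – the reason for closing

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')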
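
Since response.body is a bytes object, a membership test against it needs a bytes needle (or decoded text). The sketch below shows the difference with a hand-made body value and needs no Scrapy; the variable names are made up for the illustration.

```python
# A hand-made stand-in for response.body, which is bytes.
body = b"<html>Bandwidth exceeded</html>"

# Works: bytes needle in a bytes haystack.
found_with_bytes = b"Bandwidth exceeded" in body

# Fails on Python 3: a str needle cannot be searched for in bytes.
try:
    "Bandwidth exceeded" in body
    str_check_failed = False
except TypeError:
    str_check_failed = True

# Also works: decode first and compare text with text, similar to using response.text.
found_in_text = "Bandwidth exceeded" in body.decode("utf-8")

print(found_with_bytes, str_check_failed, found_in_text)
```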

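Logging

Python's built-in logging defines 5 different levels to indicate the severity of a given log message. Here are the standard ones, listed in decreasing order:

logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)

How to log a message at the logging.WARNING level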

import logging
logging.warning("This is a warning")

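There are shortcuts for issuing log messages on any of the standard 5 levels, and there is also a general logging.log method that takes a given level as an argument.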

import logging
logging.log(logging.WARNING, "This is a warning")

On top of that, you can create different "loggers" to encapsulate messages.

import logging
logger = logging.getLogger()
logger.warning("This is a warning")


import logging
logger = logging.getLogger('mycustomlogger')
logger.warning("This is a warning")

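Finally, you can make sure every module you work on has its own logger by using the __name__ variable, which is populated with the current module's path: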

import logging
logger = logging.getLogger(__name__)
logger.warning("This is a warning")
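
The effect of a minimum level can be seen with the standard library alone. In this sketch, ListHandler is a hypothetical handler that collects messages in a list so the result is easy to inspect; the logger name reuses mycustomlogger from above, and setLevel plays a role similar to the LOG_LEVEL setting.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}:{record.name}:{record.getMessage()}")


logger = logging.getLogger("mycustomlogger")
logger.setLevel(logging.WARNING)   # minimum level that gets through
logger.propagate = False           # keep the demo out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.debug("not recorded")
logger.info("not recorded either")
logger.warning("This is a warning")
logger.error("Something failed for %s", "http://example.com")

print(handler.messages)
```

Passing the URL as a separate argument, as in the last call, lets logging skip the string formatting when the message is filtered out.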

Logging from spiders

Scrapy provides a logger within each spider instance, which can be accessed and used like this:

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)

That logger is created using the spider's name, but you can use any custom Python logger you want. For example:

import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)

TODO: advanced logging customization; to be read when there is time.

Debugging spiders

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
        )

    def parse(self, response):
        # <processing code not shown>
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        # <processing code not shown>
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})

    def parse_details(self, response, item):
        # populate more `item` fields
        return item

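After reading the documentation for a long time I was getting confused. There is more below this point, but I could no longer follow it.

All content is reproduced from: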
https://www.osgeo.cn/scrapy/intro/overview.html#walk-through-of-an-example-spider
https://www.jianshu.com/p/8e78dfa7c368
https://www.cnblogs.com/xiaojwang/p/11331202.html
https://blog.csdn.net/qq_42543250/article/details/81347368
https://blog.csdn.net/u012106306/article/details/100040680
https://blog.csdn.net/peiwang245/article/details/102579644
https://blog.csdn.net/ck784101777/article/details/104468780/