Scrapy Source Code Analysis 15: Requests and Responses Ⅳ

2021SC@SDUSC

PART5: Accessing additional data in errback functions

In case of a failure to process the request, you may be interested in accessing arguments to the callback functions so you can process further based on the arguments in the errback. The following example from the official documentation shows how to achieve this by using Failure.request.cb_kwargs:

def parse(self, response):
    # Attach main_url as a keyword argument for both the callback and the errback
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             errback=self.errback_page2,
                             cb_kwargs=dict(main_url=response.url))
    yield request

def parse_page2(self, response, main_url):
    pass

def errback_page2(self, failure):
    # cb_kwargs remain reachable through the failed request
    yield dict(
        main_url=failure.request.cb_kwargs['main_url'],
    )

PART6: Request.meta special keys

The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.

They include:
bindaddress

cookiejar

dont_cache

dont_merge_cookies

dont_obey_robotstxt

dont_redirect

dont_retry

download_fail_on_dataloss

download_latency

download_maxsize

download_timeout

ftp_password (See FTP_PASSWORD for more info)

ftp_user (See FTP_USER for more info)

handle_httpstatus_all

handle_httpstatus_list

max_retry_times

proxy

redirect_reasons

redirect_urls

referrer_policy

bindaddress: The outgoing IP address to use for performing the request.

download_timeout: The amount of time (in seconds) that the downloader will wait before timing out.

download_latency: The amount of time spent fetching the response since the request was started, i.e. since the HTTP message was sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

download_fail_on_dataloss: Whether or not to fail on broken responses.

max_retry_times: This meta key is used to set the maximum number of retries per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.

Stopping the download of a Response

Raising a StopDownload exception from a handler for the bytes_received or headers_received signals will stop the download of a given response. Here is an example:

import scrapy


class StopSpider(scrapy.Spider):
    name = "stop"
    start_urls = ["https://docs.scrapy.org/en/latest/"]

    @classmethod
    def from_crawler(cls, crawler):
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.on_bytes_received, signal=scrapy.signals.bytes_received)
        return spider

    def parse(self, response):
        # 'last_chars' shows that the full response was not downloaded
        yield {"len": len(response.text), "last_chars": response.text[-40:]}

    def on_bytes_received(self, data, request, spider):
        raise scrapy.exceptions.StopDownload(fail=False)

By default, resulting responses are handled by their corresponding errbacks. To call their callback instead, like in this example, pass fail=False to the StopDownload exception.

Request subclasses: Here is the list of built-in Request subclasses. You can also subclass Request to implement your own custom functionality.

FormRequest objects: The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.

The most important of these is the class method from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

It returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response.

The policy is to automatically simulate a click, by default, on any form control that looks clickable, like an <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can use the clickdata argument.
