2021SC@SDUSC
PART5: Accessing additional data in errback functions
In case of a failure to process the request, you may be interested in accessing arguments to the callback functions, so you can process further based on those arguments in the errback. The following example from the official documentation shows how to achieve this by using Failure.request.cb_kwargs:
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             errback=self.errback_page2,
                             cb_kwargs=dict(main_url=response.url))
    yield request

def parse_page2(self, response, main_url):
    pass

def errback_page2(self, failure):
    yield dict(
        main_url=failure.request.cb_kwargs['main_url'],
    )
PART5: Request.meta special keys
The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.
They include:
bindaddress
cookiejar
dont_cache
dont_merge_cookies
dont_obey_robotstxt
dont_redirect
dont_retry
download_fail_on_dataloss
download_latency
download_maxsize
download_timeout
ftp_password (See FTP_PASSWORD for more info)
ftp_user (See FTP_USER for more info)
handle_httpstatus_all
handle_httpstatus_list
max_retry_times
proxy
redirect_reasons
redirect_urls
referrer_policy
bindaddress: The outgoing IP address to use for performing the request.
download_timeout: The amount of time (in seconds) that the downloader will wait before timing out.
download_latency: The amount of time spent to fetch the response since the request was started, i.e. the HTTP message sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss: Whether or not to fail on broken responses.
max_retry_times: This meta key is used to set the retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.
Stopping the download of a Response
Raising a StopDownload exception from a handler for the bytes_received or headers_received signals will stop the download of a given response.
Here is an example:
import scrapy

class StopSpider(scrapy.Spider):
    name = "stop"
    start_urls = ["https://docs.scrapy.org/en/latest/"]

    @classmethod
    def from_crawler(cls, crawler):
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.on_bytes_received, signal=scrapy.signals.bytes_received)
        return spider

    def parse(self, response):
        # 'last_chars' show that the full response was not downloaded
        yield {"len": len(response.text), "last_chars": response.text[-40:]}

    def on_bytes_received(self, data, request, spider):
        raise scrapy.exceptions.StopDownload(fail=False)
By default, resulting responses are handled by their corresponding errbacks. To call their callback instead, like in this example, pass fail=False to the StopDownload exception.
Request subclasses: Scrapy ships with a list of built-in Request subclasses, and you can also subclass Request to implement your own custom functionality.
FormRequest objects: The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.
Particularly important is the class method from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
It returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response.
The policy is to automatically simulate a click, by default, on any form control that looks clickable, like an <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using JavaScript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour, you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it), you can use the clickdata argument.