Official documentation: https://docs.scrapy.org/en/latest/topics/request-response.html#module-scrapy.http
I. Request objects
class scrapy.http.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None, cb_kwargs=None)
1 Parameters
- url (string) – the URL of this request
- callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. If a Request doesn't specify a callback, the spider's parse() method will be used.
- method (string) – the HTTP method of this request. Defaults to 'GET'. (GET parameters are passed through the URL; POST data is placed in the request body.)
- headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
- body (str or unicode) – the request body. (A POST request carries a request body; a GET request does not.)
- meta (dict) – carries information along with the Request as it is generated. It must be a dict (its values can be of any type: numbers, strings, lists, dicts, and so on). The data goes in through the Request and comes back out through that Request's response, as response.meta (see the sketch after the cookie examples below).
- priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
- dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
- encoding (string) – the encoding of this request (defaults to 'utf-8')
- errback (callable) – a function that will be called if any exception is raised while processing the request.
- cb_kwargs (dict) – a dict with arbitrary data that will be passed as keyword arguments to the Request's callback.
- cookies (dict or list) – the request cookies. These can be sent in two forms.
# Form 1: using a dict:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
# Form 2: using a list of dicts:
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
# The latter form allows for customizing the domain and path attributes of the cookie.
# This is only useful if the cookies are saved for later requests.
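A minimal sketch of how several of these parameters combine in a spider. The URLs and the parse_detail / handle_error names are placeholders for illustration, not part of the official API:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        yield scrapy.Request(
            url='http://www.example.com/detail',
            callback=self.parse_detail,          # called with the resulting response
            cb_kwargs={'category': 'demo'},      # extra keyword arguments for the callback
            meta={'page': 1},                    # travels along with the request
            priority=10,                         # scheduled before priority-0 requests
            errback=self.handle_error,           # called on download/processing errors
        )

    def parse_detail(self, response, category):
        # cb_kwargs arrive as keyword arguments; meta comes back on the response
        self.logger.info('category=%s page=%s', category, response.meta['page'])

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure describing what went wrong
        self.logger.error(repr(failure))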
2 Attributes and methods
- url: A string containing the URL of this request
- method: A string representing the HTTP method in the request
- headers: A dictionary-like object which contains the request headers
- body: A str that contains the request body
- copy(): Return a new Request which is a copy of this Request
- replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback]): Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The Request.cb_kwargs and Request.meta attributes are shallow copied by default (unless new values are given as arguments).
- meta: A dict that contains arbitrary metadata for this request. This dict is empty for new Requests. This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
Request.meta special keys
- dont_redirect: if Request.meta contains the dont_redirect key (set to True), the request will be ignored by the RedirectMiddleware (redirects will not be followed)
- dont_retry: if Request.meta contains the dont_retry key, the request will be ignored by the RetryMiddleware (failed requests will not be retried)
- handle_httpstatus_list: the handle_httpstatus_list key in Request.meta can be used to specify which response codes are allowed for a given request
- handle_httpstatus_all: set handle_httpstatus_all to True to allow any response code for a request
- dont_merge_cookies: set dont_merge_cookies to True in Request.meta to avoid merging with existing cookies
- cookiejar: Scrapy supports keeping multiple cookie sessions per spider via the cookiejar key in Request.meta. By default it uses a single cookie jar (session), but you can pass an identifier to use different ones (see the sketch after this list)
- dont_cache: set the dont_cache meta key to True to avoid caching the response under any cache policy
- redirect_reasons: the reason behind each redirect the request went through can be found in the redirect_reasons key of Request.meta
- redirect_urls: the URLs a (redirected) request passed through on its way through the RedirectMiddleware can be found in the redirect_urls key of Request.meta
- bindaddress: the outgoing IP address to use for performing the request
- dont_obey_robotstxt: if the dont_obey_robotstxt key in Request.meta is set to True, the RobotsTxtMiddleware will ignore the request even if ROBOTSTXT_OBEY is enabled
- download_timeout: the amount of time (in seconds) that the downloader will wait before timing out
- download_maxsize: the maximum response size (in bytes) that the downloader will download for this request
- download_latency: the amount of time spent to fetch the response since the request was started, i.e. the HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only
- download_fail_on_dataloss: whether or not to fail on broken responses (responses with data loss)
- proxy: set a per-request HTTP proxy, with a value like http://some_proxy_server:port
- ftp_user: the username to use for FTP connections
- ftp_password: the password to use for FTP connections
- referrer_policy: set the referrer policy per request
- max_retry_times: this meta key is used to set the retry times per request. When initialized, the max_retry_times meta key takes precedence over the RETRY_TIMES setting
- cb_kwargs: A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as an argument. This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.cb_kwargs attribute.
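A short sketch of setting some of the special meta keys above on a single request; the proxy address, timeout, and cookiejar identifier are made-up placeholders:

import scrapy

# Each key below is read by the specific built-in middleware described above
req = scrapy.Request(
    'http://www.example.com',
    meta={
        'proxy': 'http://some_proxy_server:8080',  # placeholder proxy server
        'download_timeout': 30,                    # seconds before the downloader gives up
        'dont_redirect': True,                     # RedirectMiddleware leaves this request alone
        'handle_httpstatus_list': [404],           # let the spider see 404 responses too
        'cookiejar': 1,                            # keep this request in cookie session #1
        'max_retry_times': 2,                      # overrides the RETRY_TIMES setting
    },
)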
3 FormRequest
The FormRequest class extends the base Request with functionality for dealing with HTML forms. It adds one new keyword parameter, formdata, to the constructor; the remaining arguments are the same as for the Request class and are not documented here.
class scrapy.http.FormRequest(url[, formdata, ...])
- formdata (dict or iterable of tuples) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.
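For example, a form POST built from a dict might look like the following sketch (the URL and field names are placeholders):

from scrapy.http import FormRequest

# formdata is url-encoded into the request body;
# when formdata is given, FormRequest defaults to method='POST'
request = FormRequest(
    url='http://www.example.com/post',
    formdata={'name': 'scrapy', 'age': '10'},
)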
Class method:
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
The FormRequest objects support the following class method in addition to the standard Request methods.
Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response.
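A typical use of from_response() is simulating a login by pre-filling the form found on the page and overriding only the credential fields; the URLs and field names below are placeholders:

import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # Pre-populate the <form> found on the login page, overriding two fields
        return FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Crude check that the login worked before continuing to scrape
        if 'authentication failed' in response.text:
            self.logger.error('Login failed')
            return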
Reference: https://www.cnblogs.com/zhang293/p/8352479.html
II. Response objects
class scrapy.http.Response(url, status=200, headers=None, body=b'', flags=None, request=None)
1 Parameters
- url (string) – the URL of this response
- status (integer) – the HTTP status of the response. Defaults to 200.
- headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
- body (bytes) – the response body. To access the decoded text as str, you can use response.text from an encoding-aware Response subclass, such as TextResponse.
- request (Request object) – the initial value of the Response.request attribute. This represents the Request that generated this response (see the sketch below).
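Responses are normally created by the framework, but constructing one by hand is handy for exercising a parse method without a network call; a minimal sketch using HtmlResponse (an encoding-aware TextResponse subclass):

from scrapy.http import HtmlResponse

# Build a fake response so a spider's parse() can be tested offline
body = b'<html><body><a href="/next">next</a></body></html>'
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')

print(response.status)                      # 200 (the default)
print(response.css('a::attr(href)').get())  # '/next'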
2 Attributes and methods
- url:A string containing the URL of the response.
- status:An integer representing the HTTP status of the response.
- headers:A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name or getlist() to return all header values with the specified name.
get('keyname'): returns the first value for the given header name (see the sketch after the follow() example below)
getlist('keyname'): returns all values for the given header name, as a list
- body:The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
- request:The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
- HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
- Response.request.url doesn’t always equal Response.url
- This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
- meta: Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
- copy():Returns a new Response which is a copy of this Response.
- replace([url, status, headers, body, request, flags, cls]):Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
- urljoin(url):Constructs an absolute url by combining the Response’s url with a possible relative url.
print('url:', response.url) # http://www.example.com
print('new url:', response.urljoin('suburl')) # http://www.example.com/suburl
- follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None)
Return a Request instance to follow a link url. It accepts the same arguments as Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.
# Approach 1: build the absolute URL of the next page yourself, then yield a new Request
next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

# Approach 2: no need to build the absolute URL; follow() does it for us
next_page = target_a.css('::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
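A small sketch tying the attributes above together inside a callback; the header names are just common examples:

def parse(self, response):
    # headers is a case-insensitive, dict-like object
    content_type = response.headers.get('Content-Type')   # first value with that name
    cookies = response.headers.getlist('Set-Cookie')      # every value with that name

    # body is always raw bytes; text (on TextResponse subclasses) is the decoded str
    assert isinstance(response.body, bytes)

    # urljoin() resolves a relative href against response.url;
    # follow() builds the next Request in one step
    yield response.follow('suburl', callback=self.parse)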
Appendix: HTTP request and response messages
An HTTP request message consists of three parts (request line + request headers + request body)
An HTTP response message also consists of three parts (status line + response headers + response body)
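For illustration, a minimal exchange might look like this (a made-up example):

POST /login HTTP/1.1                             <- request line
Host: www.example.com                            <- request headers
Content-Type: application/x-www-form-urlencoded
Content-Length: 29
                                                 <- blank line ends the headers
username=john&password=secret                    <- request body

HTTP/1.1 200 OK                                  <- status line
Content-Type: text/html                          <- response headers
Content-Length: 45

<html><body>Welcome back, john!</body></html>    <- response body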