Scrapy Source Code Analysis 17: Requests and Responses Ⅵ (Finally Finished)

及锋而试

Published on 2021-12-06 13:19:40

Article link: https://blog.csdn.net/No_oneelse/article/details/121744663

2021SC@SDUSC

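Finally, to wrap up the Software Engineering Applications course, I stayed up late and worked through the response part of the code. The analysis below follows the official documentation:

Response objects:

Class: scrapy.http.Response(*args, **kwargs) (the source code is attached at the end)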

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.

Parameters

url (str) – the URL of this response
status (int) – the HTTP status of the response. Defaults to 200.
headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
body (bytes) – the response body. To access the decoded text as a string, use response.text from an encoding-aware Response subclass, such as TextResponse.
flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
request (scrapy.http.Request) – the initial value of the Response.request attribute. This represents the Request that generated this response.
certificate (twisted.internet.ssl.Certificate) – an object representing the server’s SSL certificate.
ip_address (ipaddress.IPv4Address or ipaddress.IPv6Address) – The IP address of the server from which the Response originated.
protocol (str) – The protocol that was used to download the response. For instance: “HTTP/1.0”, “HTTP/1.1”, “h2”
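
To make a few of these parameter rules concrete, here is a minimal sketch. `MiniResponse` is a hypothetical stand-in written for this post, not Scrapy's class; it only mirrors the checks visible in the `__init__`, `_set_url` and `_set_body` code attached at the end (status coerced to `int`, body restricted to bytes, flags shallow-copied).

```python
class MiniResponse:
    """Hypothetical stand-in that mirrors a few constructor rules only."""

    def __init__(self, url, status=200, headers=None, body=b"", flags=None):
        if not isinstance(url, str):
            raise TypeError(f"url must be str, got {type(url).__name__}")
        if body is None:
            body = b""
        elif not isinstance(body, bytes):
            raise TypeError("body must be bytes")
        self.url = url
        self.status = int(status)           # "404" is coerced to 404
        self.headers = dict(headers or {})  # values: str or list of str
        self.body = body
        # Shallow copy: later changes to the caller's list are not seen here.
        self.flags = [] if flags is None else list(flags)


initial_flags = ["cached"]
resp = MiniResponse(
    "https://example.com/",
    status="404",
    headers={"Content-Type": "text/html", "Set-Cookie": ["a=1", "b=2"]},
    flags=initial_flags,
)
initial_flags.append("redirected")  # does not affect resp.flags
print(resp.status, resp.flags)      # 404 ['cached']
```

The multi-valued `Set-Cookie` entry shows the "list for multi-valued headers" case from the parameter description; the real class wraps headers in its own `Headers` type rather than a plain dict.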

Method: follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) → Generator[scrapy.http.request.Request, None, None]

New in version 2.0.

Returns an iterable of Request instances to follow all links in urls. It accepts the same arguments as the Request.__init__ method, but elements of urls can be relative URLs or Link objects, not only absolute URLs.

TextResponse provides a follow_all() method which supports selectors in addition to absolute/relative URLs and Link objects.

def follow_all(self, urls, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, cb_kwargs=None, flags=None):
    # type: (...) -> Generator[Request, None, None]

    if not hasattr(urls, '__iter__'):
        raise TypeError("'urls' argument must be an iterable")
    return (
        self.follow(
            url=url,
            callback=callback,
            method=method,
            headers=headers,
            body=body,
            cookies=cookies,
            meta=meta,
            encoding=encoding,
            priority=priority,
            dont_filter=dont_filter,
            errback=errback,
            cb_kwargs=cb_kwargs,
            flags=flags,
        )
        for url in urls
    )
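
The shape of follow_all() (an iterability check followed by a lazy generator expression) can be tried without Scrapy. The sketch below is a simplification under stated assumptions: `follow_all_urls` is a hypothetical helper that yields absolute URL strings instead of Request objects, and it uses the standard library's `urljoin` where the real method calls `self.follow()`.

```python
from urllib.parse import urljoin


def follow_all_urls(base_url, urls):
    """Hypothetical helper: lazily yield an absolute URL for every item in urls."""
    if not hasattr(urls, "__iter__"):
        raise TypeError("'urls' argument must be an iterable")
    # Generator expression: nothing is resolved until the caller iterates.
    return (urljoin(base_url, url) for url in urls)


pages = follow_all_urls("https://example.com/list/", ["page2.html", "/about"])
print(list(pages))
# ['https://example.com/list/page2.html', 'https://example.com/about']
```

One thing worth noticing: a plain `str` also has `__iter__`, so it passes this check and would be iterated character by character. A single URL should therefore be wrapped in a list (or passed to follow() instead).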

replace([url, status, headers, body, request, flags, cls])

Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
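
The replace() pattern itself is small enough to sketch in isolation. `Page` and `TextPage` below are hypothetical classes invented for illustration; only the `setdefault` loop and the `cls` keyword follow the replace() source attached at the end.

```python
class Page:
    """Hypothetical class used only to illustrate the replace() pattern."""

    _fields = ("url", "status", "body")

    def __init__(self, url, status=200, body=b""):
        self.url = url
        self.status = status
        self.body = body

    def replace(self, **kwargs):
        # Every attribute the caller did not override is taken from self.
        for name in self._fields:
            kwargs.setdefault(name, getattr(self, name))
        # 'cls' lets the caller change the class of the returned object.
        cls = kwargs.pop("cls", self.__class__)
        return cls(**kwargs)


class TextPage(Page):
    pass


original = Page("https://example.com/", body=b"hello")
changed = original.replace(status=404)
converted = original.replace(cls=TextPage)
print(changed.status, changed.body, original.status, type(converted).__name__)
# 404 b'hello' 200 TextPage
```

The original object is left untouched, which is also why copy() in the attached source can simply be `self.replace()` with no arguments.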

urljoin(url)

Constructs an absolute url by combining the Response’s url with a possible relative url.

This is a wrapper over urllib.parse.urljoin(); it is merely an alias for calling urllib.parse.urljoin(response.url, url).
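
Since the method only delegates to the standard library, its behaviour can be checked without a Response at all. In this small sketch the `base` string is an assumed example value standing in for response.url.

```python
from urllib.parse import urljoin

base = "https://example.com/a/b.html"  # stands in for response.url

print(urljoin(base, "c.html"))               # https://example.com/a/c.html
print(urljoin(base, "/x"))                   # https://example.com/x
print(urljoin(base, "https://other.org/y"))  # https://other.org/y
```

A relative path replaces the last path segment, a root-relative path replaces the whole path, and an absolute URL is returned unchanged.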

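Response subclasses:

Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

TextResponse objects:

Class: scrapy.http.TextResponse(url[, encoding[, ...]])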

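class TextResponse(Response):
    # Excerpt only: the encoding property and the text, selector, xpath() and
    # css() members that the real class also defines are omitted here.

    _DEFAULT_ENCODING = 'ascii'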
    _cached_decoded_json = _NONE

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super().__init__(*args, **kwargs)

    def _set_url(self, url):
        if isinstance(url, str):
            self._url = to_unicode(url, self.encoding)
        else:
            super()._set_url(url)

    def _set_body(self, body):
        self._body = b''  # used by encoding detection
        if isinstance(body, str):
            if self._encoding is None:
                raise TypeError('Cannot convert unicode body - '
                                f'{type(self).__name__} has no encoding')
            self._body = body.encode(self._encoding)
        else:
            super()._set_body(body)

    def replace(self, *args, **kwargs):
        kwargs.setdefault('encoding', self.encoding)
        return Response.replace(self, *args, **kwargs)

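TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data such as images, sounds or any other media file.

TextResponse objects support a new __init__ method argument (encoding) in addition to those of the base Response object. The remaining functionality is the same as for the Response class.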
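
The effect of the extra encoding argument is easiest to see in how a body is stored. The function below is a hypothetical helper, not part of Scrapy; it only mirrors the str/bytes rule of the _set_body method shown in the excerpt above: a str body is encoded with the given encoding, and a str body without an encoding is rejected.

```python
def encode_body(body, encoding=None):
    """Hypothetical helper mirroring the str/bytes rule of _set_body."""
    if isinstance(body, str):
        if encoding is None:
            raise TypeError("Cannot convert str body without an encoding")
        return body.encode(encoding)
    if body is None:
        return b""
    if not isinstance(body, bytes):
        raise TypeError("body must be bytes or str")
    return body


print(encode_body("é", "utf-8"))    # b'\xc3\xa9'
print(encode_body("é", "latin-1"))  # b'\xe9'
```

The same text produces different bytes under different encodings, which is why the base Response (bytes only) cannot offer a text attribute on its own.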

HtmlResponse objects:

Class: scrapy.http.HtmlResponse(url[, ...])

The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovery support by looking into the HTML meta http-equiv attribute.
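
As a rough illustration of what "looking into the meta http-equiv attribute" means, the sketch below pulls the charset out of such a tag with a regular expression. This is a deliberately simplified assumption: the pattern and the helper name `html_declared_encoding` are made up for this post, and Scrapy's real detection is more thorough.

```python
import re

# Simplified pattern: a <meta http-equiv="Content-Type" ...> tag whose
# content attribute carries "charset=...".
_META_CHARSET = re.compile(
    rb"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*["']?([\w-]+)""",
    re.IGNORECASE,
)


def html_declared_encoding(body):
    """Hypothetical helper: return the declared charset, or None if absent."""
    match = _META_CHARSET.search(body[:4096])  # declarations sit near the top
    return match.group(1).decode("ascii").lower() if match else None


html = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=GB2312"></head></html>'
)
print(html_declared_encoding(html))  # gb2312
```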

XmlResponse objects:

Class: scrapy.http.XmlResponse(url[, ...])

The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovery support by looking into the XML declaration line.
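
The XML declaration line is the `<?xml ... ?>` header at the very start of a document. A minimal sketch of reading its encoding follows; the regular expression and the helper name `xml_declared_encoding` are assumptions for illustration only and do not reproduce Scrapy's implementation.

```python
import re

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?> line.
_XML_DECL = re.compile(rb"""^\s*<\?xml[^>]*encoding\s*=\s*["']([\w.-]+)["']""")


def xml_declared_encoding(body):
    """Hypothetical helper: return the declared encoding, or None if absent."""
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
print(xml_declared_encoding(doc))  # iso-8859-1
```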

Appendix: source code of the Response class:

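from typing import Generator
from urllib.parse import urljoin

from scrapy.exceptions import NotSupported
from scrapy.http.common import obsolete_setter
from scrapy.http.headers import Headers
from scrapy.http.request import Request
from scrapy.link import Link
from scrapy.utils.trackref import object_ref


class Response(object_ref):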

    def __init__(
        self,
        url,
        status=200,
        headers=None,
        body=b"",
        flags=None,
        request=None,
        certificate=None,
        ip_address=None,
        protocol=None,
    ):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)
        self.certificate = certificate
        self.ip_address = ip_address
        self.protocol = protocol

    @property
    def cb_kwargs(self):
        try:
            return self.request.cb_kwargs
        except AttributeError:
            raise AttributeError(
                "Response.cb_kwargs not available, this response "
                "is not tied to any request"
            )

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError(
                "Response.meta not available, this response "
                "is not tied to any request"
            )

    def _get_url(self):
        return self._url

    def _set_url(self, url):
        if isinstance(url, str):
            self._url = url
        else:
            raise TypeError(f'{type(self).__name__} url must be str, '
                            f'got {type(url).__name__}')

    url = property(_get_url, obsolete_setter(_set_url, 'url'))

    def _get_body(self):
        return self._body

    def _set_body(self, body):
        if body is None:
            self._body = b''
        elif not isinstance(body, bytes):
            raise TypeError(
                "Response body must be bytes. "
                "If you want to pass unicode body use TextResponse "
                "or HtmlResponse.")
        else:
            self._body = body

    body = property(_get_body, obsolete_setter(_set_body, 'body'))

    def __str__(self):
        return f"<{self.status} {self.url}>"

    __repr__ = __str__

    def copy(self):
        """Return a copy of this Response"""
        return self.replace()


    def replace(self, *args, **kwargs):
        """Create a new Response with the same attributes except for those
        given new values.
        """
        for x in [
            "url", "status", "headers", "body", "request", "flags", "certificate", "ip_address", "protocol",
        ]:
            kwargs.setdefault(x, getattr(self, x))
        cls = kwargs.pop('cls', self.__class__)
        return cls(*args, **kwargs)


    def urljoin(self, url):
        """Join this Response's url with a possible relative url to form an
        absolute interpretation of the latter."""
        return urljoin(self.url, url)


    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as str
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def follow(self, url, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, cb_kwargs=None, flags=None):
        # type: (...) -> Request
        """
        Return a :class:`~.Request` instance to follow a link ``url``.
        It accepts the same arguments as ``Request.__init__`` method,
        but ``url`` can be a relative URL or a ``scrapy.link.Link`` object,
        not only an absolute URL.

        :class:`~.TextResponse` provides a :meth:`~.TextResponse.follow`
        method which supports selectors in addition to absolute/relative URLs
        and Link objects.

        .. versionadded:: 2.0
           The *flags* parameter.
        """
        if isinstance(url, Link):
            url = url.url
        elif url is None:
            raise ValueError("url can't be None")
        url = self.urljoin(url)

        return Request(
            url=url,
            callback=callback,
            method=method,
            headers=headers,
            body=body,
            cookies=cookies,
            meta=meta,
            encoding=encoding,
            priority=priority,
            dont_filter=dont_filter,
            errback=errback,
            cb_kwargs=cb_kwargs,
            flags=flags,
        )


    def follow_all(self, urls, callback=None, method='GET', headers=None, body=None,
                   cookies=None, meta=None, encoding='utf-8', priority=0,
                   dont_filter=False, errback=None, cb_kwargs=None, flags=None):
        # type: (...) -> Generator[Request, None, None]
        """
        .. versionadded:: 2.0

        Return an iterable of :class:`~.Request` instances to follow all links
        in ``urls``. It accepts the same arguments as ``Request.__init__`` method,
        but elements of ``urls`` can be relative URLs or :class:`~scrapy.link.Link` objects,
        not only absolute URLs.

        :class:`~.TextResponse` provides a :meth:`~.TextResponse.follow_all`
        method which supports selectors in addition to absolute/relative URLs
        and Link objects.
        """
        if not hasattr(urls, '__iter__'):
            raise TypeError("'urls' argument must be an iterable")
        return (
            self.follow(
                url=url,
                callback=callback,
                method=method,
                headers=headers,
                body=body,
                cookies=cookies,
                meta=meta,
                encoding=encoding,
                priority=priority,
                dont_filter=dont_filter,
                errback=errback,
                cb_kwargs=cb_kwargs,
                flags=flags,
            )
            for url in urls
        )
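
One detail in the listing above is easy to miss: meta and cb_kwargs are not stored on the response. They are read-only properties that delegate to the attached request, and they raise AttributeError when no request is attached. The sketch below reproduces just that delegation; `FakeRequest` and `FakeResponse` are hypothetical classes written for illustration.

```python
class FakeRequest:
    """Hypothetical request object that only carries a meta dict."""

    def __init__(self, meta=None):
        self.meta = dict(meta or {})


class FakeResponse:
    """Hypothetical response that delegates meta to its request."""

    def __init__(self, request=None):
        self.request = request

    @property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            # request is None, so there is nothing to delegate to.
            raise AttributeError(
                "meta not available, this response is not tied to any request"
            ) from None


tied = FakeResponse(FakeRequest({"depth": 2}))
print(tied.meta)                       # {'depth': 2}
print(tied.meta is tied.request.meta)  # True: the same dict, not a copy

orphan = FakeResponse()
try:
    orphan.meta
except AttributeError as exc:
    print("error:", exc)
```

Because the dict is shared, anything a callback writes into response.meta is also visible through the request.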
