Python Crawler Study Notes: Writing a Crawler with the requests Library (1)

ATM246800

Categories: python, requests, study notes

Article link: https://blog.csdn.net/atm246800/article/details/51376354

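First, thanks to http://python.jobbole.com . Reading the articles on that site is what gave me the idea for this post. I only started learning Python recently, and this post simply records the problems I ran into along the way, written as I learn. It is my first post of this kind, so mistakes are inevitable; if you spot one, please point it out and I will be grateful.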

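That said, while I was learning I found very few articles online about writing crawlers with requests, or at least few satisfying ones. Most people use urllib2 (Python 2) or urllib.request (its Python 3 counterpart). As I see it, requests wraps up the fiddly parts of urllib behind a cleaner, more readable interface.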

I won't go into downloading and installing requests; a quick web search turns up plenty of articles on that.

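The official requests documentation and its Chinese translation are here. The translation is a bit stiff in places but generally understandable, and it is the manual I felt my way along with: http://cn.python-requests.org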

Once requests is installed, open api.py to see which interfaces it provides.

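To start, here is the definition of request from api.py, placed here for easy reference and to help explain the functions that follow. Don't worry if it doesn't make sense yet; it needs to be read together with those functions.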

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

Let's start with get, which is defined as follows:

def get(url, params=None, **kwargs):
    """Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

The get function sends a GET request to the server, just as a browser does when you visit a page. It takes the following parameters:

url: the address to request, e.g. http://www.acfun.tv. For example, r = requests.get(url='http://www.acfun.tv') sends a request to that URL and assigns the returned response to r, which you can then work with.

params: the query parameters. The docstring of request marks params as (optional) and says it should be a dictionary or bytes; it is sent in the query string of the URL. (More on this later.)

**kwargs: other optional arguments such as timeout, data, json and so on; these are simply the optional parameters of request. (Again, details later.)
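To make the params argument concrete, here is a minimal sketch that shows how a dictionary ends up in the query string. It does not send anything over the network: it builds a requests.Request and calls prepare() to see the final URL. The helper name build_get_url and the httpbin.org address with its key/page parameters are my own illustrative choices, not something from the requests source quoted above.

```python
import requests


def build_get_url(url, params=None):
    """Return the URL a GET with these params would be sent to (no network I/O)."""
    # Request(...).prepare() encodes params into the query string,
    # which is the same preparation step a real requests.get() goes through.
    prepared = requests.Request('GET', url, params=params).prepare()
    return prepared.url


# Hypothetical parameters, chosen only for illustration.
example_url = build_get_url('http://httpbin.org/get', {'key': 'value', 'page': 2})
print(example_url)  # http://httpbin.org/get?key=value&page=2
```

Calling requests.get(url, params={...}) goes through the same encoding before the request is sent, so this is a convenient way to check what a crawler is about to ask for.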

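At this point things may get confusing: what exactly does request return? api.py shows that request returns session.request(...), which sends you off to sessions.py, and it only gets more involved from there. Put simply, the return value is like a container that holds what the server sent back, such as the page's HTML and the response headers. (It is just a Response, of course, which I only realized after digging through the source.) A crawler's job is to filter the information it wants out of that. If you have the time, reading the source piece by piece will deepen your understanding, but it takes a while.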

Theory alone makes the head spin, so here is a simple example to make things concrete:

import requests

url = 'http://www.acfun.tv'
r = requests.get(url)
print(r.content)  # use content (raw bytes) here because of encoding issues
print(r)

# Output (the print(r.content) result is very long, so only the beginning is shown):
# b'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head><.......
# <Response [200]>

Clearly, r.content is the page's source, so get() lets us pull down an entire web page. But what is r? It prints as <Response [200]>. What does that mean?

Let's check the type of r:

import requests

url = 'http://www.acfun.tv'
r = requests.get(url)
print(type(r))

# Output: <class 'requests.models.Response'>

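So r is an instance of the Response class defined in models.py of the requests package; the 200 in its printed form is the HTTP status code of the response.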

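The source of Response lists its attributes and methods, and the content used above is one of them. I won't reproduce it here, since the class definition is fairly long.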

Two members deserve attention here: content and text. In the source both are defined as properties: content returns the instance variable self._content, while text returns

content = str(self.content, encoding, errors='replace')
return content

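Checking their types shows that content is bytes and text is str, decoded automatically using the response's encoding. This matters: in Python 3, str has no decode() method (only bytes does), so text is usually the recommended choice, though it depends on the situation. The point of getting this straight is to avoid type-mismatch problems later when filtering with regular expressions, something anyone who has used urllib has probably run into. Python 2.7 in particular mixes byte strings and Unicode strings, whereas in Python 3 str is always Unicode and UTF-8 is the default source encoding. For background on encodings, see http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html.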
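The bytes-versus-str difference, and the regular-expression mismatch it can cause, can be sketched without touching the network. Note the assumptions: the Response below is built by hand and its private _content attribute is set directly, which is only a trick for this demo and not how you would normally use requests; the HTML snippet is made up.

```python
import re

import requests

# Assumption: a hand-built Response stands in for what requests.get() returns.
resp = requests.models.Response()
resp.status_code = 200
resp._content = '<title>你好</title>'.encode('utf-8')  # private attribute, set only for this demo
resp.encoding = 'utf-8'

print(type(resp.content))  # <class 'bytes'>
print(type(resp.text))     # <class 'str'>

# A str pattern goes with text, a bytes pattern goes with content.
title_from_text = re.search(r'<title>(.*?)</title>', resp.text).group(1)
title_from_bytes = re.search(rb'<title>(.*?)</title>', resp.content).group(1)

# Mixing the two is the type mismatch mentioned above: re raises TypeError.
try:
    re.search(r'<title>(.*?)</title>', resp.content)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(title_from_text, title_from_bytes, mixed_ok)
```

With text you get the title as a str ready for further string handling, while the bytes match still has to be decoded by hand.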

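Finally, a brief look at some other commonly used members (the ones I left out are those I'm still unfamiliar with; I'll fill them in over time):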

url: the final URL of the response

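raw: the raw, file-like response stream from the server, intended for advanced use (it requires stream=True on the request); you rarely need to read it directly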

headers: the response headers. I use Firefox, where pressing F12 lets you view these same headers

encoding: the encoding used to decode text; it can be changed

cookies: the cookies the server sent back, which it uses to record and recognize your client
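The members listed above can be inspected on a real response. So that the sketch does not depend on any external site, it starts a tiny throwaway HTTP server on 127.0.0.1 using the standard library and fetches from it. Everything about that server (the DemoHandler class, the page body, the visitor cookie) is hypothetical and exists only for this illustration; fetch_demo_page is likewise my own helper name.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one small UTF-8 HTML page."""

    def do_GET(self):
        body = '<html><body>你好, requests</body></html>'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.send_header('Set-Cookie', 'visitor=abc123')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_demo_page():
    """Start the local server, GET its page once, then shut the server down."""
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: let the OS pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = 'http://127.0.0.1:%d/' % server.server_port
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings from the environment
            return session.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()


r = fetch_demo_page()
print(r.url)                      # the URL that was fetched
print(r.status_code)              # HTTP status code
print(r.headers['Content-Type'])  # one of the response headers
print(r.encoding)                 # encoding taken from the Content-Type header
print(r.cookies.get('visitor'))   # cookie set by the demo server
print(r.text)                     # body decoded with r.encoding
```

Against a real site you would simply call requests.get(url) and read the same attributes; the local server is only there to keep the example self-contained.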

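In short, get in requests takes the arguments you supply, sends a request to the server much like a browser would, and gives you back a response. That response includes the page's source, and with the source in hand you can extract whatever information you want from the page. This post only gives a first look at get; requests has many similar functions, such as post and put, which I am working through one by one. These are personal notes, so some of my understanding may be wrong or shallow. Having written this far, I wonder whether my pace is too slow: it took a whole evening to cover just request and get. But reading the source made problems I had been running into suddenly much clearer.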