The requests Library: Web Scraping

遗忘了呵呵

Article link: https://blog.csdn.net/sinat_36802840/article/details/70140698

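Introduction to the requests library
Official documentation: the requests Quickstart
It is very detailed, and reading the official documentation is recommended.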

Quickstart

Import the requests library

import requests

Sending requests:

r = requests.get("http://httpbin.org/get")
r = requests.post("http://httpbin.org/post")
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")

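All of the above are HTTP request methods; GET and POST are the ones used most often.
POST is generally used to send data to the server in order to get content back.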

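Passing URL parameters
You often want to send some data in the URL's query string. If you build the URL by hand, the data appears as key/value pairs after a question mark, for example httpbin.org/get?key=val.
Requests lets you supply these arguments as a dictionary through the params keyword argument. For example, to pass key1=value1 and key2=value2 to httpbin.org/get, you can use the following code:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
Printing the URL shows that it has been encoded correctly:

print(r.url)
http://httpbin.org/get?key1=value1&key2=value2
Note that any dictionary key whose value is None is not added to the URL's query string.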

You can also pass a list as a value:

>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2&key2=value3
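To check how params is encoded without sending anything over the network, one option is to prepare the request and inspect its URL. This is a minimal sketch; build_url is a hypothetical helper name, not part of requests, and key3 is an extra key added here only to show that None values are skipped.

```python
import requests


def build_url(base_url, params):
    """Return the URL requests would use for a GET with these params."""
    # Request(...).prepare() builds the request but does not send it.
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


# A list value is expanded into repeated keys; a None value is dropped.
payload = {"key1": "value1", "key2": ["value2", "value3"], "key3": None}
print(build_url("http://httpbin.org/get", payload))
```

Preparing the request is handy when debugging a scraper, because the final URL can be logged before any traffic is sent.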

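Custom headers
When scraping web resources, the server can often tell quite easily that the client is a crawler and then refuse access. In that case we usually make the crawler imitate a browser by customizing the headers.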

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)
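One reason a server can spot a script is the default User-Agent that requests sends. The sketch below compares that default with a custom value by preparing a request without sending it; the URL and the my-app/0.0.1 string are simply the placeholders from the example above.

```python
import requests
from requests.utils import default_user_agent

url = "https://api.github.com/some/endpoint"
headers = {"user-agent": "my-app/0.0.1"}

# The User-Agent requests uses when none is supplied.
default_ua = default_user_agent()
print("default:", default_ua)

# Header names are case-insensitive, so either spelling looks up the same value.
prepared = requests.Request("GET", url, headers=headers).prepare()
print("custom: ", prepared.headers["User-Agent"])
```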

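Response content

import requests
r = requests.get('https://api.github.com/events')
r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...

# Alternatively, use .content to access the response body as bytes
r.content
# Requests automatically decodes gzip and deflate transfer-encodings for you.

Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly.
After a request is made, Requests makes an educated guess about the response's encoding based on the HTTP headers. When you access r.text, Requests uses that guessed text encoding. You can find out which encoding Requests is using, and change it, through the r.encoding attribute: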

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
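Conceptually, r.text is r.content decoded with r.encoding, so a wrong guess produces garbled text. The sketch below reproduces that effect with plain bytes and no network access; the sample string is made up for illustration.

```python
# Bytes as a server might send them for a UTF-8 page.
raw = "requests 中文".encode("utf-8")

# Decoding with the right encoding recovers the original text.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 never fails, but every byte becomes
# one character, so the multi-byte Chinese characters turn into mojibake.
as_latin1 = raw.decode("ISO-8859-1")

print(as_utf8)
print(len(raw), len(as_utf8), len(as_latin1))
```

If a scraped page shows this kind of garbling, setting r.encoding to the page's real encoding before reading r.text is usually the fix.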

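The POST method
Many sites now require a login before they can be scraped, so we have to send some data to the server before we can get the content.
To do this, simply pass a dictionary to the data argument. The dictionary is automatically form-encoded when the request is made: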

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
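The form encoding can also be inspected locally. As a rough sketch, preparing the POST request without sending it exposes the encoded body and the Content-Type header that requests sets for a dictionary passed as data.

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# Build the request but do not send it.
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data=payload
).prepare()

print(prepared.body)                     # the form-encoded request body
print(prepared.headers["Content-Type"])  # set automatically for dict data
```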

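Sessions
A Session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance, and uses urllib3's connection pooling along the way.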

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
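A session can carry default headers as well as cookies. The following sketch sets both by hand and prepares a request to show what would be sent; the x-test header and the manually set cookie are stand-ins for values a real login would provide.

```python
import requests

s = requests.Session()
s.headers.update({"x-test": "true"})         # default header for every request
s.cookies.set("sessioncookie", "123456789")  # stand-in for a server-set cookie

# prepare_request merges the session's headers and cookies into the request
# without sending it.
prepared = s.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
print(prepared.headers["x-test"])
print(prepared.headers["Cookie"])
```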

Response status codes
We can check the response status code:

>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
>>> r.status_code == requests.codes.ok  # built-in status code lookup object
True

If we made a bad request (a 4XX client error or a 5XX server error response), we can raise an exception with Response.raise_for_status():

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404

>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
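In a scraper it is usually more convenient to catch the exception than to let it end the program. The sketch below wraps raise_for_status() in a hypothetical helper named is_ok; the hand-built Response object is only an offline stand-in for a real server reply.

```python
import requests


def is_ok(response):
    """Return True for a successful response, False for a 4XX/5XX one."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        print("request failed:", exc)
        return False
    return True


# Offline stand-in: a Response whose status code is set by hand.
fake = requests.Response()
fake.status_code = 404
print(is_ok(fake))
```

With a live request the same helper can be called as is_ok(requests.get(url)).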

Response headers

>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}

Cookies

# If a response contains cookies, you can access them quickly:
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'

# To send your own cookies to the server, use the cookies parameter:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

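To save cookies to disk, the cookielib library is commonly used (it is named http.cookiejar in Python 3).

import requests
from http import cookiejar  # called cookielib in Python 2

# Set up the cookie jar
session = requests.Session()
cookie_file = 'temp/cookie.txt'
# filename is the path where the cookies are stored once they have been obtained
session.cookies = cookiejar.LWPCookieJar(filename=cookie_file)
# On later runs, load the saved cookies instead of entering the account and password again
# (load() raises an error if the file does not exist yet)
session.cookies.load(ignore_discard=True)

# After the cookies have been obtained, for example after logging in
# ignore_discard=True also keeps session cookies that have no expiry date
session.cookies.save(ignore_discard=True)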
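The save and load round trip can be tried without a real login. This is a sketch under a few assumptions: the cookie is created by hand with requests.cookies.create_cookie as a stand-in for one a server would set, and a temporary directory replaces the temp/cookie.txt path.

```python
import os
import tempfile
from http import cookiejar

import requests

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")

    session = requests.Session()
    session.cookies = cookiejar.LWPCookieJar(filename=path)
    # Stand-in for a cookie the server would normally set after login.
    session.cookies.set_cookie(
        requests.cookies.create_cookie(
            "sessioncookie", "123456789", domain="httpbin.org"
        )
    )
    # The cookie has no expiry date, so ignore_discard=True is needed to keep it.
    session.cookies.save(ignore_discard=True)

    # A fresh jar, as a later run of the program would create.
    restored = cookiejar.LWPCookieJar(filename=path)
    restored.load(ignore_discard=True)
    loaded = {cookie.name: cookie.value for cookie in restored}

print(loaded)
```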

Timeouts

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
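A crawler normally wants to survive a timeout instead of crashing. The sketch below catches the exception and returns None; fetch_or_none is a hypothetical helper, and its get parameter exists only so the function can be exercised without network access. The timeout may also be a (connect, read) tuple, as used here.

```python
import requests


def fetch_or_none(url, timeout=(3.05, 10), get=requests.get):
    """Return the response, or None if the request timed out."""
    try:
        # timeout=(connect, read): seconds to wait for each phase.
        return get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return None


def always_times_out(url, timeout):
    """Offline stand-in for requests.get that simulates a slow server."""
    raise requests.exceptions.Timeout("simulated timeout")


print(fetch_or_none("http://github.com", get=always_times_out))
```

Calling fetch_or_none(url) without the get argument performs a real request through requests.get.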

Proxies
If you need to use a proxy, you can configure individual requests by passing the proxies argument to any request method:

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

You can also configure proxies through the HTTP_PROXY and HTTPS_PROXY environment variables.

$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python
>>> import requests
>>> requests.get("http://example.org")

To set a proxy for a specific scheme or host, use scheme://hostname as the key. It is matched against the given scheme and the exact hostname.

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}
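To see which entry would apply to a given URL, the select_proxy helper in requests.utils can be called directly. It is the lookup requests appears to use internally, so treat this as an illustrative sketch and not a documented workflow; the addresses are the placeholders from the examples above.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://10.10.1.10:3128",                # any other http:// URL
    "http://10.20.1.128": "http://10.10.1.10:5323",  # this exact host only
}

# The scheme://hostname key is tried before the plain scheme key.
specific = select_proxy("http://10.20.1.128/path", proxies)
general = select_proxy("http://example.org/", proxies)
# No https entry is configured, so no proxy is selected.
missing = select_proxy("https://example.org/", proxies)

print(specific, general, missing)
```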

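SOCKS
Besides basic HTTP proxies, Requests also supports proxies that use the SOCKS protocol. This is an optional feature, and it requires additional third-party libraries before use.
You can get the dependencies with pip: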

$ pip install requests[socks]

Usage:

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}

A combined example

import requests
url = 'https://www.zhihu.com/' 

headers =  {
        'Accept': '*/*',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.zhihu.com/',
        'Accept-Language': 'en-GB,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        'Accept-Encoding': 'gzip, deflate, br',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
        'Host': 'www.zhihu.com'
        }

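# A dict keeps only one value per key, so list one proxy per scheme
proxies = {
    "http": "http://116.28.206.126:8998",
    "https": "http://182.240.62.187:8998"
}

r = requests.get(url, timeout=10, proxies=proxies, headers=headers)

# Placeholder login values; the original post does not show where they come from
_xsrf = ''                  # anti-CSRF token, normally taken from the login page
account_name = 'email'      # assumed name of the account form field
username = 'your_account'
password = 'your_password'

post_data = {
    '_xsrf': _xsrf,
    account_name: username,
    'password': password,
    'remember_me': 'true',
}
r1 = requests.post(url, data=post_data, timeout=10, proxies=proxies, headers=headers).content.decode('utf8')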

s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'