Basics of the Requests Package

Latest recommended article published on 2024-05-12 20:43:28

love_ccccy

Categories: Crawler, Reference. Tags: crawler

Article link: https://blog.csdn.net/love_ccccy/article/details/95761898

I. Basic methods of the Requests library

The two most important objects:
request
response

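1.1 Request methods

requests.request(): constructs a request; the base method that underpins all of the methods below
requests.get(): the main method for fetching an HTML page; corresponds to HTTP GET
requests.head(): fetches the header information of an HTML page; corresponds to HTTP HEAD
requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE

Note: the six methods after requests.request() are thin wrappers around it, so in effect the Requests library has only one underlying method, request().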
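
The relationship between the verb-specific helpers and the single underlying request can be sketched offline. The snippet below builds (but never sends) one request per HTTP verb through the same code path; the helper name build and the /anything URL are assumptions made for illustration only.

```python
import requests

def build(method, url, **kwargs):
    """Prepare a request without sending it (hypothetical helper)."""
    return requests.Request(method, url, **kwargs).prepare()

methods = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]
prepared = {m: build(m, "http://httpbin.org/anything") for m in methods}

# Every verb goes through the same construction; only the method differs.
for name, req in prepared.items():
    print(name, req.method, req.url)
```

In the same spirit, requests.get(url) behaves like requests.request('GET', url); the helpers only fix the method argument.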

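1.2 Response object attributes

r.status_code: the returned status code; 200 means success
r.text: the HTTP response body as a string, i.e. the content of the page at the URL
r.content: the HTTP response body as bytes
r.json(): the HTTP response body parsed as JSON
r.raw: the raw HTTP response stream
r.url: the URL of the request
r.encoding: the response encoding guessed from the HTTP headers; if the headers contain no charset, the encoding is assumed to be ISO-8859-1
r.apparent_encoding: the response encoding inferred from the content itself (a fallback encoding)
r.raise_for_status(): raises an exception for a failed request (a 4xx or 5xx response)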
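
The attributes above can be explored without network access. The following is a minimal sketch that builds a Response by hand; assigning the private _content attribute is an assumption borrowed from test code, not a documented way to use the library, and the JSON payload is made up.

```python
import requests

# Build a Response by hand so the example runs offline.
# Assumption: setting the private _content attribute is a testing trick only.
r = requests.Response()
r.status_code = 200
r.url = "http://httpbin.org/get"
r.headers["Content-Type"] = "application/json; charset=utf-8"
r.encoding = "utf-8"
r._content = '{"name": "germey", "city": "Zürich"}'.encode("utf-8")

print(r.status_code)    # integer status code
print(r.content[:10])   # raw bytes
print(r.text)           # bytes decoded with r.encoding
data = r.json()         # body parsed as JSON (a dict here)
r.raise_for_status()    # does nothing for a 2xx status

# A wrong encoding garbles non-ASCII text; correcting it repairs r.text.
r.encoding = "ISO-8859-1"
garbled = r.text
r.encoding = "utf-8"
fixed = r.text
print(garbled, fixed)
```

Because r.text is decoded from r.content each time it is read, changing r.encoding takes effect on the next access.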

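1.3 Exceptions in the Requests library

requests.ConnectionError: network connection error, e.g. DNS lookup failure or connection refused
requests.HTTPError: HTTP error
requests.URLRequired: a URL is missing
requests.TooManyRedirects: the maximum number of redirects was exceeded
requests.ConnectTimeout: timed out while connecting to the remote server
requests.Timeout: the request to the URL timed out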

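1.4 The most commonly used method

The requests.get() method

Basic form: requests.get(url, params=None, **kwargs)

~url: the URL of the page to fetch
~params: extra parameters for the URL, as a dictionary or bytes; they do not need to be encoded beforehand; optional
~**kwargs: 12 parameters that control the request

import requests
data = {"age": "22", "name": "germey"}
response = requests.get("http://httpbin.org/get", params=data)  # pass parameters in the URL
print(response.text)

Note: Requests guesses the encoding and decodes the body accordingly. HTTP and XML can, however, declare their own encoding inside the body; in that case, read r.content (the raw bytes) to find the declared encoding, then set r.encoding (for example r.encoding = 'utf8') so that r.text is decoded correctly.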
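
To see how params ends up in the URL without sending anything, a request can be prepared locally. This is a small sketch; the tag parameter in the second request is a made-up example.

```python
import requests

data = {"age": "22", "name": "germey"}
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()
print(prepared.url)

# Values are encoded automatically; a list value yields a repeated key.
multi = requests.Request(
    "GET", "http://httpbin.org/get", params={"tag": ["a b", "c"]}
).prepare()
print(multi.url)
```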

1.5 The request method

requests.request(method, url, **kwargs)

Docstring: Constructs and sends a :class:`Request <Request>`.
method: one of "GET", "POST", "HEAD", "PUT", "PATCH", "DELETE"
url: the target URL
**kwargs: optional parameters that control the request

:param method: method for the new :class:`Request` object.
:param url: URL for the new :class:`Request` object.

The **kwargs parameters that control the request:

:param params: (optional) Dictionary, list of tuples or bytes to send
    in the query string for the :class:`Request`.
    A dictionary or byte sequence, added to the URL as query parameters

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)
# https://python123.io/ws?key1=value1&key2=value2

:param data: (optional) Dictionary, list of tuples, bytes, or file-like
    object to send in the body of the :class:`Request`.
    A dictionary, byte sequence, or file object, sent as the body of the request

kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('POST', 'http://python123.io/ws', data=kv)
body = 'body content'
r = requests.request('POST', 'http://python123.io/ws', data=body)

:param json: (optional) A JSON serializable Python object to send in the
 body of the :class:`Request`.
Data in JSON format, sent as the body of the request
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('POST', 'http://python123.io/ws', json=kv)
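
The difference between data= and json= is easiest to see in the prepared request body. The sketch below prepares both variants locally and sends nothing; it reuses the kv dictionary and URL from the examples above.

```python
import json
import requests

kv = {"key1": "value1", "key2": "value2"}
url = "http://python123.io/ws"

form = requests.Request("POST", url, data=kv).prepare()
as_json = requests.Request("POST", url, json=kv).prepare()

# data= produces a form-encoded body, json= produces a JSON body.
print(form.headers["Content-Type"], form.body)
print(as_json.headers["Content-Type"], as_json.body)
```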

:param headers: (optional) Dictionary of HTTP Headers to send with the
 :class:`Request`.
A dictionary of custom HTTP headers
hd = {'User-Agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)

:param cookies: (optional) Dict or CookieJar object to send with the
 :class:`Request`.

:param files: (optional) Dictionary of ``'name': file-like-objects``
 (or ``{'name': file-tuple}``) for multipart encoding upload.
    ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 
    3-tuple ``('filename', fileobj, 'content_type')``
    or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
    defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
    to add for the file.
    A dictionary, used to upload files

fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)

:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
A tuple; enables HTTP authentication

:param timeout: (optional) How many seconds to wait for the server to send data
    before giving up, as a float, or a :ref:`(connect timeout, read
    timeout) <timeouts>` tuple.
:type timeout: float or tuple
Sets the timeout, in seconds

:param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
:type allow_redirects: bool
True/False, defaults to True; the redirect switch

:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
A dictionary that sets the proxy servers to use; login credentials can be included

pxs = {'http': 'http://user:pass@10.10.10.1:1234',
       'https': 'https://10.10.10.1:4321'}
r = requests.request('GET', 'http://python123.io/ws', proxies=pxs)

:param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
       True/False, defaults to True; the switch for verifying the SSL certificate

:param stream: (optional) if ``False``, the response content will be immediately downloaded.
True/False, defaults to False; when False the response body is downloaded immediately, when True the download is deferred

:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
Path to a local SSL client certificate
:return: :class:`Response <Response>` object
:rtype: requests.Response

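II. Some concrete operations

Custom request headers

Simply pass a dict to the headers parameter. Requests does not change its behavior based on which custom headers are specified; the headers are simply passed on into the final request.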

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

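    Note: custom headers are given lower precedence than more specific sources of information. For example:
    Authorization set with headers= is overridden if credentials are specified in .netrc, which in turn are overridden by the auth= parameter.
    Authorization headers are removed if you are redirected to a different host.
    Proxy-Authorization headers are overridden by proxy credentials provided in the URL.
    Content-Length headers are overridden when the length of the content can be determined.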
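
How custom headers combine with the library defaults can be inspected on a prepared request. The sketch below sends nothing; the User-Agent value my-crawler/0.1 is a hypothetical placeholder.

```python
import requests

session = requests.Session()
custom = {"User-Agent": "my-crawler/0.1"}  # hypothetical value
req = requests.Request("GET", "http://httpbin.org/headers", headers=custom)

# prepare_request() merges the session's default headers with the custom ones.
prepared = session.prepare_request(req)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The custom User-Agent replaces the default one, while defaults such as Accept-Encoding are kept.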

POST requests

Typically you want to send some form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data parameter; the dictionary is form-encoded automatically when the request is made:

import requests
data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
response = requests.post("http://httpbin.org/post", data=data, headers=headers)
print(response.json())

To upload a file, use the following approach:

import requests
url = 'http://httpbin.org/post'
files = {'file': open('filename', 'rb')}  # 'rb' opens the file in binary mode
r = requests.post(url, files=files)
print(r.text)
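
What a file upload looks like on the wire can be checked offline by preparing the request. In this sketch an in-memory io.BytesIO stands in for a real file; the filename report.txt and its contents are made up for illustration.

```python
import io
import requests

fake_file = io.BytesIO(b"hello, world")  # stands in for open('filename', 'rb')
files = {"file": ("report.txt", fake_file, "text/plain")}
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

# files= switches the request to multipart/form-data encoding.
content_type = prepared.headers["Content-Type"]
print(content_type)
print(prepared.body.decode("utf-8"))
```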

Parsing JSON data

import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())  # response.json() is equivalent to json.loads(response.text)
print(json.loads(response.text))
print(type(response.json()))

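Response status codes (each status code indicates a different response state)

import requests
r = requests.get('http://httpbin.org/get')
print(r.status_code)
if r.status_code == requests.codes.ok:  # requests.codes.ok is 200
    print("Request succeeded")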
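
The named constants in requests.codes and the behavior of raise_for_status() can be tried offline. This is a minimal sketch; the describe helper is hypothetical, and the Response is built by hand instead of coming from a server.

```python
import requests

print(requests.codes.ok, requests.codes.not_found)

def describe(status_code):
    """Map a status code to a coarse category (hypothetical helper)."""
    if status_code == requests.codes.ok:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"

# Build a Response by hand to show raise_for_status() without a network call.
r = requests.Response()
r.status_code = requests.codes.not_found
try:
    r.raise_for_status()
    raised = False
except requests.HTTPError as exc:
    raised = True
    print(exc)
```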

Response headers (HTTP header information)

r.headers
{
'content-encoding': 'gzip',
'transfer-encoding': 'chunked',
'connection': 'close',
'server': 'nginx/1.0.4',
'x-runtime': '148ms',
'etag': '"e1ca502697e5c9317743dc078f67693f"',
'content-type': 'application/json'
}
r.headers.get('content-type')  # get one specific header: 'application/json'
# To see the headers that were sent to the server, access the request object and then its headers
r.request.headers
{'Accept-Encoding': 'identity, deflate, compress, gzip',
'Accept': '*/*', 'User-Agent': 'python-requests/0.13.1'}
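
Response headers are stored in a case-insensitive dictionary, which is why the lookup above works with lowercase names. The sketch below uses that class directly, with header values copied from the sample output.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict(
    {"Content-Type": "application/json", "Server": "nginx/1.0.4"}
)

# Lookups ignore case; iteration keeps the original spelling of each key.
print(headers["content-type"])
print(headers.get("CONTENT-TYPE"))
print(headers.get("x-missing", "not present"))
print(list(headers))
```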

Getting cookies, and sending cookies to the server with the cookies parameter

import requests
response = requests.get("URL")
print(response.cookies)  # get the cookies
# print(response.cookies['example_cookie_name'])  # get one specific cookie
for key, value in response.cookies.items():  # iterate over each cookie's name and value
    print(key + '=' + value)
cookies = {'cookies_are': 'working'}  # cookies to send
response = requests.get('http://httpbin.org/cookies', cookies=cookies)
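
The Cookie header that results from the cookies parameter can be inspected on a prepared request, with nothing sent. The sketch also uses a RequestsCookieJar to scope cookies by domain and path; the cookie names in the jar are illustrative.

```python
import requests

cookies = {"cookies_are": "working"}
prepared = requests.Request(
    "GET", "http://httpbin.org/cookies", cookies=cookies
).prepare()
print(prepared.headers["Cookie"])

# A cookie jar can restrict each cookie to a domain and path.
jar = requests.cookies.RequestsCookieJar()
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")
scoped = requests.Request(
    "GET", "http://httpbin.org/cookies", cookies=jar
).prepare()
print(scoped.headers["Cookie"])
```

Only the cookie whose path matches the requested URL is attached to the second request.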

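Session persistence

'''
One use of cookies is to simulate logging in and to maintain a session, so that a simulated login behaves as if it stays within one browser page.
A Session keeps the cookies it receives and sends them with later requests. In addition, when several requests go to the same host, the underlying TCP connection is reused, which can bring a significant performance gain.
'''
import requests
s = requests.Session()  # a Session persists cookies and settings across requests, which keeps a simulated login alive
s.get("http://httpbin.org/cookies/set/sessioncookie/123456")
response = s.get("http://httpbin.org/cookies")
print(response.text)
# A session can be used as a context manager, which ensures it is closed when the with block exits, even if an exception occurred.
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')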
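
Session persistence can be observed offline by preparing several requests from one session. In this sketch the cookie is set by hand rather than received from a server, and the User-Agent value is a hypothetical placeholder.

```python
import requests

with requests.Session() as s:
    s.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical value
    # Assumption: set the cookie manually instead of receiving it from a server.
    s.cookies.set("sessioncookie", "123456789", domain="httpbin.org", path="/")

    first = s.prepare_request(requests.Request("GET", "http://httpbin.org/cookies"))
    second = s.prepare_request(requests.Request("GET", "http://httpbin.org/get"))

# Both requests carry the session's cookie and default headers.
print(first.headers["Cookie"], second.headers["Cookie"])
print(first.headers["User-Agent"])
```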

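Setting a timeout

"""
Tell requests to stop waiting for a response after the number of seconds given by the timeout parameter. If the server does not respond within timeout seconds, an exception is raised.
"""
import requests
response = requests.get('http://www.google.com.hk', timeout=0.01)
print(response.url)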
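
Timeout handling can be tried without an external site. The sketch below assumes a local socket that accepts connections but never answers, so the read timeout fires; the (connect, read) tuple values are arbitrary, and trust_env is disabled so that proxy environment variables do not interfere.

```python
import socket
import requests

# A local socket that accepts connections but never replies.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
silent.listen(1)
port = silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

try:
    # timeout=(connect timeout, read timeout), both in seconds
    session.get(f"http://127.0.0.1:{port}/", timeout=(1, 0.5))
    outcome = "response received"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
finally:
    session.close()
    silent.close()

print(outcome)
```

Catching requests.exceptions.Timeout instead would cover both the connect and the read case.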

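Exceptions

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.
Exception: Description
requests.ConnectionError: network connection error, e.g. DNS lookup failure or connection refused
requests.HTTPError: HTTP error
requests.URLRequired: a URL is missing
requests.TooManyRedirects: the maximum number of redirects was exceeded
requests.ConnectTimeout: timed out while connecting to the remote server
requests.Timeout: the request to the URL timed out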
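
The inheritance relationship between these exceptions can be checked directly, which helps when deciding how broad an except clause should be. A short sketch:

```python
import requests

names = ["ConnectionError", "HTTPError", "URLRequired",
         "TooManyRedirects", "ConnectTimeout", "Timeout"]

# Each listed exception is a subclass of RequestException.
for name in names:
    exc_class = getattr(requests, name)
    print(name, issubclass(exc_class, requests.RequestException))

# ConnectTimeout is both a ConnectionError and a Timeout.
both = (issubclass(requests.ConnectTimeout, requests.ConnectionError)
        and issubclass(requests.ConnectTimeout, requests.Timeout))
print(both)
```

Catching requests.RequestException therefore handles every case in the table, while catching a subclass handles only that failure mode.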

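SSL certificate verification

Requests can verify SSL certificates for HTTPS requests, just like a web browser.
SSL verification is enabled by default (verify=True); if certificate verification fails, Requests raises an SSLError.

'''
If you set verify to False, Requests skips SSL certificate verification, but it emits a warning.'''
import requests
requests.get('https://kennethreitz.org', verify=False)
# 1. Suppress the warning
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)
# 200
# 2. Pass in a certificate to verify against
requests.get('https://github.com', verify='/path/to/certfile')
# If verify is set to a directory path, the directory must have been processed with the c_rehash tool supplied with OpenSSL.
# s = requests.Session()  # or persist the setting on a session
# s.verify = '/path/to/certfile'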

Proxy settings

Use the proxies parameter to configure proxies. Proxy credentials can be included as well, and SOCKS proxies are also supported.

import requests
proxies = {
    "http": "http://127.0.0.1:9999",  # proxy address and port
    "https": "http://127.0.0.1:8888",
}
response = requests.get("https://www.baidu.com", proxies=proxies)
print(response.text)
'''
proxies = {"http": "http://user:pass@10.10.1.10:3128/",}
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
'''
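
Which proxy is picked for a given URL can be checked offline with the select_proxy helper from requests.utils. In the sketch below the host-specific entry on port 7777 is a made-up addition to the dictionary used above.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://127.0.0.1:9999",
    "https": "http://127.0.0.1:8888",
    # Hypothetical host-specific entry: scheme://hostname keys take priority.
    "http://httpbin.org": "http://127.0.0.1:7777",
}

for url in ("https://www.baidu.com",
            "http://example.com/",
            "http://httpbin.org/get",
            "ftp://example.com/file"):
    print(url, "->", select_proxy(url, proxies))
```

A URL whose scheme has no entry gets no proxy at all, so the request would go out directly.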

Custom authentication

A custom authentication mechanism is implemented as a subclass of requests.auth.AuthBase and is easy to define. Requests provides two common authentication schemes in requests.auth: HTTPBasicAuth and HTTPDigestAuth.

import requests
from requests.auth import HTTPBasicAuth
response = requests.get("http://120.27.34.24:9001/", auth=HTTPBasicAuth("user", "123"))
print(response.status_code)
'''
import requests
response = requests.get("http://120.27.34.24:9001/", auth=("user", "123"))
print(response.status_code)'''
# From the official documentation
from requests.auth import AuthBase
class PizzaAuth(AuthBase):
    """Attaches HTTP Pizza Authentication to the given Request object."""
    def __init__(self, username):
        # setup any auth-related data here
        self.username = username
    def __call__(self, r):
        # modify and return the request
        r.headers['X-Pizza'] = self.username
        return r
# >>> requests.get('http://pizzabin.org/admin', auth=PizzaAuth('kenneth'))
# <Response [200]>
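
What an auth object actually does to a request can be seen on a prepared request, with nothing sent. In this sketch HeaderTokenAuth, the X-Token header, and the token value are hypothetical; only AuthBase and HTTPBasicAuth come from the library.

```python
import base64
import requests
from requests.auth import AuthBase, HTTPBasicAuth

class HeaderTokenAuth(AuthBase):
    """Hypothetical scheme: send a token in an X-Token header."""
    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        # Modify the outgoing request and return it.
        r.headers["X-Token"] = self.token
        return r

url = "http://httpbin.org/headers"
basic = requests.Request("GET", url, auth=HTTPBasicAuth("user", "123")).prepare()
custom = requests.Request("GET", url, auth=HeaderTokenAuth("secret")).prepare()

# Basic auth is the base64 encoding of "user:password" in an Authorization header.
expected = "Basic " + base64.b64encode(b"user:123").decode("ascii")
print(basic.headers["Authorization"], expected)
print(custom.headers["X-Token"])
```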

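The crawling process

    1. Send a request to the server and receive its response, then parse that response. The response may be HTML, which can be parsed with regular expressions; it may be a JSON object; or it may be binary data, which can be saved locally for further processing.
    2. Store the extracted data in a structured form. Simple pages can be processed directly, and structured data such as JSON can be extracted with regular expressions. Note that the raw HTML contains only the page source and may lack some data, which is fetched later through JavaScript API calls.
    3. When a page is more complex, for example rendered with JavaScript, Selenium can be used to simulate a browser loading the page.
    4. Data can be saved in a relational store, a non-relational store, or as binary files.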
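
The parse-then-store steps above can be sketched on a small inline HTML string, so no request is needed. The markup, the regular expression, and the field names href and title are all assumptions made for illustration.

```python
import json
import re

# Sample markup standing in for a downloaded page (made up for this sketch).
html = """
<ul>
  <li class="item"><a href="/a/1">First post</a></li>
  <li class="item"><a href="/a/2">Second post</a></li>
</ul>
"""

# Step 1: parse the response with a regular expression.
pattern = re.compile(r'<a href="(?P<href>[^"]+)">(?P<title>[^<]+)</a>')
records = [m.groupdict() for m in pattern.finditer(html)]

# Step 2: turn the extracted data into a structured, storable form.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
print(serialized)
```

Regular expressions are workable for markup this simple; for real pages a dedicated HTML parser is usually less fragile.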

A general code framework for fetching web pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
