Using the requests library

xiaogeldx

Article link: https://blog.csdn.net/xiaogeldx/article/details/86267999

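Introduction

Requests is an elegant and simple HTTP library for Python, built for human beings.
Requests is one of the most downloaded Python packages of all time, with more than 400,000 downloads per day.
urllib, the HTTP module in Python's standard library, is for historical reasons cumbersome and complicated to use, and its official documentation is sparse, so you often end up reading the source code.
Requests, by contrast, is simple, intuitive, and user-friendly, so programmers can spend their attention on the task rather than on the library.
The official Requests documentation is also thorough and, unusually, has an official Chinese translation: http://cn.python-requests.org/zh_CN/latest/
English documentation: http://docs.python-requests.org/en/master/api/
Author: Kenneth Reitz
Chinese documentation: http://2.python-requests.org/zh_CN/latest/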

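Example 1

The defining trait of Requests is its simple, direct, and elegant style. Request methods, response handling, cookies, URL parameters, and POST data all follow it.

import requests
response = requests.get('http://www.baidu.com')
print(response.request.url)  # same as response.url
print(response.status_code)
# request headers and response headers are separate; this reads a response header
print(response.headers['content-type'])  # header names are case-insensitive
print(response.encoding)
print(response.text)  # decoded text; the automatically chosen encoding may not be the one you want
print(response.content)  # raw bytes
print(response.content.decode('utf8'))  # decode manually to avoid a wrong automatic guess
# Sample output (abridged):
# 200
# text/html
# ISO-8859-1
# <!DOCTYPE html>
# <!--STATUS OK--><html> <head><meta h.....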

Example 2

import requests

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
}
kw = {
    'wd': '星爷'
}
# Without cookies the results page is not returned, so the first request
# is made only to let the Session collect cookies
s = requests.Session()
s.get('http://www.baidu.com/s', params=kw, headers=headers, timeout=3)
cookie = s.cookies
# Put the dict in params and requests URL-encodes it automatically;
# there is no need to call parse.urlencode() by hand as with urllib
resp = s.get('http://www.baidu.com/s', params=kw, headers=headers, timeout=3)
print(resp.content.decode('utf8'))
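
For comparison, the manual encoding step that `params` replaces can be sketched with the standard library alone. This is only an illustration; the names `query` and `full_url` are not from the original example.

```python
from urllib.parse import urlencode

kw = {'wd': '星爷'}

# urlencode percent-encodes the UTF-8 bytes of non-ASCII values
query = urlencode(kw)
full_url = 'http://www.baidu.com/s?' + query
print(full_url)
```

With requests, this string building happens inside the library whenever a dict is passed as `params`.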

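Features

Requests currently covers almost everything needed for web requests. Its features are:
- Keep-Alive and connection pooling
- International domains and URLs
- Sessions with cookie persistence
- Browser-style SSL verification
- Automatic content decoding
- Basic/Digest authentication
- Elegant key/value cookies
- Automatic decompression
- Unicode response bodies
- HTTP(S) proxy support
- Multipart file uploads
- Streaming downloads
- Connection timeouts
- Chunked requests
- .netrc support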

Making requests

Request methods

Unlike urllib, Requests does not require you to build Request objects, openers, and handlers. You call the function Requests provides for the method and pass in the arguments you need.
Each HTTP method has a corresponding API function. A GET request uses get().
A POST request uses post(), with the data to submit passed as the data argument.
The other request types each have their own function as well:
import requests
response = requests.get('https://httpbin.org/get')
response = requests.post('http://httpbin.org/post', data={'key': 'value'})
- A POST request can carry its body in four ways:
  - the body is application/x-www-form-urlencoded
  - the body is multipart/form-data
  - the body is raw
  - the body is binary
response = requests.put('http://httpbin.org/put', data={'key': 'value'})
response = requests.delete('http://httpbin.org/delete')
response = requests.head('http://httpbin.org/get')
response = requests.options('http://httpbin.org/get')
Simple, direct, and clear.
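
To see how the argument chosen affects the body that is sent, a request can be prepared without sending it. The sketch below is a hedged illustration: it uses `requests.Request(...).prepare()` and the `json` argument, both real parts of the requests API that the text above does not cover, and it shows only two of the body types listed.

```python
import json
import requests

url = 'http://httpbin.org/post'

# data=<dict> produces a form-encoded body
form_req = requests.Request('POST', url, data={'key': 'value'}).prepare()
print(form_req.headers['Content-Type'])
print(form_req.body)

# json=<dict> produces a JSON body (one kind of "raw" body)
json_req = requests.Request('POST', url, json={'key': 'value'}).prepare()
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```

Nothing is sent over the network here; `prepare()` only builds the request that `requests.post()` would send.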

Passing URL parameters

URL parameters no longer have to be concatenated onto the URL as with urllib. Build a dict and pass it to the params argument of the request.
import requests
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://httpbin.org/get', params=params)
Sometimes the same parameter name appears several times with different values. A Python dict cannot hold duplicate keys, so give the key a list of values instead:
params = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get('http://httpbin.org/get', params=params)
print(response.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
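
The resulting URL can also be checked without sending anything. This is a minimal sketch that assumes a prepared request (`requests.Request(...).prepare()`, not used in the original) is an acceptable stand-in for the real call.

```python
import requests

params = {'key1': 'value1', 'key2': ['value2', 'value3']}

# Build the request object only; no network traffic happens here
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()
print(prepared.url)
```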

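Custom headers

To customize the request headers, pass a dict to the headers argument in the same way.
url = 'http://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}  # custom headers
response = requests.get(url, headers=headers)
print(response.headers)  # response headers; the headers that were sent are in response.request.headers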

Custom cookies

With Requests there is no need to construct a CookieJar object to send custom cookies. Pass a dict to the cookies argument.
url = 'http://httpbin.org/cookies'
co = {'cookies_are': 'working'}
response = requests.get(url, cookies=co)
print(response.text)  # {"cookies": {"cookies_are": "working"}}

# A Session stores the cookies it receives; read them from s.cookies
s = requests.Session()
s.get(url, headers=headers, timeout=3)
cookie = s.cookies
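
When a plain dict is not enough, for example when a cookie should be limited to one domain and path, requests also accepts a cookie jar. The following is a small sketch using `requests.cookies.RequestsCookieJar` and `requests.utils.dict_from_cookiejar`; the domain and path values are chosen only for illustration.

```python
import requests

# A jar can carry per-cookie domain and path restrictions
jar = requests.cookies.RequestsCookieJar()
jar.set('cookies_are', 'working', domain='httpbin.org', path='/cookies')

# Convert the jar back to a plain dict for inspection
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)
```

The jar can be passed to the cookies argument in place of the dict.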

Setting proxies

When a proxy is needed, build a proxy dict in the same way and pass it to the proxies argument.
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'https://10.10.1.10:1080'
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)

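Redirects

Network requests often run into redirects, that is, status codes starting with 3. Requests follows redirects by default: when it meets one, it automatically requests the new location. Pass allow_redirects=False to turn this off.
response = requests.get('http://github.com', allow_redirects=False)
print(response.url)  # http://github.com/
print(response.headers)
# {'Content-length': '0', 'Location': 'https://github.com/'}
print(response.status_code)  # 301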

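Disabling certificate verification

When a packet-capture tool sits between you and the server, the certificate it presents is not issued by a trusted certificate authority, so verification fails and has to be turned off.
Set the verify argument to False on the request to disable certificate verification.
response = requests.get('https://httpbin.org/get', verify=False)
With verification off, Requests emits a rather annoying warning. It can be silenced as follows: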
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

Setting a timeout

To limit how long a request may take, set the timeout argument (in seconds).
requests.get('http://github.com', timeout=0.01)

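Receiving responses

Response content

A request made with Requests returns a requests.models.Response object, which gives convenient access to the response content.
With urllib, the content read from a response is bytes, which must be decoded with decode() to get a string.
With Requests, the text attribute returns the response content as a string.

Character encoding

Requests guesses the page encoding from the response headers and decodes the body with that guess. Most pages decode correctly.
If text comes out wrongly decoded, specify the encoding manually:
response = requests.get('https://api.github.com/events')
response.encoding = 'utf-8'
print(response.text)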
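
What a wrong guess looks like can be reproduced with plain bytes. This sketch needs no network access and uses made-up variable names; it assumes the server sent UTF-8 while the guessed encoding was ISO-8859-1, as in Example 1.

```python
# Bytes as a server might send them for a UTF-8 page
raw = '星爷'.encode('utf-8')

wrong = raw.decode('iso-8859-1')  # decoding with a wrong guess gives mojibake
right = raw.decode('utf-8')       # decoding with the real encoding

print(repr(wrong))
print(right)
```

Setting `response.encoding = 'utf-8'` before reading `response.text` corresponds to the second decode.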

Binary data

To get the raw binary data, use the content attribute.
response = requests.get('https://api.github.com/events')
print(response.content)

JSON data

If the response is JSON, call the json() method to get it parsed directly into Python data (a dict or a list).
response = requests.get('https://api.github.com/events')
print(response.json())

Status code

Use the status_code attribute to get the response status code.
response = requests.get('http://httpbin.org/get')
print(response.status_code)

Response headers

Use the headers attribute to get the response headers.
print(response.headers)
# Sample output:
# {
#     'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform',
#     'Connection': 'Keep-Alive',
#     'Content-Encoding': 'gzip',
#     'Content-Type': 'text/html',
#     'Date': 'Thu, 10 Jan 2019 16:48:04 GMT',
#     'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT',
#     'Pragma': 'no-cache',
#     'Server': 'bfe/1.0.8.18',
#     'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/',
#     'Transfer-Encoding': 'chunked'
# }
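
The headers object behaves like a dict whose keys ignore case. This small offline sketch uses `requests.structures.CaseInsensitiveDict`, the type requests uses for headers, filled with two values copied from the sample output.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({
    'Content-Type': 'text/html',
    'Server': 'bfe/1.0.8.18',
})

# Any capitalization of the header name finds the same entry
print(headers['content-type'])
print(headers.get('SERVER'))
print('CONTENT-TYPE' in headers)
```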

Cookies returned by the server

Use the cookies attribute to get the cookies the server returned.
url = 'http://example.com/some/cookie/setting/url'
response = requests.get(url)
print(response.cookies['example_cookie_name'])

url

The url attribute shows the URL that was requested.
params = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get('http://httpbin.org/get', params=params)
print(response.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3

Session objects

Requests implements sessions. With a session, state is kept across requests the way a browser keeps it until it is closed.
This is commonly used to fetch data after logging in, so that cookies do not have to be passed again and again.
import requests
session = requests.Session()
session.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
response = session.get('http://httpbin.org/cookies')
print(response.text)
# {"cookies": {"sessioncookie": "123456789"}}
First create a Session object, then make requests through it. The request methods work exactly like the ordinary ones.
Note that headers, cookies, and similar data passed to get() apply only to that single request.
To make headers apply for the whole lifetime of the Session, set them as follows:
# replace the whole headers dict
session.headers = {
    'user-agent': 'my-app/0.0.1'
}
# add one header
session.headers.update({'x-test': 'true'})
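
The difference between session-level and per-request headers can be checked offline. This is a hedged sketch using `Session.prepare_request`, a real requests method the text does not mention; the URL is only a placeholder and nothing is sent.

```python
import requests

session = requests.Session()
session.headers.update({'user-agent': 'my-app/0.0.1'})  # lasts for the whole session

# Headers given for one request are merged on top of the session headers
req = requests.Request('GET', 'http://httpbin.org/headers', headers={'x-test': 'true'})
prepared = session.prepare_request(req)
print(prepared.headers['user-agent'])
print(prepared.headers['x-test'])

# The one-off header was not stored on the session
print('x-test' in session.headers)
```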

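Logging in to Douban

- One session, three steps
# 1. Visit the home page and receive a cookie
# 2. Submit the username and password, carrying the cookie from the previous step
# 3. After verification, keep carrying that cookie on later requests
import requests
login_url = 'https://www.douban.com/accounts/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
form_data = {
    'source': 'index_nav',
    'form_email': '*******',  # account
    'form_password': '*********'  # password
}
# A Session keeps the cookies between the steps listed above
session = requests.Session()
response = session.post(login_url, data=form_data, headers=headers)
print(response.text)
print(response.url)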

Verifying the 12306 captcha

Using cookies

import base64
import re
import requests

def get_point_by_index(indexes):
    """
    Map picture numbers to click coordinates.
    :param indexes: comma-separated picture numbers, e.g. '1,2'
    :return: comma-separated coordinates, e.g. '39,43,109,43'
    """
    point_map = {
        '1': '39,43',
        '2': '109,43',
        '3': '185,43',
        '4': '253,43',
        '5': '39,121',
        '6': '109,121',
        '7': '185,121',
        '8': '253,121',
    }
    return ','.join(point_map[index] for index in indexes.split(','))

cookies = None  # the cookies mark this login flow, so every step must send them
# 1. Visit the login page
login_page_url = 'https://kyfw.12306.cn/otn/resources/login.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
# headers must be passed by keyword; the second positional argument is params
response = requests.get(login_page_url, headers=headers)
cookies = response.cookies
# 2. Download the captcha image
captcha_url = 'https://kyfw.12306.cn/passport/captcha/captcha-image64?login_site=E&module=login&rand=sjrand&1547303075115&callback=jQuery19106706322143238805_1547303062080&_=154730'
captcha_response = requests.get(captcha_url, headers=headers, cookies=cookies)
# Extract the base64-encoded image data
img_data = re.findall(b'"image":"(.*?)"', captcha_response.content)[0]
with open('captcha.jpg', 'wb') as f:
    f.write(base64.b64decode(img_data))  # decoded captcha image bytes
cookies = captcha_response.cookies
# 3. Check the captcha answer
check_captcha_api = 'https://kyfw.12306.cn/passport/captcha/captcha-check'
args = {
    'callback': 'jQuery191032014987830377206_1547304492627',
    'answer': get_point_by_index(input("Enter the numbers of the correct pictures, e.g. 1,2: ")),
    'rand': 'sjrand',
    'login_site': 'E',
    '_': '1547304492633',  # only there to defeat browser caching; a crawler can ignore it
}
check_response = requests.get(check_captcha_api, params=args, headers=headers, cookies=cookies)
print(check_response.text)

Using a session

import base64
import re
import requests

def get_point_by_index(indexes):
    """
    Map picture numbers to click coordinates.
    :param indexes: comma-separated picture numbers, e.g. '1,2'
    :return: comma-separated coordinates, e.g. '39,43,109,43'
    """
    point_map = {
        '1': '39,43',
        '2': '109,43',
        '3': '185,43',
        '4': '253,43',
        '5': '39,121',
        '6': '109,121',
        '7': '185,121',
        '8': '253,121',
    }
    return ','.join(point_map[index] for index in indexes.split(','))

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
session = requests.Session()
# Add the request headers for the whole session
session.headers.update(headers)
# 1. Visit the login page
login_page_url = 'https://kyfw.12306.cn/otn/resources/login.html'
response = session.get(login_page_url)
# 2. Download the captcha image
captcha_url = 'https://kyfw.12306.cn/passport/captcha/captcha-image64?login_site=E&module=login&rand=sjrand&1547303075115&callback=jQuery19106706322143238805_1547303062080&_=154730'
captcha_response = session.get(captcha_url)
# Extract the base64-encoded image data
img_data = re.findall(b'"image":"(.*?)"', captcha_response.content)[0]
with open('captcha.jpg', 'wb') as f:
    f.write(base64.b64decode(img_data))  # decoded captcha image bytes
# 3. Check the captcha answer; the session sends the cookies automatically
check_captcha_api = 'https://kyfw.12306.cn/passport/captcha/captcha-check'
args = {
    'callback': 'jQuery191032014987830377206_1547304492627',
    'answer': get_point_by_index(input("Enter the numbers of the correct pictures, e.g. 1,2: ")),
    'rand': 'sjrand',
    'login_site': 'E',
    '_': '1547304492633',  # only there to defeat browser caching; a crawler can ignore it
}
check_response = session.get(check_captcha_api, params=args)
print(check_response.text)
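
The image-extraction step can be tried offline on a stand-in payload. Everything about the payload below is an assumption made for this sketch: the callback name `cb`, the `result_code` field, and the fake image bytes are invented, and the real 12306 response may be shaped differently.

```python
import base64
import re

# Hypothetical JSONP-style payload; not a real captcha response
fake_image = b'not-a-real-jpeg'
payload = b'/**/cb({"image":"' + base64.b64encode(fake_image) + b'","result_code":"0"});'

# Same regex as in the scripts above: capture the base64 text of the "image" field
img_data = re.findall(b'"image":"(.*?)"', payload)[0]
decoded = base64.b64decode(img_data)
print(decoded == fake_image)
```

The non-greedy `(.*?)` stops at the first closing quote, which is why later fields in the payload do not leak into the image data.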

A small example

Downloading the first 20 Baidu Images results for "星爷"

import re
import requests
request_url = 'https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1574482313106_R&pv=&ic=&nc=1&z=&hd=&latest=&copyright=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&ctd=1574482313108%5E00_1400X216&sid=&word=%E6%98%9F%E7%88%B7'  # the Request URL of the search request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}
# Other User-Agent strings that can be used instead:
# Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1
# Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0
# Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
# Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)
# Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)
# Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E) QQBrowser/6.9.11079.201
# Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)
# headers must be passed by keyword; the second positional argument is params
response = requests.get(request_url, headers=headers)
img_urls = re.findall(r'data-imgurl="(.*?)"', response.text)
for index, image_url in enumerate(img_urls[:20]):
    image_response = requests.get(image_url)
    # name the file after the last path segment of the image URL
    image_filename = '%s.%s.jpg' % (index, image_url.split('/')[-1].split('.')[0])
    with open(image_filename, 'wb') as f:
        f.write(image_response.content)
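
The extraction and file-naming logic can be tested without touching Baidu. The HTML fragment below is invented for this sketch, and it assumes the file name should come from the last path segment of each image URL. Real search-result markup may look different.

```python
import re

# Made-up HTML fragment standing in for the search results page
sample_html = (
    '<li data-imgurl="https://example.com/a/first.jpg"></li>'
    '<li data-imgurl="https://example.com/b/second.png"></li>'
)

img_urls = re.findall(r'data-imgurl="(.*?)"', sample_html)

names = []
for index, image_url in enumerate(img_urls):
    # last path segment without its extension
    stem = image_url.split('/')[-1].split('.')[0]
    names.append('%s.%s.jpg' % (index, stem))
print(names)
```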
