Python 3 Web Crawlers, Part 3: Basic Usage of the Requests Library

weixin_39778668

1. What Is Requests?

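Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 open-source license. It is much more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs. Put simply, it is a simple, easy-to-use HTTP library for Python.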

2. Installing Requests

If you are a beginner, installing it with a stock Python 3 is recommended.

    pip3 install requests

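If you already have some Python background (knowing the basic syntax is enough), installing with Anaconda is more convenient and avoids some version problems; after all, Python 2 and Python 3 are practically two different languages (a tongue-in-cheek jab (⊙﹏⊙)b).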

    conda install requests

3. Common Methods

First, let's get a feel for how convenient Requests is.

    import requests

    response = requests.get('http://www.baidu.com')

    print(response.status_code)
    print(response.text)
    print(type(response.text))
    print(response.cookies)

Run the code and you can see that response.text is already a str, so there is no need to call decode ourselves. The cookie object is also available directly.

    import requests

    requests.post('http://httpbin.org/post')
    requests.put('http://httpbin.org/put')
    requests.options('http://httpbin.org/get')

As you can see, every kind of request is easy to issue. httpbin.org is a site for testing HTTP requests. Now let's look at some commonly used methods.

A basic GET request

    import requests

    response = requests.get('http://httpbin.org/get')
    print(response.text)
    '''
    {
      "args": {},
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.14.2"
      },
      "origin": "127.0.0.1",
      "url": "http://httpbin.org/get"
    }
    '''

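This is the simplest GET request; the result is shown inside the ''' ''' string. The response body is JSON text laid out like a dictionary. Note: no proxy was used, and to guard against malicious attacks on the IP, the origin value shown here was altered. The real response contains the IP address the request came from.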

A GET request with parameters

    import requests

    data = {
        'name': 'zhangsan',
        'age': 22
    }
    response = requests.get('http://httpbin.org/get', params=data)
    print(response.text)
    '''
    {
      "args": {
        "age": "22",
        "name": "zhangsan"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.14.2"
      },
      "origin": "127.0.0.1",
      "url": "http://httpbin.org/get?name=zhangsan&age=22"
    }
    '''

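We can build a dictionary and pass it to the params argument to send parameters to the server. As the url field in the response shows, the effect is the same as requesting url?name=zhangsan&age=22.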
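To see how the params dictionary ends up in the URL without touching the network, one option is to build the request and inspect it before it is sent. This is only a sketch: the `requests.Request(...).prepare()` route is not used in the original post and is chosen here purely to make the encoding step visible.

```python
import requests

data = {'name': 'zhangsan', 'age': 22}

# Build the request without sending it, so the final URL can be inspected.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()

print(prepared.url)
```

The printed URL should match the url field that httpbin echoes back in the example above.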

Parsing JSON

    import requests
    import json

    response = requests.get('http://httpbin.org/get')
    print(response.json())  # equivalent to json.loads(response.text)
    print(type(response.json()))

    '''
    {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
    'Connection': 'close', 'Host': 'httpbin.org',
    'User-Agent': 'python-requests/2.14.2'}, 'origin': '113.128.88.6',
    'url': 'http://httpbin.org/get'}
    <class 'dict'>
    '''

This parses the returned JSON body into a Python object, which here is a dict.
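What response.json() does can be reproduced with the standard json module. In the sketch below, `body` is a hand-written stand-in for response.text (an assumption for illustration, not a real httpbin reply).

```python
import json

# Stand-in for response.text: a JSON body always arrives as a string.
body = '{"args": {}, "url": "http://httpbin.org/get"}'

# Roughly what response.json() does with the body.
parsed = json.loads(body)

print(type(body))
print(type(parsed))
print(parsed['url'])
```

Once parsed, the fields are reached with ordinary dictionary indexing rather than string processing.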

Binary data

    import requests

    response = requests.get('http://github.com/favicon.ico')
    with open(r'F:\favicon.ico', 'wb') as f:
        f.write(response.content)

The code above saves an image to the local disk.

    >>> print(type(response.text))
    <class 'str'>
    >>> print(type(response.content))
    <class 'bytes'>

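This shows the difference between text and content: content holds the raw binary data (bytes), so that is what should be written when saving the response to a file.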
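The relationship between the two attributes can be imitated with plain bytes and str. This is a sketch under one assumption: `content` below is a made-up stand-in for response.content, and UTF-8 stands in for whatever encoding the response actually uses.

```python
# Stand-in for response.content: the raw bytes received from the server.
content = '爬虫'.encode('utf-8')

# response.text is roughly content decoded with the response's encoding.
text = content.decode('utf-8')

print(type(content), len(content))
print(type(text), len(text))
```

If response.text comes out garbled on a real page, setting response.encoding before reading text is the usual remedy.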

Adding headers

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36'
    }

    response = requests.get('http://www.zhihu.com/explore', headers=headers)
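The reason for passing a User-Agent is that Requests otherwise announces itself as python-requests/<version>, as the httpbin output earlier shows, and some sites reject that. The following sketch compares the default with a custom value without sending anything; the shortened User-Agent string is just an illustrative value.

```python
import requests

custom = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

# Without custom headers, Requests identifies itself as python-requests/<version>.
print(requests.utils.default_user_agent())

# With headers=..., the custom value is what the request carries.
prepared = requests.Request('GET', 'http://httpbin.org/get', headers=custom).prepare()
print(prepared.headers['User-Agent'])
```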

POST requests

    import requests

    data = {'name': 'zhangsan', 'age': '22'}
    response = requests.post('http://httpbin.org/post', data=data)
    print(response.text)

    '''
    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "age": "22",
        "name": "zhangsan"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "close",
        "Content-Length": "20",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.14.2"
      },
      "json": null,
      "origin": "127.0.0.1",
      "url": "http://httpbin.org/post"
    }
    '''

As you can see, whatever is passed through the data argument is submitted as form data.
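The form encoding can be checked offline by preparing the request instead of sending it. The sketch below also prepares the same dictionary with the json= argument for contrast; json= is not covered in the original post and is included only to show what "form" means by comparison.

```python
import json
import requests

data = {'name': 'zhangsan', 'age': '22'}
url = 'http://httpbin.org/post'

# data=... is form-encoded, like an HTML form submission.
form_req = requests.Request('POST', url, data=data).prepare()
print(form_req.headers['Content-Type'])
print(form_req.body)

# json=... serializes the same dictionary as a JSON document instead.
json_req = requests.Request('POST', url, json=data).prepare()
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```

The form body is 20 characters long, which agrees with the Content-Length of 20 in the httpbin output above.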

File upload

    import requests

    # Open the file in binary mode and let the with block close it afterwards.
    with open(r'F:\favicon.ico', 'rb') as f:
        response = requests.post('http://httpbin.org/post', files={'file': f})
    print(response.text)

Run the code and print the response, and you will see that the value under the key file is the binary content of the local image.
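What the files argument does to the request can be inspected without a real file or a network call. In this sketch, the in-memory `fake_icon` and its four bytes are made-up stand-ins for the local image.

```python
import io
import requests

# An in-memory stand-in for the opened image file.
fake_icon = io.BytesIO(b'\x00\x00\x01\x00')

prepared = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'file': ('favicon.ico', fake_icon)},
).prepare()

# files=... switches the body to multipart/form-data.
print(prepared.headers['Content-Type'])
print(b'filename="favicon.ico"' in prepared.body)
```

The dictionary key (file here) becomes the form field name, which is why httpbin reports the upload under that key.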

Getting and printing cookies

    import requests

    response = requests.get('https://www.baidu.com')
    for key, value in response.cookies.items():
        print(key + '=' + value)

    # BDORZ=27315

This gives us the name and value of each cookie.

Session persistence

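We retrieve cookies in order to maintain a session. The following example relies on a feature of the httpbin test site: we first set a cookie through a URL, then ask the server which cookies it received.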

    import requests

    requests.get('http://httpbin.org/cookies/set/name/zhangsan')
    response = requests.get('http://httpbin.org/cookies')
    print(response.text)

    '''
    {
      "cookies": {}
    }
    '''

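Here the cookies come back empty. Tested this way, the two calls are independent requests (picture them coming from two different browsers), so the cookie set by the first request is not available to the second. That is why we need session persistence.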

    import requests

    s = requests.Session()
    print(type(s))  # <class 'requests.sessions.Session'>

    s.get('http://httpbin.org/cookies/set/name/zhangsan')
    response = s.get('http://httpbin.org/cookies')
    print(response.text)

    '''
    {
      "cookies": {
        "name": "zhangsan"
      }
    }
    '''

With a Session object, the session is maintained across requests.
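The difference between the two approaches comes down to the cookie jar that a Session carries along. The sketch below shows this offline by preparing requests rather than sending them; setting the cookie by hand with `s.cookies.set(...)` is an assumption made here to imitate what the /cookies/set call does over the network.

```python
import requests

url = 'http://httpbin.org/cookies'

# A standalone request carries no cookies.
standalone = requests.Request('GET', url).prepare()
print(standalone.headers.get('Cookie'))

# A Session keeps a cookie jar and attaches it to each request it prepares.
s = requests.Session()
s.cookies.set('name', 'zhangsan', domain='httpbin.org', path='/')
via_session = s.prepare_request(requests.Request('GET', url))
print(via_session.headers.get('Cookie'))
```

Only the request prepared through the session carries a Cookie header, which mirrors the empty and non-empty httpbin responses above.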

SSL certificate verification

    import requests

    response = requests.get('https://www.12306.cn')

    '''
    raise SSLError(e, request=request)
    requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],
    '''

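HTTPS verifies the site's certificate, so requesting a site whose certificate cannot be verified (for example, one that was not issued by a trusted authority) raises an SSL error. To get around this, set the verify parameter.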

    import requests
    from requests.packages import urllib3

    urllib3.disable_warnings()  # suppress the warning messages

    response = requests.get('https://www.12306.cn', verify=False)

Now the page is returned successfully.

    import requests
    from requests.exceptions import ReadTimeout, HTTPError, RequestException

    try:
        response = requests.get('http://www.baidu.com', timeout=0.01)
        print(response.status_code)
    except ReadTimeout:
        print('Timeout')
    except HTTPError:
        print('http error')
    except RequestException:
        print('Error')

    # Timeout

For details, see the exception handling section of the official documentation; the Requests library defines many exception classes.
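The order of the except clauses above follows from how these classes inherit from one another. The sketch below checks a few of those relationships directly. `Timeout` and `ConnectTimeout` are not used in the original example and are imported here only for comparison.

```python
from requests.exceptions import (
    ConnectTimeout,
    HTTPError,
    ReadTimeout,
    RequestException,
    Timeout,
)

# Every Requests exception derives from RequestException, so it must come last.
print(issubclass(ReadTimeout, RequestException))
print(issubclass(HTTPError, RequestException))

# ReadTimeout and ConnectTimeout are siblings under Timeout, so
# `except ReadTimeout` does not catch a connection-phase timeout.
print(issubclass(ConnectTimeout, ReadTimeout))
print(issubclass(ConnectTimeout, Timeout))
```

Two practical consequences follow. Catching Timeout instead of ReadTimeout covers both timeout phases. HTTPError is raised by response.raise_for_status(), so the HTTPError branch only fires if that method is called.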