Python Web Scraping from Beginner to Expert: Using the requests Library (Part 1): Basic Usage

von Neumann

Article link: https://blog.csdn.net/hy592070616/article/details/90046229

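In "Using the Request Library urllib" we covered the basics of urllib, but it has real inconveniences: handling web authentication and cookies, for example, requires writing an Opener and a Handler. The more powerful requests library was created to make these operations easier. With it, cookies, login authentication, proxy settings, and similar tasks become straightforward.

The urlopen() method in urllib requests a page with the GET method by default (when no data is passed); the corresponding method in requests is get().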

import requests

response = requests.get('https://www.baidu.com')
print(response.status_code)
print(response.text)
print(response.cookies)

The output is as follows:

200
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=ç™¾åº¦ä¸€ä¸‹ class="bg s_btn" autofocus></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>æ–°é—»</a> <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>åœ°å›¾</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>è§†é¢‘</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>è´´å§</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>ç™»å½•</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">ç™»å½•</a>');
                </script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">æ›´å¤šäº§å“</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>å
³äºŽç™¾åº¦</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>ä½¿ç”¨ç™¾åº¦å‰å¿
è¯»</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§åé¦ˆ</a>&nbsp;äº¬ICPè¯030173å·&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

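Here we call get() to do the same thing as urlopen() and get back a Response object, then print its status code, body, and cookies. The return value is of type requests.models.Response, the response body (response.text) is a str, and the cookies are a RequestsCookieJar.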
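The Chinese text in the HTML above comes out garbled (for example inside the title tag). A likely cause, assuming the server sends Content-Type: text/html without a charset as in the response headers printed later in this article, is that requests falls back to ISO-8859-1 when decoding response.text. The sketch below reproduces the effect offline with plain bytes; the sample string is the page title, and no request is sent.

```python
# Offline illustration of the garbling; no network request is made.
title = '百度一下，你就知道'

# Bytes as the server would send them (UTF-8).
raw = title.encode('utf-8')

# Decoding UTF-8 bytes as ISO-8859-1 never fails, but yields mojibake.
garbled = raw.decode('iso-8859-1')

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled == title)   # False
print(repaired == title)  # True

# With a real response, the usual fix is to set the encoding first:
#     response.encoding = 'utf-8'   # or response.apparent_encoding
#     print(response.text)
```

On a live response, setting response.encoding before reading response.text is usually enough to get readable Chinese.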

import requests

response = requests.post('http://httpbin.org/post')
response = requests.put('http://httpbin.org/put')
response = requests.delete('http://httpbin.org/delete')
response = requests.head('http://httpbin.org/get')
response = requests.options('http://httpbin.org/get')

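Here post(), put(), delete(), head(), and options() issue POST, PUT, DELETE, HEAD, and OPTIONS requests respectively.

GET Requests

First, let's build the simplest possible GET request. The target URL is https://httpbin.org/get; when this site receives a GET request from a client, it returns the details of that request: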

import requests

response = requests.get('https://httpbin.org/get')
print(response.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "origin": "218.17.40.167, 218.17.40.167", 
  "url": "https://httpbin.org/get"
}

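We successfully sent a GET request, and the result contains the request headers, URL, IP address, and other information. So how do we attach extra information to a GET request? Say we want to add two parameters: blog with the value hy592070616 and corporation with the value HUAWEI. We can build this request URL with the params argument.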

import requests

data = {
    'blog': 'hy592070616', 
    'corporation': 'HUAWEI'
}
response = requests.get('https://httpbin.org/get', params=data)
print(response.text)

Output:

{
  "args": {
    "blog": "hy592070616", 
    "corporation": "HUAWEI"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4"
  }, 
  "origin": "218.17.40.167, 218.17.40.167", 
  "url": "https://httpbin.org/get?blog=hy592070616&corporation=HUAWEI"
}
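The "url" field in the output above shows the query string that requests built from the params dictionary. As a minimal sketch, the same URL can be inspected without sending anything by preparing the request first. requests.Request and its prepare() method are part of the library; the variable names here are only illustrative.

```python
import requests

data = {
    'blog': 'hy592070616',
    'corporation': 'HUAWEI'
}

# Build the request object and prepare it; nothing is sent over the network.
prepared = requests.Request('GET', 'https://httpbin.org/get', params=data).prepare()

# The params dictionary has been URL-encoded and appended as a query string.
print(prepared.url)
```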

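The output shows that the request URL was automatically constructed as https://httpbin.org/get?blog=hy592070616&corporation=HUAWEI. Also, the body returned is a str, but a special one: it is in JSON format. To parse it directly into a dictionary, call the json() method.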

import requests

data = {
    'blog': 'hy592070616', 
    'corporation': 'HUAWEI'
}
response = requests.get('https://httpbin.org/get', params=data)
print(response.json())

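As the output below shows, calling json() converts a JSON-formatted response body into a dictionary. Note that if the body is not valid JSON, parsing fails and a json.decoder.JSONDecodeError is raised (recent requests releases raise requests.exceptions.JSONDecodeError instead; both are subclasses of ValueError).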

{'args': {'blog': 'hy592070616', 'corporation': 'HUAWEI'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '218.17.40.167, 218.17.40.167', 'url': 'https://httpbin.org/get?blog=hy592070616&corporation=HUAWEI'}
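Since json() raises on a non-JSON body, a scraper usually wants to guard the call. The helper below is a sketch, not part of requests: safe_json and the FakeResponse stand-in are hypothetical names, used so the example runs offline. It relies only on JSON decoding errors being subclasses of ValueError.

```python
import json


def safe_json(response, default=None):
    """Return response.json(), or `default` if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # JSON decoding errors subclass ValueError
        return default


class FakeResponse:
    """Hypothetical stand-in for a requests Response, for offline use only."""

    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


good = safe_json(FakeResponse('{"blog": "hy592070616"}'))
bad = safe_json(FakeResponse('<html>not json</html>'))
print(good)
print(bad)
```

With a real response object, safe_json(response) would be called the same way.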

The URL above returns a JSON string. Requesting an ordinary web page returns that page's content instead. Take Zhihu's Explore page as an example:

import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

response = requests.get('https://www.zhihu.com/explore', headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a', re.S)
titles = re.findall(pattern, response.text)
print(titles)

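Here we add headers containing a User-Agent field, which identifies the browser. Without it, Zhihu refuses the request. We then use a basic regular expression to match all the question titles. Regular expressions are covered in detail in a later article; here they only support the example. The output shows that we scraped all the question titles: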

['\n为什么大家的吸猫热情越来越高涨？\n', '\n从哪里开始觉得罗云熙真的在用心塑造角色？\n', '\n防弹少年团成员对金泰亨有过哪些温柔的举动？\n', '\n如何评价 iPhone 上的振动反馈？\n', '\n如何看待2019年5月11日微博热搜“杨超越 整容医生”一事？\n', '\n如何看待杨超越向海里吐口水的行为？\n', '\n程序员们有些什么好玩儿的程序分享？\n', '\n家里面有哪个亲戚令你感到恶心？\n', '\n有哪些句子是真正写到你的心里去了？\n', '\nVariational autoencoder 这个名称中的 variational 是什么意思？\n']
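To see what the pattern captures without depending on the live page, the sketch below runs it on a small hand-written HTML fragment. The fragment and its titles are made up for illustration and only mimic the explore-feed / question_link structure the pattern assumes. strip() removes the surrounding newlines visible in the output above.

```python
import re

# Hypothetical fragment imitating the structure the pattern expects.
html = '''
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">
Example question one?
</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">
Example question two?
</a></h2>
</div>
'''

# re.S lets "." match newlines, so one match can span several lines.
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a', re.S)
titles = [title.strip() for title in pattern.findall(html)]
print(titles)
```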

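In the example above we scraped a Zhihu page, which is returned as an HTML document. Images, audio, and video files, by contrast, are binary data; we can see and hear them only because each has a specific storage format and a matching decoder. So to scrape them we need their raw bytes. Take GitHub's site icon as an example: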

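import requests

response = requests.get('https://github.com/favicon.ico')
# response.content holds the raw bytes of the response body
print(response.content)
# Open the file in binary write mode and save the bytes
with open('favicon.ico', 'wb') as f:
    f.write(response.content)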

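Here open() takes the file name as its first argument; the second argument, 'wb', opens the file for writing in binary mode so that binary data can be written to it. After the script finishes, an icon file named favicon.ico appears in the folder. The printed value of response.content starts with a b, which indicates that it is bytes data.

POST Requests

We have covered the basic GET request. The other common request method is POST. Sending a POST request with requests is just as simple: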

import requests

data = {
    'blog': 'hy592070616', 
    'corporation': 'HUAWEI'
}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)

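Here we request http://httpbin.org/post; when the site receives a POST request, it returns the details of that request.

Responses

After a request is sent, what comes back is the response. In the examples above we used text and content to get the response body. Many other attributes and methods expose further information, such as the status code, response headers, and cookies.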
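For a form-style POST, requests encodes the data dictionary into the request body rather than into the URL. As a hedged sketch, preparing the request locally (without sending it) makes this visible. The variable names are illustrative.

```python
import requests

data = {
    'blog': 'hy592070616',
    'corporation': 'HUAWEI'
}

# Prepare a POST request locally; nothing is sent over the network.
prepared = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()

print(prepared.method)                   # the HTTP method
print(prepared.url)                      # the URL carries no query string
print(prepared.body)                     # the form-encoded payload
print(prepared.headers['Content-Type'])  # set automatically for form data
```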

import requests

response = requests.get('http://www.baidu.com')
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.url)
print(response.history)

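Here we print the status_code attribute to get the status code, headers to get the response headers, cookies to get the cookies, url to get the URL, and history to get the request history. The output is as follows: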

200
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Thu, 16 May 2019 08:22:58 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
http://www.baidu.com/
[]
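Although the headers printed above look like a plain dict, response.headers is a requests.structures.CaseInsensitiveDict, so header names can be looked up in any letter case. The sketch below builds one by hand from two of the header values shown above, purely to demonstrate the lookup offline.

```python
from requests.structures import CaseInsensitiveDict

# Hand-built copy of two headers from the output above (no request is made).
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html',
    'Server': 'bfe/1.0.8.18',
})

# Lookups ignore the letter case of the header name.
print(headers['content-type'])
print(headers['CONTENT-TYPE'])
print(headers.get('server'))
```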

The status code is commonly used to check whether a request succeeded, and requests provides a built-in status code lookup object, requests.codes:

import requests

response = requests.get('http://www.baidu.com')
# Exit if the request did not succeed; otherwise report success
if response.status_code != requests.codes.ok:
    exit()
print('Request Successful')

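Here we compare the returned status code with the built-in success code to make sure the request got a normal response: on success a message is printed, otherwise the program exits. requests.codes.ok is the success status code 200. The status codes and their lookup names are listed below: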

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')

For example, to check whether the result is a 404, compare against requests.codes.not_found.
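A short sketch of these lookups: the names are attributes of requests.codes and evaluate to plain integers, so they can be compared directly with response.status_code. The status variable below is a hypothetical value standing in for a real response's status code.

```python
import requests

# Lookup names are attributes that evaluate to integer status codes.
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

# Hypothetical status code standing in for response.status_code.
status = 404
is_missing = status == requests.codes.not_found
print(is_missing)                # True
```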
