Crawler Basics || 2.1 Introduction to requests (richer than urllib, with a simple Zhihu crawler)

Latest recommended article published on 2022-12-06 23:15:10

Watson_Ashin


Categories: Crawler, Python code. Tags: crawler, python

Article link: https://blog.csdn.net/Watson_Ashin/article/details/104359760


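The previous posts covered nearly all of urllib's functionality. urllib has a number of inconveniences, though: for example, handling more complex requests requires an Opener and a Handler. The requests library bundles these features behind a simpler interface, which makes writing a crawler much easier.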

import requests

response = requests.get('https://www.baidu.com')  # get() sends a GET request

print(type(response))  # type of the returned object
print(response.status_code)  # status code
print(type(response.text))  # type of the response body
print(response.text)  # response body
print(response.cookies)  # cookies are available directly on the response

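The return value is a Response object whose attributes include the status code, the response body (a str), and the cookies (a RequestsCookieJar).
Besides get, requests also provides post, put, delete, head, options, and other request methods. The following sections go through the basic request types.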
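
As a rough illustration of the other methods, the sketch below builds one request per HTTP method without sending anything, using requests.Request(...).prepare(). The /anything path is only a placeholder chosen for this example. On a live connection the shortcuts requests.put(url), requests.delete(url), requests.head(url), and requests.options(url) would send the corresponding request.

```python
import requests

# Build (but do not send) one request per HTTP method.
# The /anything path is only a placeholder target for this sketch.
url = 'http://httpbin.org/anything'
methods = ['GET', 'POST', 'PUT', 'DELETE', 'HEAD', 'OPTIONS']

prepared_requests = [requests.Request(method, url).prepare() for method in methods]
for prepared in prepared_requests:
    # Each PreparedRequest records the method and the final URL
    print(prepared.method, prepared.url)
```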

1. GET requests

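GET is one of the most common HTTP request methods. When we request http://httpbin.org/get, the site checks whether the client sent a GET request and, if so, returns the corresponding request information.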

import requests 
response = requests.get('http://httpbin.org/get')
print(response.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4", 
    "X-Amzn-Trace-Id": "Root=1-5e429772-b29169665c22ff227dfe1099"
  }, 
  "origin": "112.51.30.41", 
  "url": "http://httpbin.org/get"
}
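
The body printed above is JSON text, so it can be turned into a Python dictionary. The sketch below does this offline with json.loads on an abbreviated copy of that body; on a live response, response.json() performs the same parsing.

```python
import json

# Abbreviated copy of the body printed above (what response.text would hold).
text = '''{
  "args": {},
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "url": "http://httpbin.org/get"
}'''

parsed = json.loads(text)  # response.json() does this parsing on a real response
print(type(parsed))
print(parsed['headers']['User-Agent'])
print(parsed['url'])
```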

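- Adding parameters to the URL with params
The GET request above returns the request headers, the URL, the client IP address, and other information.
To attach extra information to a GET request, pass it as follows.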

import requests 
data = {
    'name':'watson',
    'age':25
}
response = requests.get('http://httpbin.org/get',params = data)
print(response.text)

{
  "args": {
    "age": "25", 
    "name": "watson"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4", 
    "X-Amzn-Trace-Id": "Root=1-5e43c2a1-6708638f8220e658c26d5d2d"
  }, 
  "origin": "112.51.30.41", 
  "url": "http://httpbin.org/get?name=watson&age=25"
}

Passing a dictionary through the params argument adds its entries to the URL as a query string, which gives "http://httpbin.org/get?name=watson&age=25".
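
The URL that requests builds from params can be inspected without sending anything. This is a minimal sketch using requests.Request(...).prepare(); the tag parameter in the second case is a made-up name, used only to show how a list value is encoded.

```python
import requests

data = {'name': 'watson', 'age': 25}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(prepared.url)

# A list value is sent as a repeated key ('tag' is a hypothetical parameter name)
list_params = {'tag': ['python', 'crawler']}
prepared_list = requests.Request('GET', 'http://httpbin.org/get', params=list_params).prepare()
print(prepared_list.url)
```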

- Fetching a web page

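The response.text above is a string in JSON format, so it can be converted to a dictionary with .json(). When the request targets an ordinary web page, the response is not JSON but the page's HTML source.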

import requests
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'
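}  # Set a request header to pose as a browser, the simplest way around anti-crawler checks; without it the server returns 400 Bad Request
response = requests.get('https://www.zhihu.com/explore',headers=headers)
print(response.text) # HTML source of https://www.zhihu.com/explore, fetched without logging in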
pattern = re.compile('<a class="ExploreSpecialCard-.*?" href="/(question|special)/.*?" target="(blank|_blank)" rel="noopener noreferrer" data-za-detail-view-id=".*?">(.*?)</a>',re.S)
context = re.findall(pattern=pattern, string=response.text)
print('='*20)
print(context)

<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta name="viewport" content="width=device-width,ini...........................
.............................................
.............................................
====================
[('special', 'blank', '爱无处不在'), ('question', 'blank', '情侣养猫，猫会更喜欢谁？'), ('special', 'blank', '惊喜「宠」爱你'), ('question', 'blank', 'TA 在情人节送给我一只狗狗，应该怎么养？'), ('special', 'blank', '「宠」你一生'), ('question', 'blank', '情侣一起养宠物是怎样的体验？'), ('special', 'blank', '只有想不到'), ('question', 'blank', '雷神去供电局上班能开多少工资？'), ('special', 'blank', '打破次元壁'), ('question', 'blank', '如果葛优扮演钢铁侠，会怎么演？'), ('special', 'blank', '漫威一家人'), ('question', 'blank', '用雷神锤子的材料让托尼制作一套战衣钢铁侠会怎么样？'), ('special', 'blank', '一堆 CP 送单身狗'), ('question', 'blank', '影视剧里有哪些高甜的 cp ？'), ('special', 'blank', '一对 CP 宅在家'), ('question', 'blank', '在家过情人节，有哪些好物能营造出浪漫、温馨的氛围？'), ('special', 'blank', '一对 CP 云过节 '), ('question', 'blank', '异地恋的情人节怎么过？'), ('special', 'blank', '我是谁'), ('special', 'blank', '我好怕'), ('question', 'blank', '就因为我携带那么多病毒，你们就想把我消灭？'), ('special', 'blank', '我干了啥'), ('question', 'blank', '人类：蝙蝠野味一直有人吃，为啥疫情最近才爆发？')]

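By posing as a browser we fetched the source of the Zhihu topics page, then used a regular expression to extract the topic titles. A few notes on this regular expression:

'<a class="ExploreSpecialCard-.*?" href="/(question|special)/.*?" target="(blank|_blank)" rel="noopener noreferrer" data-za-detail-view-id=".*?">(.*?)</a>'

The first .*? is there because the title links use two class suffixes, title and contentTitle. (question|special) uses | as an "or", matching either question or special, and (blank|_blank) works the same way. The final (.*?) captures the text we want. Because the pattern has three groups, findall returns a 3-tuple per match; iterating over the list and taking the last element of each tuple gives the titles, which is not done here.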
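
The final iteration step can be sketched as follows. The HTML fragment is a hypothetical stand-in shaped like the explore cards, not real page data; the pattern is the one used in the crawler above.

```python
import re

# Hypothetical HTML fragment shaped like the Zhihu explore cards (not real page data).
sample_html = (
    '<a class="ExploreSpecialCard-title" href="/special/123" target="_blank" '
    'rel="noopener noreferrer" data-za-detail-view-id="1">Topic A</a>\n'
    '<a class="ExploreSpecialCard-contentTitle" href="/question/456" target="_blank" '
    'rel="noopener noreferrer" data-za-detail-view-id="2">Question B?</a>'
)

pattern = re.compile(
    '<a class="ExploreSpecialCard-.*?" href="/(question|special)/.*?" '
    'target="(blank|_blank)" rel="noopener noreferrer" '
    'data-za-detail-view-id=".*?">(.*?)</a>',
    re.S,
)

# With three groups, findall returns one 3-tuple per match
matches = pattern.findall(sample_html)
# Keep only the third group, the link text
titles = [title for _kind, _target, title in matches]
print(matches)
print(titles)
```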

- Fetching binary data

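The Zhihu topics page above comes back as an HTML document. If the request targets a file such as an image or a video, it has to be saved and parsed differently, because such files are fundamentally binary data.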

import requests 
response = requests.get('https://baidu.com/favicon.ico')
print(response.text)
print(response.content)

The two print calls give:

�������������������˥��Ù��Ù��Ù��Ù��Ù��Ù��Ù������������������������������������������������������Ҳ������e���3���3���3���3。。。。。。。。

and

\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff。。。。。。。。

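The output is too long to paste in full, so try it yourself. The code prints two attributes of the Response object, text and content.
text is the body decoded as a string, which for an image is just garbled characters, while content is the raw body as bytes. Because an image is binary data, viewing it as text shows nothing useful; it should be saved to a file instead. The code below writes the file into the current working directory. Video and audio files are handled the same way; just make sure the file extension is correct.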

import requests 
response = requests.get('https://baidu.com/favicon.ico')
with open('baiduico.ico','wb') as f:
    f.write(response.content)
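
To see why text looks garbled while content stays intact, the sketch below works offline on a few made-up bytes (not a real icon). Decoding them as text replaces every invalid byte with the U+FFFD placeholder, roughly what happens behind response.text, whereas writing the bytes in 'wb' mode preserves them exactly.

```python
import os
import tempfile

# Made-up bytes standing in for binary file content (not a real icon)
raw = bytes([0x00, 0x00, 0x01, 0x00, 0xff, 0xfe, 0x89])

# Decoding binary data as text turns invalid bytes into U+FFFD placeholders
as_text = raw.decode('utf-8', errors='replace')
print(repr(as_text))

# Writing in binary mode and reading back returns the bytes unchanged
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.bin')
    with open(path, 'wb') as f:
        f.write(raw)
    with open(path, 'rb') as f:
        restored = f.read()

print(restored == raw)
```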

2. POST requests

POST is another common request method. Sending a POST request with requests is just as simple.

import requests
data = {
    'name':'watson',
    'age':25
}
response = requests.post('http://httpbin.org/post',params = data)
print(response.text)

{
  "args": {
    "age": "25", 
    "name": "watson"
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.18.4", 
    "X-Amzn-Trace-Id": "Root=1-5e44f2c9-896c936048726324c4b92b86"
  }, 
  "json": null, 
  "origin": "112.51.30.41", 
  "url": "http://httpbin.org/post?name=watson&age=25"
}

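httpbin echoes the request back, which confirms that the POST request was sent successfully. Note that this example passes the dictionary through params, so the values appear in args and in the URL query string while form stays empty. To submit them as form data in the request body, pass data=data instead; httpbin then echoes them under form. Fetching web pages and binary data works the same way as with GET.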
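
The difference between the params, data, and json arguments of a POST request can be checked without touching the network. This is a minimal sketch that prepares the three variants with requests.Request(...).prepare() and inspects where the values end up; nothing is sent.

```python
import requests

data = {'name': 'watson', 'age': 25}
url = 'http://httpbin.org/post'

# params: values go into the query string, the body stays empty
as_query = requests.Request('POST', url, params=data).prepare()
# data: values are form-encoded into the request body
as_form = requests.Request('POST', url, data=data).prepare()
# json: values are serialized as a JSON body
as_json = requests.Request('POST', url, json=data).prepare()

print(as_query.url, as_query.body)
print(as_form.url, as_form.body, as_form.headers['Content-Type'])
print(as_json.headers['Content-Type'])
```

With data=, httpbin would report the values under form; with json=, under json.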

3. Responses

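Sending a request naturally produces a response. In the examples above we used text and content to get the response body, but many other attributes and methods give access to further information such as the status code, the response headers, and the cookies. (This works the same for POST, GET, or any other request method; GET is used here as the example.)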

import requests 
r = requests.get('http://www.jianshu.com')
print(type(r.status_code),r.status_code) 
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies) 
print(type(r.url), r.url) 
print(type(r.history), r.history)

<class 'int'> 403
<class 'requests.structures.CaseInsensitiveDict'> {'Server': 'Tengine', 'Date': 'Thu, 13 Feb 2020 07:01:39 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]

As shown, the headers and cookies attributes are of type CaseInsensitiveDict and RequestsCookieJar respectively.
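
As a small illustration of what CaseInsensitiveDict means in practice, the sketch below builds one by hand (the header values are copied from the output above) and looks keys up in different letter cases.

```python
from requests.structures import CaseInsensitiveDict

# Header values taken from the response printed above
headers = CaseInsensitiveDict({'Content-Type': 'text/html', 'Server': 'Tengine'})

# Lookups ignore the case of the key
print(headers['content-type'])
print(headers.get('SERVER'))
print('Content-type' in headers)
```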
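The status code is commonly used to check whether a request succeeded, and requests provides a built-in status code lookup object, requests.codes. The example below also shows a request that fails unless it carries a request header.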

import requests 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'
}  
response = requests.get('https://www.jianshu.com')
if response.status_code != requests.codes.ok:
    print('Request failed, status code: %s' % response.status_code)
    response = requests.get('https://www.jianshu.com', headers=headers)  # retry with the request header
    if response.status_code != requests.codes.ok:
        print('Request failed, status code: %s' % response.status_code)
    else:
        print('Request Successfully', response.status_code)

Request failed, status code: 403
Request Successfully 200
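
Besides comparing status_code by hand, a Response also offers raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx codes. The sketch below exercises it offline on Response objects built by hand; the describe helper is a name invented for this example, and a real response would come from requests.get.

```python
import requests

def describe(status_code):
    """Hypothetical helper: report whether a status code counts as success."""
    response = requests.Response()  # built by hand so nothing is sent
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError as exc:
        return 'failed: %s' % exc
    return 'succeeded'

results = {code: describe(code) for code in (200, 403)}
for code, outcome in results.items():
    print(code, outcome)
```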

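Of course there are many status codes. The list below gives each code with its lookup names in requests.codes. There is no need to memorize the names: as long as you know the codes, comparing against the plain int in an if statement works just as well.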

Informational status codes
100 : continue
101 : switching_protocols
102 : processing
103 : checkpoint
122 : uri_too_long, request_uri_too_long
Success status codes
200 : ok, okay, all_ok, all_okay, all_good, '\o/', '✓'
201 : created
202 : accepted
203 : non_authoritative_info, non_authoritative_information
204 : no_content
205 : reset_content,reset
206 : partial_content,partial
207 : multi_status, multiple_status, multi_stati, multiple_stati
208 : already_reported
226 : im_used
Redirection status codes
300 : multiple_choices
301 : moved_permanently, moved , '\o-'
302 : found
303 : see_other, other
304 : not_modified
305 : use_proxy
306 : switch_proxy
307 : temporary_redirect, temporary_moved, temporary
308 : permanent_redirect, resume_incomplete, resume (the last two are marked for removal in version 3.0)
Client error status codes
400 : bad_request, bad
401 : unauthorized
402 : payment_required, payment
403 : forbidden
404 : not_found, '-o-'
405 : method_not_allowed, not_allowed
406 : not_acceptable
407 : proxy_authentication_required ,proxy_auth, proxy_authentication
408 : request_timeout, timeout
409 : conflict
410 : gone
411 : length_required
412 : precondition_failed, precondition
413 : request_entity_too_large
414 : request_uri_too_large
415 : unsupported_media_type, unsupported_media, media_type
416 : requested_range_not_satisfiable, requested_range, range_not_satisfiable
417 : expectation_failed
418 : im_a_teapot, teapot, i_am_a_teapot
421 : misdirected_request
422 : unprocessable_entity, unprocessable
423 : locked
424 : failed_dependency, dependency
425 : unordered_collection, unordered
426 : upgrade_required, upgrade
428 : precondition_required, precondition
429 : too_many_requests, too_many
431 : header_fields_too_large,fields_too_large
444 : no_response, none
449 : retry_with, retry
450 : blocked_by_windows_parental_controls, parental_controls
451 : unavailable_for_legal_reasons, legal_reasons
499 : client_closed_request
Server error status codes
500 : internal_server_error, server_error, '/o\', '✗'
501 : not_implemented
502 : bad_gateway
503 : service_unavailable, unavailable
504 : gateway_timeout
505 : http_version_not_supported, http_version
506 : variant_also_negotiates
507 : insufficient_storage
509 : bandwidth_limit_exceeded, bandwidth
510 : not_extended
511 : network_authentication_required, network_auth, network_authentication
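
The names in this list are attributes of requests.codes. A short sketch of looking a few of them up, with the standard library's http.HTTPStatus shown alongside for the human-readable phrase:

```python
import requests
from http import HTTPStatus

# Attribute access, upper-case aliases, and dictionary-style lookup all work
print(requests.codes.ok)
print(requests.codes.OK)
print(requests.codes.not_found)
print(requests.codes['teapot'])

# The standard library maps a code back to its reason phrase
print(HTTPStatus(404).phrase)
```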
