Python 3 Web Crawler Development in Practice: the requests library


gh0stf1re


Article link: https://blog.csdn.net/gh0stf1re/article/details/112101567


3.2 Using requests

Installing the requests library

pip install requests

Examples of the different request types with requests

import requests

# Each helper function sends one request with the corresponding HTTP method.
r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post')
r = requests.put('https://httpbin.org/put')
r = requests.delete('https://httpbin.org/delete')
r = requests.head('https://httpbin.org/get')
r = requests.options('https://httpbin.org/get')

GET requests

A basic example

import requests
r = requests.get('https://httpbin.org/get')
print(r.text)

Output:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5ff02a2a-388ab87d6304205b25d36d65"
  },
  "origin": "223.104.38.31",
  "url": "https://httpbin.org/get"
}

Suppose we want to add two parameters: name with the value germey, and age with the value 22.

r = requests.get('https://httpbin.org/get?name=germey&age=22')

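Writing the query string by hand works, but this kind of data is usually stored in a dictionary and passed through the params argument: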

import requests

# Query parameters are passed as a dictionary through params.
data = {'name': 'germey', 'age': 22}
r = requests.get('https://httpbin.org/get', params=data)
print(r.text)

Output:

{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5ff02b77-29a87dfd567163e00cec510c"
  },
  "origin": "223.104.38.31",
  "url": "https://httpbin.org/get?name=germey&age=22"
}

As the output shows, the request URL was built automatically as: https://httpbin.org/get?name=germey&age=22
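To check how a params dictionary becomes a query string without sending anything over the network, one option is to build a prepared request and inspect its URL. This is a minimal sketch that assumes requests is installed; build_url is a hypothetical helper name used only for illustration, not part of the library.

```python
import requests


def build_url(base_url, params):
    """Return the final URL requests would use, without sending the request."""
    # Request(...).prepare() encodes params into the query string locally.
    prepared = requests.Request('GET', base_url, params=params).prepare()
    return prepared.url


url = build_url('https://httpbin.org/get', {'name': 'germey', 'age': 22})
print(url)  # https://httpbin.org/get?name=germey&age=22
```

Since nothing is sent, this is a convenient way to debug parameter encoding before making the real call.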

Also note that r.text is an ordinary str, although its content here happens to be JSON. To parse the response directly into a dictionary, call the json() method.

import requests

r = requests.get('https://httpbin.org/get')
print(type(r.text))
print(r.json())
print(type(r.json()))

Output:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5ff02ec9-0cd8d4292288ec7d36865d76'}, 'origin': '223.104.38.31', 'url': 'https://httpbin.org/get'}
<class 'dict'>

As shown, calling json() converts a JSON-formatted response body into a dictionary.
Note: if the response body is not valid JSON, json() fails with a parsing error.
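Because parsing fails on a non-JSON body, it can help to guard the decoding step. The sketch below imitates that step offline with the standard library's json.loads on plain strings; parse_json_or_none is a hypothetical helper, not a requests function. With a real response, the same pattern would wrap r.json() in a try/except, assuming the error raised by the installed version of requests is a ValueError subclass.

```python
import json


def parse_json_or_none(text):
    """Return the decoded object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


ok = parse_json_or_none('{"args": {}, "url": "https://httpbin.org/get"}')
bad = parse_json_or_none('<html>not json</html>')
print(type(ok))  # <class 'dict'>
print(bad)       # None
```

Returning None is only one possible design; re-raising with a clearer message or logging the raw text may suit a crawler better.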

Scraping a web page

Take the Zhihu "Explore" page as an example:

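import re

import requests

# Send a browser-like User-Agent; the site may reject the default one.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
# re.S lets . match newlines, so the pattern can span several lines of HTML.
pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)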
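To see what the pattern captures without depending on the live page, it can be run against a small hand-written snippet. The HTML below is hypothetical: it only mimics the class names the pattern looks for, and the real Zhihu markup may differ or may have changed since the article was written.

```python
import re

# Hypothetical HTML fragment, written only to exercise the pattern.
html = '''
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">
First question title
</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">Second question title</a></h2>
</div>
'''

# Non-greedy groups stop at the first "question_link", the first ">" after it,
# and the first "</a>"; re.S lets . cross line breaks.
pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = [t.strip() for t in pattern.findall(html)]
print(titles)  # ['First question title', 'Second question title']
```

The strip() call removes the line breaks that surround a title when the link text sits on its own line.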
