3.2 Using requests
Installing the requests library:
pip install requests
Examples of sending various kinds of requests with requests:
r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post')
r = requests.put('https://httpbin.org/put')
r = requests.delete('https://httpbin.org/delete')
r = requests.head('https://httpbin.org/get')
r = requests.options('https://httpbin.org/get')
GET requests
A basic example:
import requests
r = requests.get('https://httpbin.org/get')
print(r.text)
The output:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5ff02a2a-388ab87d6304205b25d36d65"
  },
  "origin": "223.104.38.31",
  "url": "https://httpbin.org/get"
}
Suppose we want to attach two parameters: name with the value germey and age with the value 22. We can append them to the URL directly:
r = requests.get('https://httpbin.org/get?name=germey&age=22')
This works, but such data is usually stored in a dictionary and passed via the params argument:
import requests
data = {'name': 'germey', 'age': 22}
r = requests.get('https://httpbin.org/get', params=data)
print(r.text)
The output:
{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5ff02b77-29a87dfd567163e00cec510c"
  },
  "origin": "223.104.38.31",
  "url": "https://httpbin.org/get?name=germey&age=22"
}
As you can see, the request URL was automatically constructed as https://httpbin.org/get?name=germey&age=22.
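This automatic query-string construction can be reproduced offline with the standard library's urlencode, which is a useful way to sanity-check what params will produce without sending a request:

```python
from urllib.parse import urlencode

# The same dictionary passed as params above.
data = {'name': 'germey', 'age': 22}

# urlencode turns the dict into a query string; non-string values
# such as 22 are converted to their string form.
query = urlencode(data)
print(query)  # → name=germey&age=22

# Joining it onto the base URL gives the final request URL.
url = 'https://httpbin.org/get?' + query
print(url)    # → https://httpbin.org/get?name=germey&age=22
```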
In addition, the body of the response is actually of type str, but it is special: it is in JSON format. If you want to parse the result directly into a dictionary, you can call the json() method.
import requests
r = requests.get('https://httpbin.org/get')
print(type(r.text))
print(r.json())
print(type(r.json()))
The output:
<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5ff02ec9-0cd8d4292288ec7d36865d76'}, 'origin': '223.104.38.31', 'url': 'https://httpbin.org/get'}
<class 'dict'>
As you can see, calling the json() method converts a JSON-formatted string into a dictionary.
Note: if the response body is not valid JSON, json() raises a parsing error.
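The behavior behind json() can be sketched with the standard json module, which is essentially what requests uses to decode the body (a simplified illustration, not the library's exact internals):

```python
import json

# A JSON-formatted body, like r.text from httpbin, parses cleanly.
body = '{"args": {}, "origin": "223.104.38.31"}'
result = json.loads(body)
print(result['origin'])  # → 223.104.38.31

# A non-JSON body, such as an HTML page, raises json.JSONDecodeError.
try:
    json.loads('<html>not json</html>')
except json.JSONDecodeError:
    print('parse failed: body is not JSON')
```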
Scraping a web page
Take Zhihu's "Explore" page as an example:
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)
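The regular expression can be tried offline against a small hand-written snippet that mimics the page's markup (the HTML below is hypothetical, for illustration only), which makes the pattern's behavior easy to verify without a network request:

```python
import re

# Hypothetical HTML mimicking the Explore page structure: each item
# has an "explore-feed" container and a "question_link" anchor.
html = '''
<div class="explore-feed">
  <a class="question_link" href="/question/1">Why is the sky blue?</a>
</div>
<div class="explore-feed">
  <a class="question_link" href="/question/2">How do planes fly?</a>
</div>
'''

# re.S makes "." also match newlines, so the pattern can span lines;
# the non-greedy ".*?" stops each match at the nearest question_link.
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, html)
print(titles)  # → ['Why is the sky blue?', 'How do planes fly?']
```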