Web Scraping Series: Basic Usage of the requests Library

Taylor George

Tags: python, http, web

Article link: https://blog.csdn.net/qq_43350003/article/details/105298934

Collecting large amounts of resources from the web in bulk calls for web scraping. In this post I share some basic uses of Python's requests library.

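1. GET requests

A GET request simulates a user asking the server for a resource. Pass a URL to requests.get and you get back the content served at that address. For example:

import requests
response = requests.get("http://www.baidu.com")

This sends a request to Baidu and stores the server's reply in response.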
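
Query strings do not have to be assembled by hand. As a minimal sketch, the params argument below builds the final URL, and prepare() shows it without sending anything. The fetch helper, its timeout value, and the q and page parameters are my own illustrative assumptions, not part of the example above.

```python
import requests

# Hypothetical query parameters, chosen only for illustration.
params = {"q": "python", "page": 2}

# Build the request without sending it, so the final URL can be inspected.
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(prepared.url)  # http://httpbin.org/get?q=python&page=2


def fetch(url, params=None, timeout=10):
    """Send a GET request, failing fast on timeouts and HTTP error codes."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response
```

Setting a timeout is worth the extra argument in a scraper, since a request without one can wait on a slow server indefinitely.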

2. The text attribute

After the GET request, the page's source code is available through the .text attribute:

print(response.text)
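
If the printed page shows garbled characters, the encoding that requests guessed is probably wrong. Here is a small sketch of one way to handle this, assuming you already know the page is UTF-8. The helper name text_as is hypothetical.

```python
import requests


def text_as(response, encoding="utf-8"):
    """Return the body decoded with the given encoding instead of the guessed one."""
    response.encoding = encoding  # .text decodes the body using this attribute
    return response.text


# Usage (needs network access):
# response = requests.get("http://www.baidu.com")
# print(response.encoding)   # the encoding requests guessed from the headers
# print(text_as(response))   # the same body decoded as UTF-8
```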

3. Parsing JSON data

If the site returns JSON data, it can be parsed like this:

import requests
import json

response = requests.get("http://httpbin.org/get")
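print(type(response.text))        # <class 'str'>: the raw body as text
print(response.json())            # the body parsed into Python objects
print(json.loads(response.text))  # the same result, parsed by hand
print(type(response.json()))      # <class 'dict'>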
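
Parsing fails when the body is not valid JSON, for example when the server answers with an HTML error page. Below is a minimal sketch of a defensive parser that uses only the standard json module. The name parse_json and the choice to return None on failure are my own assumptions.

```python
import json


def parse_json(text):
    """Parse a JSON string, returning None instead of raising on invalid input."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(parse_json('{"origin": "1.2.3.4"}'))  # {'origin': '1.2.3.4'}
print(parse_json("<html>Bad Request</html>"))  # None
```

It would be called as parse_json(response.text) after a request.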

4. Fetching binary data

The .content attribute returns the response body as raw bytes. The same approach can be used to download images and videos.

# coding:utf-8
import requests

url = "https://f.video.weibocdn.com/000u12WRlx07x3ZnTM1a0104120tzAhl0E0b0.mp4?label=mp4_hd&template=852x480.23.0&trans_finger=7caa0928b1e195c2bbdc24653f70894b&Expires=1581329997&ssig=L5%2BrZI9cfH&KID=unistore,video"

r = requests.get(url)

f = open("D:\桌面\《美国工厂》.mp4", 'wb')
f.write(r.content)
f.close()
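
Reading .content loads the whole file into memory, which can be a problem for large videos. As a rough sketch, requests can also stream the body in chunks with stream=True and iter_content. The function names, chunk size, and timeout below are illustrative assumptions.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download_file(url, path, chunk_size=8192, timeout=30):
    """Stream the response body to disk instead of holding it all in memory."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), path)


# Usage (needs network access and a valid URL):
# download_file(url, "video.mp4")
```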

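5. Adding headers

Headers are the HTTP header fields sent along with a request, and we can set them freely. For example, requesting Zhihu directly with requests is rejected by default. In that case we need to customize the headers so that the request passes the site's checks and gets through.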

import requests

response = requests.get("https://www.zhihu.com")
print(response.text)

Output:
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>

Zhihu expects browser-like headers. Enter chrome://version in Google Chrome to see the browser's user agent, then add that user agent to the request headers.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36

With the browser's user agent in hand, add it to the request:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
}
response = requests.get("https://www.zhihu.com", headers=headers)

print(response.text)
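
To see why a site might reject the bare request, it helps to look at the User-Agent that requests sends by default. The sketch below prints it, then sets a browser-like value once on a Session so that every request made through that session carries it. Nothing is sent over the network here, since prepare_request only builds the request.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
)

# The value requests sends when no User-Agent is given.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Headers set on a Session apply to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

prepared = session.prepare_request(requests.Request("GET", "https://www.zhihu.com"))
print(prepared.headers["User-Agent"])
```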

6. POST requests

To send data with a POST request, pass a data argument. It can be built from a dictionary, which makes sending POST requests very convenient.

import requests

data = {
    "name":"Robin",
    "age":25
}
response = requests.post("http://httpbin.org/post",data=data)
print(response.text)
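
The data argument sends the dictionary as a form body. Some APIs expect a JSON body instead, and for that requests offers a json argument. Here is a small sketch comparing the two, built with prepare() so that nothing is actually sent:

```python
import json

import requests

data = {"name": "Robin", "age": 25}
url = "http://httpbin.org/post"

# data= encodes the dictionary as an HTML form body.
form = requests.Request("POST", url, data=data).prepare()
print(form.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form.body)                     # name=Robin&age=25

# json= serializes the dictionary as a JSON document.
as_json = requests.Request("POST", url, json=data).prepare()
print(as_json.headers["Content-Type"])  # application/json
print(json.loads(as_json.body))         # {'name': 'Robin', 'age': 25}
```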

7. Cookies

Getting cookies

import requests

response = requests.get("http://www.baidu.com")
print(response.cookies)

for key,value in response.cookies.items():
    print(key+"="+value)

One use of cookies is simulating a login and keeping a session alive.

import requests

s = requests.Session()  # create a Session object
s.get("http://httpbin.org/cookies/set/number/123456")  # request 1: the server sets a cookie
response = s.get("http://httpbin.org/cookies")  # request 2: same domain and same session, so the cookie is sent back
print(response.text)
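
To make the session behavior more concrete, here is a sketch that puts a cookie into a Session's cookie jar by hand and then inspects the Cookie header the session would send. Setting the cookie manually stands in for the server's Set-Cookie response and is my own simplification. No request leaves the machine.

```python
import requests

session = requests.Session()
# Stand-in for what http://httpbin.org/cookies/set/number/123456 would do.
session.cookies.set("number", "123456", domain="httpbin.org", path="/")

# Build (but do not send) a follow-up request through the same session.
request = requests.Request("GET", "http://httpbin.org/cookies")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))  # number=123456

# A request to a different domain does not receive the cookie.
other = session.prepare_request(requests.Request("GET", "http://example.com/"))
print(other.headers.get("Cookie"))  # None
```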

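Many sites are now served over HTTPS, which brings certificate verification into play. If verification fails, requests raises an SSL error.
Passing verify=False skips verification so the page can still be fetched, but requests then prints a warning: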

InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)

The warning can be silenced like this:

import requests
import urllib3

# Silence the InsecureRequestWarning triggered by verify=False.
urllib3.disable_warnings()
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)

Proxy settings

import requests

proxies = {
    "http": "http://127.0.0.1:9999",
    "https": "http://127.0.0.1:8888"
}
response = requests.get("https://www.baidu.com", proxies=proxies)
print(response.text)
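
The keys of the proxies dictionary are matched against the scheme of the URL being requested. To illustrate, the sketch below calls requests.utils.select_proxy, which as far as I can tell is the helper requests itself uses to pick a proxy for a URL, to show which entry applies. The proxy addresses are the placeholder values from the example above, and nothing is sent.

```python
import requests

proxies = {
    "http": "http://127.0.0.1:9999",
    "https": "http://127.0.0.1:8888",
}

# An https:// URL is routed through the "https" entry...
https_proxy = requests.utils.select_proxy("https://www.baidu.com", proxies)
print(https_proxy)  # http://127.0.0.1:8888

# ...and an http:// URL through the "http" entry.
http_proxy = requests.utils.select_proxy("http://www.baidu.com", proxies)
print(http_proxy)  # http://127.0.0.1:9999
```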
