Web Scraping 04: Basic Usage of the requests Library

闪闪发亮的小星星

Article link: https://blog.csdn.net/weixin_39107270/article/details/115065300

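Basic Usage of the Requests Library

requests is friendlier and more convenient to use than the urllib library.

Installation and documentation

pip install requests

Sending a GET request:

The simplest way to send a GET request is to call requests.get: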

import requests
response=requests.get("https://www.baidu.com/")

Adding headers and query parameters

Searching for 中国 (China) on Baidu opens the following URL:
https://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD&rsv_spt=1&rsv_iqid=0xfbe16028000f25a9&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&rsv_dl=tb&oq=%25E4%25B8%25AD%25E5%259B%25BD&rsv_btype=t&rsv_t=9f9ej3wG2HUwNdkPBr9kCOlSFrx3%2FQvKUItTsPvIwDtADGhmIMt7CHVHbtX3mU%2B0aeMT&rsv_pq=e7b8a7f6000366bb&prefixsug=%25E4%25B8%25AD%25E5%259B%25BD&rsp=4
In the query string after the ?, the wd value %E4%B8%AD%E5%9B%BD is the URL-encoded form of 中国.
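To make the encoding concrete, here is a small sketch that uses only the standard library's urllib.parse (not requests itself). It suggests that %E4%B8%AD%E5%9B%BD is the UTF-8 percent-encoding of 中国, and that values such as %25E4%25B8%25AD%25E5%259B%25BD in the URL above are the same string encoded a second time.

```python
from urllib.parse import quote, unquote

keyword = "中国"

# Percent-encode the keyword as UTF-8 bytes, the form used in the wd parameter.
encoded = quote(keyword)
print(encoded)

# Decoding reverses the process and recovers the original text.
print(unquote(encoded))

# Encoding an already-encoded value escapes each "%" as "%25",
# which matches the oq and prefixsug parameters in the URL above.
double_encoded = quote(encoded)
print(double_encoded)
```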

import requests

kw = {'wd':'中国'} # query parameters

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get("http://www.baidu.com/s", params = kw, headers = headers)

# View the response body; response.text returns the decoded text (str)
#print(response.text)

# View the response body; response.content returns the raw bytes
#print(response.content)

# View the full URL
print(response.url)

# View the character encoding requests uses for decoding (taken from the response headers)
print(response.encoding)

# View the status code
print(response.status_code)

http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
utf-8
200
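The encoded URL can also be checked before anything is sent. The sketch below builds the same request with requests.Request and prepare(), which needs no network access. The shortened User-Agent string is a placeholder, not the value used in the article.

```python
import requests

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    "GET",
    "http://www.baidu.com/s",
    params={"wd": "中国"},
    headers={"User-Agent": "Mozilla/5.0"},  # shortened placeholder User-Agent
)
prepared = request.prepare()

# The dict passed as params has been URL-encoded into the query string.
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Inspecting the prepared request this way is handy when a site rejects a request and it is unclear what was actually sent.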

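The difference between response.text and response.content:

response.content: the data exactly as fetched from the network, without any decoding, so its type is bytes. Strings stored on disk or sent over the network are in fact always bytes.
response.text: this is of type str, the string that requests obtains by decoding response.content. Decoding requires an encoding, and requests chooses one by its own guess, so the guess is sometimes wrong and the decoded text comes out garbled. In that case, decode manually with response.content.decode('utf-8').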
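The effect of a wrong guess can be reproduced without any network call. In this minimal sketch, the bytes object raw stands in for response.content, and ISO-8859-1 is used as an example of a wrongly guessed encoding (an assumption chosen for illustration).

```python
# Bytes as they might arrive from the network (a stand-in for response.content).
raw = "中国".encode("utf-8")

# Decoding with the wrong encoding does not fail, it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the encoding the page really uses restores the text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

With a real response, setting response.encoding = 'utf-8' before reading response.text should have the same effect as decoding response.content by hand.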

Saving the scraped result locally

with open('baidu.html','w',encoding='utf-8') as fp:
    fp.write(response.content.decode('utf-8'))
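An alternative that avoids guessing the encoding at all is to write the raw bytes in binary mode. The sketch below is one way to do that; save_bytes is a hypothetical helper, and page_bytes is a stand-in for response.content.

```python
import os
import tempfile


def save_bytes(path, content):
    """Write raw bytes to path without decoding them first."""
    with open(path, "wb") as fp:
        fp.write(content)


# Stand-in for response.content; a real call would pass the downloaded bytes.
page_bytes = "<html><body>中国</body></html>".encode("utf-8")

target = os.path.join(tempfile.mkdtemp(), "baidu.html")
save_bytes(target, page_bytes)

with open(target, "rb") as fp:
    saved = fp.read()
print(len(saved))
```

Because nothing is decoded, the file on disk is byte-for-byte what the server sent, which also works for images and other binary content.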

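Sending a POST request

Sending a POST request is just as simple: call the requests.post method.
If the response is JSON, call response.json() to convert the JSON string into a dict or a list.

The most basic POST request uses the post method:

response = requests.post("https://www.baidu.com/", data=data)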
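To see what requests does with a dict passed as data, the request can be prepared without being sent. This is only a sketch: the example.com endpoint is hypothetical, and the form fields are sample values.

```python
import requests

# Hypothetical endpoint and form fields, used only to inspect the encoding.
request = requests.Request(
    "POST",
    "https://example.com/search",
    data={"first": "true", "pn": "1", "kd": "python"},
)
prepared = request.prepare()

# The dict is form-encoded into the request body automatically.
print(prepared.body)
print(prepared.headers["Content-Type"])
```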

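Passing data

There is no need to encode the data with urlencode here; just pass in a dict. For example, to request data from Lagou (拉勾网):
As an anti-scraping measure, Lagou changed the request method from GET to POST.

When searching for python jobs on Lagou, the job information is loaded through an AJAX request.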

# encoding: utf-8
import requests
data = {
    'first':"true",
    'pn': '1',
    'kd': 'python'
}
headers={
    'Referer':'https://www.lagou.com/jobs/list_python/p-city_3?&cl=false&fromSearch=true&labelWords=&suginput=',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
response=requests.post('https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false',data=data,headers=headers)
print(response.text) # str
print(response.json()) # parse the JSON body into a dict

{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"180.163.178.34","state":2402}

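The response says "you are operating too frequently, please try again later", which most likely means the request was identified as coming from a crawler.
https://blog.csdn.net/weixin_40576010/article/details/88336980
If the request succeeds, the response is JSON, which can be pasted into json.cn to view it formatted.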
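The JSON can also be checked and formatted locally. The sketch below parses the throttling response shown above with the standard json module. looks_throttled is a hypothetical helper, and treating "status": false as a refusal is an assumption based on that single sample, not documented Lagou behavior.

```python
import json

# The throttling response shown above, used here as sample input.
body = '{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"180.163.178.34","state":2402}'

payload = json.loads(body)  # response.json() performs the equivalent parsing


def looks_throttled(data):
    """Heuristic (an assumption): a false status means the request was refused."""
    return isinstance(data, dict) and data.get("status") is False


if looks_throttled(payload):
    print("request refused:", payload.get("msg"))

# Pretty-print locally; ensure_ascii=False keeps Chinese text readable.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```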

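Using a proxy IP with requests

Proxy servers usually change over time; if an IP gets identified as a crawler, switch to another one.

Without a proxy: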

import requests
response = requests.get("http://httpbin.org/ip")
print(response.text)

{
  "origin": "180.167.145.106"
}

With a proxy: Kuaidaili (快代理, a free proxy provider)

import requests

proxy = {
    'http': '115.221.240.139:9999'
}

response = requests.get("http://httpbin.org/ip",proxies=proxy)
print(response.text)

{
  "origin": "115.221.240.139"
}
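Since a proxy may stop working or get blocked, one possible approach is to keep several addresses and try them in turn. The following is a rough sketch under that assumption. build_proxies and fetch_with_rotation are hypothetical helpers, and the addresses are placeholders that are not expected to work.

```python
import itertools

import requests


def build_proxies(address):
    """Return a proxies dict that routes both http and https through address."""
    proxy_url = "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_rotation(url, addresses, max_tries=3, timeout=5):
    """Try up to max_tries proxies in turn and return the first response."""
    pool = itertools.cycle(addresses)
    last_error = None
    for _ in range(max_tries):
        proxies = build_proxies(next(pool))
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.RequestException as error:
            last_error = error  # this proxy failed, move on to the next one
    raise RuntimeError("all proxies failed") from last_error


# Placeholder addresses; free proxies expire quickly, so these will not work.
candidates = ["115.221.240.139:9999", "10.0.0.2:8080"]
print(build_proxies(candidates[0]))
```

A call such as fetch_with_rotation("http://httpbin.org/ip", candidates) would then go through the pool. Note that this sketch only reacts to connection errors; detecting a "blocked" page in a successful response would need an extra check.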

Handling cookies with requests

import requests
response = requests.get('https://www.baidu.com/')
print(response.cookies)

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

Getting the cookie details:

print(response.cookies.get_dict())

{'BDORZ': '27315'}
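The same conversion can be tried offline by filling a cookie jar by hand. This is a small sketch; the cookie values simply mirror the output above, and the jar is built manually rather than taken from a real response.

```python
import requests
from requests.cookies import RequestsCookieJar

# Build a jar by hand so the example runs without a network connection.
jar = RequestsCookieJar()
jar.set("BDORZ", "27315", domain=".baidu.com", path="/")

# Same conversion as response.cookies.get_dict() in the text above.
as_dict = jar.get_dict()
print(as_dict)

# requests.utils offers the reverse conversion, from a dict back to a jar.
rebuilt = requests.utils.cookiejar_from_dict({"BDORZ": "27315"})
print(rebuilt.get("BDORZ"))
```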

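Session

With the urllib library, an opener could be used to send multiple requests that share cookies. To share cookies across requests with requests, use its Session object.
A session maintains one conversation: once cookies are set, they persist for later requests.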

import requests
url = "http://www.renren.com/PLogin.do"
data = {"email":"970138074@qq.com",'password':"pythonspider"}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

session = requests.Session()

session.post(url,data=data,headers=headers) 

response = session.get('http://www.renren.com/880151247/profile') # the session keeps the cookies from the login above and sends them with this request
with open('renren.html','w',encoding='utf-8') as fp:
    fp.write(response.text)
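How a session carries state forward can be observed without logging in anywhere. In this sketch the cookie is set by hand to simulate what a login response would do. The example.com URL, the sid cookie, and the User-Agent value are all placeholders.

```python
import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it.
session.headers.update({"User-Agent": "demo-agent/0.1"})  # placeholder value

# Simulate a cookie that a login response would normally set.
session.cookies.set("sid", "abc123", domain="example.com", path="/")

# Prepare (but do not send) a follow-up request to see what would go out.
request = requests.Request("GET", "http://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

Setting the User-Agent once on session.headers also removes the need to pass headers to every call, as the login example does.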

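Handling untrusted SSL certificates

When a site's SSL certificate is not trusted (for example, it is invalid or self-signed), set verify=False to skip certificate verification:

resp = requests.get('https://www.12306.cn/mormhweb/', verify=False)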
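When several requests go to the same untrusted site, the setting can be applied once on a session. The sketch below is one possible arrangement, not a recommendation. make_unverified_session is a hypothetical helper, and silencing the warning assumes the risk of skipping verification has been accepted deliberately.

```python
import requests
import urllib3


def make_unverified_session():
    """Return a session that skips certificate verification for every request."""
    session = requests.Session()
    session.verify = False
    # Skipping verification triggers an InsecureRequestWarning; silence it
    # here only because the risk has been accepted on purpose.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return session


session = make_unverified_session()
print(session.verify)
```

If the site's CA certificate is available, passing its file path as verify (for example a hypothetical "/path/to/ca-bundle.pem") keeps verification on and is generally the safer choice.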
