requests

qq_187352634

Last modified: 2023-11-21 15:17:21

Category: spider. Tags: python, requests, web crawler

First published: 2023-11-14 15:57:56

Article link: https://blog.csdn.net/qq_37755459/article/details/134398599

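Checking what may be crawled
1. Request methods
2. Return value / the response object
- The headers parameter
- The params parameter
3. Parameters of requests.get() / requests.post()
4. requests.session()
5. Hotlink protection
6. Proxies
- Setting a proxy manually
- Rotating a batch of proxies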

Official documentation: https://requests.readthedocs.io/en/latest/api/

Checking what may be crawled

Append /robots.txt to the site's domain to see its crawling rules, e.g. www.baidu.com/robots.txt
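
As a rough illustration, the same check can be done in code with Python's standard-library urllib.robotparser. The rules and the example.com URLs below are made up for the demonstration; for a real site the parser would instead load the site's own robots.txt (for example with set_url() and read()).

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, not taken from any real site
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://example.com/index.html"))
print(is_allowed(sample_rules, "https://example.com/private/data"))
```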

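1. Request methods

No.	Method	Description
1	requests.get()	Sends a GET request
2	requests.post()	Sends a POST request
3	requests.head()	Sends a HEAD request (fetches only the response headers)
4	requests.put()	Sends a PUT request
5	requests.patch()	Sends a PATCH request (partial modification)
6	requests.delete()	Sends a DELETE request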

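2. Return value / the response object

response = requests.get(url)  # or requests.post(url)

No.	Attribute	Description
1	response.status_code	HTTP status code; 200 means success
2	response.content	Response body as bytes (binary data)
3	response.text	Response body decoded to a string
4	response.encoding	Get or set the encoding used to decode response.text
5	response.cookies	Cookies returned by the server
6	response.url	URL that was requested
7	response.json()	Decodes a JSON response body with the built-in JSON decoder
8	response.headers	Response HTTP headers as a case-insensitive, dictionary-like object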
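
The attributes in the table can be tried without touching a real website. The sketch below is only an illustration: it starts a throwaway HTTP server on 127.0.0.1 with the standard library (DemoHandler and fetch_demo are hypothetical names, not part of requests), sends one GET request to it, and collects the response attributes.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local handler that always answers with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"message": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_demo():
    """Start the local server, request it once, and return the response attributes."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/demo"
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy environment variables for this local call
            response = session.get(url, timeout=5)
        return {
            "status_code": response.status_code,
            "content": response.content,      # bytes
            "text": response.text,            # str
            "encoding": response.encoding,
            "json": response.json(),          # parsed JSON
            "content_type": response.headers["content-type"],  # case-insensitive lookup
            "url": response.url,
        }
    finally:
        server.shutdown()
        server.server_close()


info = fetch_demo()
for name, value in info.items():
    print(name, "=", repr(value))
```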

The headers parameter

User-Agent: identifies the browser
Cookie: carries the user's login information

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    'Cookie':'guidesStatus=off; highContrastMode=defaltMode; cursorStatus=off'
}

The params parameter

The query parameters can be found in the browser's developer tools under Network > Payload.

params = {
    "leftTicketDTO.train_date": input_date,
    "leftTicketDTO.from_station": station_depart,
    "leftTicketDTO.to_station": station_destination,
    "purpose_codes": "ADULT"
}

response = requests.get(url=url, params=params, headers=headers)  # params are sent as the URL query string
response.encoding = response.apparent_encoding  # guess the encoding from the body to avoid garbled text
response.text   # response body as text
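
To see what params and headers actually do, a request can be built and inspected without being sent. This is a minimal sketch using requests.Request and its prepare() method; the example.com endpoint and the parameter values are placeholders, not the real service the snippet above targets.

```python
import requests

# Hypothetical endpoint and values, used only to show how the URL is built
url = "https://example.com/query"
params = {
    "leftTicketDTO.train_date": "2023-11-21",
    "purpose_codes": "ADULT",
}
headers = {"User-Agent": "Mozilla/5.0"}

# Build the request without sending it, so the final URL and headers can be inspected
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()

print(prepared.url)                    # params appear after the "?" in the URL
print(prepared.headers["User-Agent"])  # headers are attached unchanged
```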

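3. Parameters of requests.get() / requests.post()

requests.get/post(url, params, data, headers, timeout, verify, allow_redirects, cookies)

url - URL for the new Request object.
params - (optional) Dictionary, list of tuples, or bytes to send in the query string of the request. Used for GET parameters.
data - (optional) Dictionary, list of tuples, bytes, or file-like object to send in the body of the request. Used for POST parameters.
headers - (optional) Dictionary of HTTP headers to send with the request.
timeout (float or tuple) - (optional) How many seconds to wait for the server to send data before giving up, as a float or a (connect timeout, read timeout) tuple.
verify - (optional) Either a boolean, in which case it controls whether the server's TLS certificate is verified, or a string, in which case it must be the path to a CA bundle to use. Defaults to True.
allow_redirects (bool) - (optional) Boolean. Enables or disables GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to True.
cookies - (optional) Dict or CookieJar object to send with the request.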
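
The difference between params and data is easiest to see on a prepared request that is never sent. The sketch below assumes a made-up login URL and field values; it only shows where each argument ends up.

```python
import requests

# Hypothetical URL and field values for illustration
prepared = requests.Request(
    "POST",
    "https://example.com/login",
    params={"lang": "en"},        # goes into the URL query string
    data={"loginName": "alice"},  # goes into the form-encoded request body
).prepare()

print(prepared.url)
print(prepared.body)
print(prepared.headers["Content-Type"])
```

The remaining arguments (timeout, verify, allow_redirects) affect how the request is sent rather than what it contains, so they do not show up on the prepared request.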

4. requests.session()

import requests
# Create a session
session = requests.session()
data = {
    "loginName":"codeId",
    "password":"01298038"
}
# 1. Log in. POST sends its parameters via data; GET uses params
url = "xxxxxxxxxxx"
response = session.post(url, data=data)
# Cookies set by the login response
print(response.cookies)
print(response.text)  # text data
# 2. Fetch data that requires login, such as product information or a list of saved books
response2 = session.get('https://xxxxxxxxxxxxx')
response2.content  # raw bytes
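
What makes the second request work is that the session remembers cookies and default headers between calls. The following sketch shows this without any network access: the cookie is set by hand to stand in for a login response, and the example.com domain, cookie name, and User-Agent string are all assumptions made for the demonstration.

```python
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({"User-Agent": "demo-crawler/0.1"})
# Simulate a cookie that a login response would normally set (hypothetical values)
session.cookies.set("token", "abc123", domain="example.com", path="/")

# Prepare, but do not send, a later request to the same site
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```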

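5. Hotlink protection

Referer is the address of the page that the current request came from; servers check it to verify where a request originated.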

import requests

url = 'https://www.pearvideo.com/video_1789162'
replace_code = url.split('_')[1]
# XHR address requested asynchronously after the page url is opened
xhr = 'https://www.pearvideo.com/videoStatus.jsp?contId=1789162&mrd=0.07696565089660035'
# Real video address: https://video.pearvideo.com/mp4/short/20231113/cont-1789162-16008074-hd.mp4
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Safari/537.36",
    # Hotlink protection: Referer is the address of the page the request came from
    "Referer":url
}
response = requests.get(xhr, headers=headers)
# print(response.json())
response_dic = response.json()
srcUrl = response_dic['videoInfo']['videos']['srcUrl']
systemTime = response_dic['systemTime']
video_ip = srcUrl.replace(systemTime, f'cont-{replace_code}')
print(video_ip)
# download and save video
with open(systemTime+".mp4","wb") as f:
    f.write(requests.get(video_ip).content)
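
The URL rewriting step in the script above can be isolated and checked offline. In this sketch, build_video_url is a hypothetical helper and the systemTime value in sample is invented; only the shape of the JSON and the real video address come from the script.

```python
def build_video_url(src_url, system_time, page_url):
    """Swap the timestamp segment of src_url for 'cont-<id>' taken from page_url."""
    content_id = page_url.split("_")[1]
    return src_url.replace(system_time, f"cont-{content_id}")


# Made-up response fields shaped like the JSON returned by the XHR request
sample = {
    "systemTime": "1699866000000",
    "videoInfo": {
        "videos": {
            "srcUrl": "https://video.pearvideo.com/mp4/short/20231113/1699866000000-16008074-hd.mp4"
        }
    },
}

video_url = build_video_url(
    sample["videoInfo"]["videos"]["srcUrl"],
    sample["systemTime"],
    "https://www.pearvideo.com/video_1789162",
)
print(video_url)
```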

6. Proxies

Send requests through a third-party IP address.
Proxy provider: 快代理 (Kuaidaili)

Setting a proxy manually

import requests
url = 'https://www.baidu.com'
proxies = {
    "http": "http://120.39.55.44:18426",
    "https": "https://123.180.209.4:20660",
}
response = requests.get(url, proxies=proxies)
response.encoding = response.apparent_encoding
print(response.text)
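
The keys of the proxies dictionary are matched against the scheme of the target URL. As a small offline illustration, the helper requests.utils.select_proxy returns the proxy that would be chosen for a given URL; the proxy addresses below are placeholders from a documentation-only IP range, not working proxies.

```python
import requests
from requests.utils import select_proxy

# Placeholder proxy addresses (not real, working proxies)
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.20:8080",
}

# The key that matches the URL's scheme decides which proxy is used
http_choice = select_proxy("http://example.com/page", proxies)
https_choice = select_proxy("https://example.com/page", proxies)

print(http_choice)
print(https_choice)
```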

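Rotating a batch of proxies

import requests

# Fetch a batch of proxy addresses from the proxy provider
def get_ips():
    url = 'http:xxxxxxxxx'
    response = requests.get(url)
    ip_s = response.json()
    for item in ip_s['data']['proxy_list']:
        # yield hands out one address at a time; the next call to next() resumes here
        yield item
# Crawl function
def scramble():
    url = 'https:xxxxxx'
    while True:
        try:
            ip_of_proxy = next(generator_ip)
        except StopIteration:
            # The batch is used up; stop instead of looping forever
            print("No proxies left")
            return None
        proxies = {
            "http": "http://" + ip_of_proxy,
            "https": "http://" + ip_of_proxy,
        }
        try:
            # timeout keeps a dead proxy from blocking the loop indefinitely
            response = requests.get(url, proxies=proxies, timeout=10)
            response.encoding = response.apparent_encoding
            return response.text
        except requests.RequestException:
            # This proxy failed; try the next one
            print("Request failed, trying the next proxy")
if __name__ == '__main__':
    generator_ip = get_ips()  # generator
    for i in range(10):
        scramble()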
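
The script above consumes each proxy once. If the batch should instead be reused in a loop, one possible variation is to rotate through it with itertools.cycle. This is a sketch of that alternative, not the approach used above; proxy_rotation is a hypothetical helper and the addresses are placeholders.

```python
from itertools import cycle


def proxy_rotation(addresses):
    """Yield a requests-style proxies dict for each address, looping forever."""
    for address in cycle(addresses):
        yield {"http": f"http://{address}", "https": f"http://{address}"}


# Placeholder "ip:port" strings, shaped like the items the proxy provider returns
rotation = proxy_rotation(["203.0.113.10:8080", "203.0.113.20:8080"])

first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first address again
print(first, second, third, sep="\n")
```

Each dictionary can be passed directly as the proxies argument of requests.get().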