Requests for Web Scraping (Tips)

Latest recommended article published on 2024-04-22 12:03:40

Small-J

Category: Python Web Scraping

Article link: https://blog.csdn.net/qq_37662827/article/details/103922658

Web Scraping Tips

Converting cookie objects to and from dictionaries
requests.utils.dict_from_cookiejar
Converts a CookieJar object into a Python dictionary.

requests.utils.cookiejar_from_dict
Builds a CookieJar from a Python dictionary.

import requests


def main():
    url = 'https://www.csdn.net/'
    response = requests.get(url)
    print(response.cookies)    # a requests.cookies.RequestsCookieJar object

    # Convert the cookie jar into a plain Python dict
    print(requests.utils.dict_from_cookiejar(response.cookies))    # prints a dict

    # Build a cookie jar from a dict
    data = {'dc_session_id': '10_1578585500018.380464', 'uuid_tt_dd': '10_30291552180-1578585500018-204559', 'acw_tc': '2760829e15785855000142693e932dc23200dbf1d6a5e569f5a0bacfe97297'}
    print(requests.utils.cookiejar_from_dict(data))


if __name__ == '__main__':
    main()
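The two helpers can also be tried without any network access. The sketch below is only an illustration: the cookie names and values are made up, and loading the jar into a Session is one possible way to reuse cookies on later requests.

```python
import requests

# Made-up cookie values, used only for illustration
cookie_dict = {'session_id': 'abc123', 'theme': 'dark'}

# dict -> cookie jar
jar = requests.utils.cookiejar_from_dict(cookie_dict)

# Load the jar into a session so later requests made with it send these cookies
session = requests.Session()
session.cookies.update(jar)

# cookie jar -> dict
round_trip = requests.utils.dict_from_cookiejar(session.cookies)
print(round_trip)
```

Requests made through `session` would then carry these cookies automatically.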

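SSL certificate verification
Usage: pass verify=False to skip certificate verification. By default verify is True.

import requests

# Request a site while skipping SSL certificate verification
url = 'https://www.12306.cn/index/'
response = requests.get(url, verify=False)   # verify controls certificate checking; the default is True
print(response.text)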
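With verify=False, each request normally emits an InsecureRequestWarning. A minimal sketch of one way to handle this is shown below, assuming you really do want verification off (for example against a test server). The helper name make_insecure_session is hypothetical.

```python
import requests
import urllib3


def make_insecure_session():
    """Return a Session that skips certificate verification (hypothetical helper)."""
    # Silence the warning urllib3 emits for unverified HTTPS requests
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    session = requests.Session()
    session.verify = False   # applies to every request made through this session
    return session


session = make_insecure_session()
print(session.verify)
```

When the site uses a private certificate authority, passing the path of a CA bundle file as verify is usually a safer choice than turning verification off.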

Setting a timeout

import requests

# Set a timeout on the request
url = 'https://chrome.google.com/'
response = requests.get(url, timeout=10)  # raise a Timeout exception if the server does not respond within 10 seconds
print(response.text)
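A timeout surfaces as an exception, so a crawler usually wants to catch it. The sketch below is one possible wrapper, not the only way to do it. The function name fetch_text and the default timeout values are assumptions made for illustration. The timeout argument also accepts a (connect, read) tuple.

```python
import requests


def fetch_text(url, connect_timeout=3.05, read_timeout=10):
    """Return the response body, or None if the request timed out (hypothetical helper)."""
    try:
        # First value: seconds allowed to establish the connection
        # Second value: seconds allowed to wait for the server to send data
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.Timeout:
        return None
    return response.text
```

Calling fetch_text('https://chrome.google.com/') would then return None instead of raising when the site cannot be reached in time.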

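A timeout can also be combined with a proxy IP.
If the site cannot be reached within ten seconds, the request times out and an exception is raised.

import random
import requests

# Use a proxy IP together with a timeout
url = 'https://chrome.google.com/'
# A dict cannot hold the same key twice, so keep the candidate proxies in a list.
# The http:// prefix is an assumption: both entries are treated as plain HTTP proxies.
proxy_pool = ['http://114.239.144.185:808', 'http://163.204.240.172:9999']
proxies = {'https': random.choice(proxy_pool)}   # keys are lowercase URL schemes
response = requests.get(url, proxies=proxies, timeout=10)  # prepare a pool of proxies and combine it with the timeout
html = response.text
print(html)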
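Free proxies fail often, so one option is to try each proxy in turn until one answers within the timeout. The sketch below assumes this fallback strategy. The function name get_via_proxies is hypothetical, and the proxy addresses passed to it are up to the caller.

```python
import requests


def get_via_proxies(url, proxy_urls, timeout=10):
    """Try each proxy in order; return the first response, or None if all fail."""
    for proxy_url in proxy_urls:
        # Route both http and https traffic through the same proxy
        proxies = {'http': proxy_url, 'https': proxy_url}
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.exceptions.RequestException:
            # Covers timeouts, proxy errors and connection errors: move on to the next proxy
            continue
    return None
```

A call such as get_via_proxies(url, ['http://114.239.144.185:808', 'http://163.204.240.172:9999']) would return a response from the first proxy that works.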

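Checking the status code to confirm the request succeeded
assert: Python has an assert statement, which can be combined with a crawler's network requests.

assert response.status_code == 200

assert: when the status code is 200 the statement passes silently and produces no output; otherwise it raises an AssertionError.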
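Note that assert statements are skipped when Python runs with the -O flag. As an alternative, Requests provides response.raise_for_status(), which raises an HTTPError for 4xx and 5xx responses. It is not identical to checking for exactly 200, since other non-error codes also pass. The sketch below builds Response objects by hand purely so it runs offline. The helper name request_succeeded is hypothetical.

```python
import requests


def request_succeeded(response):
    """Return True unless the response carries a 4xx or 5xx status code."""
    try:
        response.raise_for_status()   # raises HTTPError for 4xx and 5xx status codes
    except requests.exceptions.HTTPError:
        return False
    return True


# Hand-built responses, for illustration only; real ones come from requests.get()
ok = requests.Response()
ok.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(request_succeeded(ok), request_succeeded(missing))
```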

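retrying
Installation: pip install retrying
Use the retry decorator from retrying.
stop_max_attempt_number: this parameter sets the maximum number of attempts. If every attempt fails, an exception is raised.
Why use it? When a request fails, we cannot immediately tell whether the network or our own code is at fault. Retrying a few times rules out a transient network problem, and it combines well with a proxy IP pool.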

import requests
from retrying import retry


@retry(stop_max_attempt_number=3)
def get_url(url):
    """Fetch the page, making at most 3 attempts."""
    print('get_url was called')   # printed once per attempt
    response = requests.get(url, timeout=5)
    assert response.status_code == 200
    return response.text


def get_exception(url):
    """Return the page text, or None if every attempt failed."""
    try:
        html_str = get_url(url)
    except Exception:
        html_str = None

    return html_str


if __name__ == '__main__':
    url = 'https://wwws.csdn.net/'
    print(get_exception(url))
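To make the idea of a maximum attempt count concrete, here is a minimal standard-library sketch of a retry decorator. It is not how retrying is implemented. The names simple_retry, max_attempts and flaky are made up for this illustration.

```python
import functools


def simple_retry(max_attempts=3):
    """Call the wrapped function up to max_attempts times (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Out of attempts: let the last exception reach the caller
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator


calls = []


@simple_retry(max_attempts=3)
def flaky():
    """Fail twice, then succeed, to simulate a temporary network problem."""
    calls.append(1)
    if len(calls) < 3:
        raise ValueError('temporary failure')
    return 'ok'


result = flaky()
print(result, len(calls))
```

The function succeeds on its third call, so the caller never sees the two earlier failures.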

Scraped data: JSON
Data extraction
What is data extraction?
Simply put, data extraction is the process of pulling the data we want out of a response.

Data categories
Unstructured data: HTML
How to process it: XPath, regular expressions, BeautifulSoup4
Structured data: JSON, XML
How to process it: convert it to Python data types

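Extracting data from JSON
Because converting JSON into Python built-in data types is simple, a crawler should prefer a URL that returns JSON whenever one can be found.

JSON is a lightweight data-interchange format that is easy for people to read and write, and easy for machines to parse and generate. It suits data-exchange scenarios such as communication between a website's front end and back end.

Note: strings in JSON, including object keys, must be written with double quotes.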
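The double-quote rule is easy to check with the standard json module. In this sketch the sample strings are made up: the first is valid JSON, the second uses single quotes and is rejected.

```python
import json

valid = '{"name": "Small-J", "count": 24}'      # double quotes: valid JSON
invalid = "{'name': 'Small-J', 'count': 24}"    # single quotes: not valid JSON

parsed = json.loads(valid)

try:
    json.loads(invalid)
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False

print(parsed, single_quotes_accepted)
```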

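json.loads(): converts a JSON string into Python data types
json.dumps(): converts Python data types into a JSON string
json.load(): reads JSON from a file-like object and converts it into Python data types
json.dump(): writes Python data types as JSON to a file-like object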
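The four functions can be compared side by side. This sketch uses an in-memory io.StringIO as the file-like object so that nothing is written to disk. The sample record is made up.

```python
import io
import json

record = {'title': 'Requests tips', 'tags': ['python', 'crawler'], 'views': 428}

# String-based pair: dumps() produces a str, loads() parses it back
text = json.dumps(record, ensure_ascii=False)
from_string = json.loads(text)

# File-based pair: dump() writes to a file-like object, load() reads from one
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)                 # rewind before reading
from_file = json.load(buffer)

print(type(text).__name__, from_string == record, from_file == record)
```

With a real file, open('data.json', 'w', encoding='utf-8') would take the place of the StringIO buffer.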

Juejin example

import requests

headers = {
    'Connection': 'keep-alive',
    'X-Legacy-Device-Id': '',
    'Origin': 'https://juejin.im',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'X-Legacy-Token': '',
    'Content-Type': 'application/json',
    'X-Legacy-Uid': '',
    'X-Agent': 'Juejin/Web',
    'Accept': '*/*',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://juejin.im/user/5dc143bd6fb9a04a7b29ccaf',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6',
}

data = '{"operationName":"","query":"","variables":{"ownerId":"5dc143bd6fb9a04a7b29ccaf","size":20,"after":"1572946941301"},"extensions":{"query":{"id":"b158d18c7ce74f0d6d85e73f21e17df6"}}}'

response = requests.post('https://web-api.juejin.im/query', headers=headers, data=data)
print(response.json())
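Instead of writing the request body as a JSON string by hand, the payload can be kept as a Python dict and passed through the json= parameter, which serializes it and sets the Content-Type header. The sketch below only prepares the request without sending it, so it runs offline. Whether the endpoint still responds is not something this sketch checks.

```python
import json
import requests

# Same payload as above, written as a Python dict
payload = {
    "operationName": "",
    "query": "",
    "variables": {"ownerId": "5dc143bd6fb9a04a7b29ccaf", "size": 20, "after": "1572946941301"},
    "extensions": {"query": {"id": "b158d18c7ce74f0d6d85e73f21e17df6"}},
}

# Build the request without sending it, to inspect what would go over the wire
request = requests.Request('POST', 'https://web-api.juejin.im/query', json=payload)
prepared = request.prepare()

print(prepared.headers['Content-Type'])
print(json.loads(prepared.body) == payload)
```

To actually send it, requests.post(url, headers=headers, json=payload) would replace the data= argument used above.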