Web scraping with requests

荒城以北

Category: Web scraping

Article link: https://blog.csdn.net/weixin_44090435/article/details/86346193

Ternary operator

a = b if b else c  # if b is truthy, a = b; otherwise a = c

This is equivalent to:

if b:
    a = b
else:
    a = c
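
One detail worth illustrating: the condition tests the truthiness of b, so falsy values such as 0 or an empty string also fall through to c. A minimal sketch, using a hypothetical helper named pick that is not part of the original notes:

```python
def pick(b, c):
    """Return b when it is truthy, otherwise c (hypothetical helper)."""
    return b if b else c


print(pick("token", "default"))  # b is truthy, so b is returned
print(pick("", "default"))       # an empty string is falsy, so c is returned
print(pick(0, 10))               # 0 is falsy too, so c is returned
```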

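Handling cookies with requests

There are two ways to send cookies:

1. Put the cookie string directly into the headers, as the value of the Cookie header.
2. Prepare a cookie dict and pass it to the cookies parameter when making the request.

import requests

cookies_dict = {}  # cookie name -> value pairs
# url and headers are assumed to be defined already
requests.get(url, headers=headers, cookies=cookies_dict)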
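
A cookie string copied from the browser usually has the form "name1=value1; name2=value2". One way to turn it into the dict used above might look like the following sketch. The helper name cookie_str_to_dict and the sample cookie values are made up for illustration.

```python
def cookie_str_to_dict(cookie_str):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in cookie_str.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, since a value may itself contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Made-up cookie string, for illustration only.
cookies_dict = cookie_str_to_dict("sessionid=abc123; token=x=y; lang=en")
print(cookies_dict)  # {'sessionid': 'abc123', 'token': 'x=y', 'lang': 'en'}
```

The resulting dict can then be passed as cookies=cookies_dict to requests.get or requests.post.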

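JS analysis

Observe what changes between requests.
Locate the JS:
1. Find the tag that triggers the request, find the handler in the tag's event listeners, and locate the JS from there.
2. Search globally for a keyword from the URL.
Execute the JS.
Reproduce the JS logic in Python code.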

Getting cookies with requests

requests.utils.dict_from_cookiejar converts a CookieJar object into a dict.

import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(type(response.cookies)) # <class 'requests.cookies.RequestsCookieJar'>

cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies)
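
The conversion can also be tried without a network request. The sketch below builds a jar from a dict with requests.cookies.cookiejar_from_dict and converts it back. The cookie names and values are made up.

```python
import requests

# Build a jar from a plain dict (the values here are made up).
jar = requests.cookies.cookiejar_from_dict({"sessionid": "abc123", "lang": "en"})
print(type(jar))

# Convert it back to a dict with dict_from_cookiejar.
cookies = requests.utils.dict_from_cookiejar(jar)
print(cookies)
```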

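Handling certificates with requests

Some websites in China use HTTPS with a self-signed certificate rather than one issued by a trusted certificate authority. Google Chrome shows a certificate error when you visit such a site.

Accessing such a site with requests raises an error like this: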

requests.exceptions.SSLError: HTTPSConnectionPool(host='sls.cdb.com.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))

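To scrape such a site anyway, pass verify=False to the get or post call so that the certificate is not verified.

import requests

url = "https://sls.cdb.com.cn/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0",
}
response = requests.get(url=url, headers=headers, verify=False)  # skip certificate verification
print(response.status_code)

A warning (InsecureRequestWarning) may still be printed, but no exception is raised.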
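
If the warning is unwanted, one option is to silence it through urllib3 and set verify on a Session so that it does not have to be repeated on every call. This is only a sketch: the helper name make_unverified_session is hypothetical, and skipping verification removes protection against man-in-the-middle attacks.

```python
import requests
import urllib3

# Silence only the warning that unverified HTTPS requests trigger.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def make_unverified_session():
    """Return a Session whose default is to skip certificate checks (hypothetical helper)."""
    session = requests.Session()
    session.verify = False  # default for requests sent through this session
    return session


session = make_unverified_session()
print(session.verify)
```

When the site's certificate or CA file is available, passing its path as verify (for example verify="/path/to/ca.pem", a placeholder path) keeps verification enabled and is the safer choice.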

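Decorators

A decorator extends an existing function with extra behavior, such as permission checks or timing, without violating the open-closed principle. Under the hood it is implemented with a closure.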
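
As an illustration of the timing use case, here is a minimal sketch of a decorator. The names timed, wrapper, and add are made up for the example. The wrapper function is a closure over func.

```python
import functools
import time


def timed(func):
    """Decorator that records how long each call to func takes."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)  # func comes from the enclosing scope
        wrapper.last_elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {wrapper.last_elapsed:.6f}s")
        return result
    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper


@timed
def add(a, b):
    return a + b


print(add(1, 2))
```

The body of add is untouched. The extra behavior lives entirely in the wrapper, which is the point of the open-closed principle.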

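Closures

The three ingredients of a closure:

Functions are nested.
The outer function returns a reference to the inner function.
The inner function uses a local variable of the outer function.

Closure: the definition of a closure is not specific to the nested-function form. Normally, once the outer function finishes, its local variables are reclaimed. In a closure, however, the outer function returns a reference to the inner function, and the inner function uses the outer function's local variables, so those variables are not reclaimed. Instead they are stored together with the reference to the inner function, and this combination is what is called a closure.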
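
The three ingredients can be seen in a small sketch. The names make_counter and increment are made up for the example. In CPython the captured variable is reachable through the function's __closure__ attribute.

```python
def make_counter(start):
    count = start  # local variable of the outer function

    def increment():
        nonlocal count  # the inner function uses the outer local variable
        count += 1
        return count

    return increment  # return a reference to the inner function


counter = make_counter(10)
print(counter())  # 11
print(counter())  # 12
# The captured variable lives in a cell attached to the function object.
print(counter.__closure__[0].cell_contents)  # 12
```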

Late binding in closures

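def multipliers():
    return [lambda x: i * x for i in range(4)]

print([m(2) for m in multipliers()])
# [6, 6, 6, 6]

def multipliers():
    list1 = []
    for i in range(4):
        def inner(x):
            return i * x

        list1.append(inner)
    i += 1  # runs after the loop, so i is 4 by the time the functions are called
    return list1

print([m(2) for m in multipliers()])
# [8, 8, 8, 8]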

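The inner function captures the outer function's variable itself, not the value it had when the inner function was defined. The value is looked up only when the inner function is called. By then the outer function has finished and the variable holds its final value (3 in the first example, and 4 in the second because of i += 1). This behavior is called late binding.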
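
If each function should keep the value i had when it was created, the value has to be bound at definition time. Two common ways to do that are sketched below. The function names multipliers_default and multipliers_partial are made up for the example.

```python
from functools import partial
from operator import mul


def multipliers_default():
    # i=i evaluates i now and stores it as the lambda's default argument.
    return [lambda x, i=i: i * x for i in range(4)]


def multipliers_partial():
    # partial freezes the current value of i as the first argument of mul.
    return [partial(mul, i) for i in range(4)]


print([m(2) for m in multipliers_default()])  # [0, 2, 4, 6]
print([m(2) for m in multipliers_partial()])  # [0, 2, 4, 6]
```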

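Using the retry module

retry can decorate a function and accepts parameters such as the maximum number of attempts. If the function raises an exception, including a failed assertion, the function is run again, until it either completes normally or reaches the maximum number of attempts. The stop_max_attempt_number parameter used here is provided by the retrying package (from retrying import retry).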

import requests
from retrying import retry  # provides the stop_max_attempt_number parameter


class Request:

    @retry(stop_max_attempt_number=3)  # at most 3 attempts; the error propagates only if all 3 fail
    def __parse_post_url(self, url, headers, data, timeout=3, verify=True):
        response = requests.post(url, headers=headers, timeout=timeout, data=data, verify=verify)  # a timeout raises and triggers a retry
        assert response.status_code == 200  # a non-200 status code also raises and triggers a retry
        return response.content.decode()

    def parse_post_url(self, url, headers=None, data=None, timeout=3, verify=True):
        try:  # catch the exception left after the final attempt
            content = self.__parse_post_url(url, headers, data, timeout, verify)
        except Exception as e:
            print(e)
            content = None
        return content

    @retry(stop_max_attempt_number=3)  # at most 3 attempts; the error propagates only if all 3 fail
    def __parse_get_url(self, url, headers, params=None, timeout=3, verify=True):
        response = requests.get(url, headers=headers, timeout=timeout, params=params, verify=verify)  # a timeout raises and triggers a retry
        print(response)
        assert response.status_code == 200  # a non-200 status code also raises and triggers a retry
        return response.content.decode()

    def parse_get_url(self, url, headers=None, params=None, timeout=3, verify=True):
        try:  # catch the exception left after the final attempt
            content = self.__parse_get_url(url, headers, params, timeout, verify)
        except Exception as e:
            print(e)
            content = None
        return content


if __name__ == '__main__':
    r = Request()
    resp = r.parse_get_url("http://www.baidu.com")
    print(resp)
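
To make the retry behavior concrete without a network call, here is a simplified stand-in written from scratch. It is not how the retrying package is implemented, and the names simple_retry and flaky are hypothetical. It only mimics the "try up to N times, then re-raise" idea.

```python
import functools


def simple_retry(max_attempts=3):
    """Hypothetical stand-in for a retry decorator: call func up to max_attempts times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # every attempt failed: surface the last error
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(max_attempts=3)
def flaky():
    calls["count"] += 1
    assert calls["count"] >= 3  # fails on the first two calls
    return "ok"


print(flaky(), calls["count"])  # ok 3
```

A failed assert raises AssertionError, which is an ordinary exception. This is why the status-code assertion in the class above triggers a retry in the same way as a timeout.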

Homework analysis

https://m.douban.com/rexxar/api/v2/subject_collection/tv_korean/items?start=0&count=18

total = 37

https://m.douban.com/rexxar/api/v2/subject_collection/tv_korean/items?start=18&count=18
https://m.douban.com/rexxar/api/v2/subject_collection/tv_korean/items?start={}&count=18

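url_temp = "https://m.douban.com/rexxar/api/v2/subject_collection/tv_korean/items?start={}&count=18"
page_nums = total // 18 + 1 if total % 18 else total // 18  # round up when there is a remainder
for i in range(1, page_nums):  # pages after the first; start=0 was requested above
    url = url_temp.format(i * 18)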
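
The page count is a ceiling division: with total = 37 and 18 items per page, three requests are needed (start=0, 18, 36). A minimal sketch that builds every page URL, including the first one, is shown below. The names URL_TEMPLATE, PAGE_SIZE, and page_urls are made up for the example.

```python
URL_TEMPLATE = (
    "https://m.douban.com/rexxar/api/v2/subject_collection/"
    "tv_korean/items?start={}&count=18"
)
PAGE_SIZE = 18


def page_urls(total, page_size=PAGE_SIZE):
    """Build one URL per page; the last page may be only partly filled."""
    page_count = -(-total // page_size)  # ceiling division without floats
    return [URL_TEMPLATE.format(page * page_size) for page in range(page_count)]


for url in page_urls(37):
    print(url)
```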
