Web Data Collection with Python

酸奶加香蕉

Category: Notes. Tags: python, web scraping

Article link: https://blog.csdn.net/m0_54139855/article/details/119897455

Advanced requests usage

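"""
example01 - Advanced requests usage ---> Session

Author: Lj~Asus
Date: 2021/8/23
"""
import requests

# A Session keeps headers and cookies across requests and reuses connections
session = requests.Session()
# Skip TLS certificate verification (only do this for sites you trust)
session.verify = False
session.headers.update({
    'User-Agent': '...'
})
resp = session.get('URL to fetch')  # replace with the URL you want to fetch
print(resp.status_code)
print(resp.text)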

Using Selenium to get around crawler honeypots

The single most important line for getting past Selenium detection:
browser.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    }
)

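"""
example03 - Using Selenium to get around crawler honeypots

Author: Lj~Asus
Date: 2021/8/23
"""
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; Selenium 3 took the driver path as the first positional argument
options = webdriver.ChromeOptions()
# Optionally drop the "enable-automation" switch (it is set on the options, not on the browser)
# options.add_experimental_option('excludeSwitches', ['enable-automation'])
browser = webdriver.Chrome(service=Service('resources/chromedriver.exe'), options=options)

# The single most important line for getting past Selenium detection:
# hide navigator.webdriver before any page script runs
browser.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    }
)

# Wait up to 10 seconds when looking up elements
browser.implicitly_wait(10)

browser.get('')  # target URL left blank in the original

anchor = browser.find_element(By.CSS_SELECTOR, '')  # selector left blank in the original
# is_displayed() on a WebElement tells whether the element is visible
# Note: an invisible hyperlink should generally not be followed, because it is
# very likely a honeypot link meant to lure crawlers
print(anchor.is_displayed())
print(anchor.size)
print(anchor.location)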
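
The visibility check above can be applied to a whole list of links before following any of them. The following is only a sketch: `visible_links` and `FakeLink` are hypothetical names, and `FakeLink` merely imitates the `is_displayed()` method and `size` attribute of a Selenium WebElement so that the example runs without a browser.

```python
def visible_links(elements):
    """Keep only elements that are displayed and occupy some area on the page."""
    safe = []
    for element in elements:
        size = element.size  # Selenium reports a dict with 'width' and 'height'
        if element.is_displayed() and size['width'] > 0 and size['height'] > 0:
            safe.append(element)
    return safe


class FakeLink:
    """Hypothetical stand-in for a WebElement (not part of Selenium)."""

    def __init__(self, name, displayed, width, height):
        self.name = name
        self._displayed = displayed
        self.size = {'width': width, 'height': height}

    def is_displayed(self):
        return self._displayed


links = [
    FakeLink('normal', True, 120, 20),
    FakeLink('hidden', False, 120, 20),
    FakeLink('zero-size', True, 0, 0),
]
print([link.name for link in visible_links(links)])  # ['normal']
```

With a real browser, the list would come from `browser.find_elements(By.CSS_SELECTOR, 'a')` instead of the fake objects.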

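Optical character recognition (OCR)

Note: installing easyocr also pulls in other libraries, about 1.7 GB in total, so be sure to install it when you have a good network connection.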

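"""
example04 - Optical character recognition

Author: Lj~Asus
Date: 2021/8/23
"""
import warnings

import easyocr

# Suppress warnings
warnings.filterwarnings('ignore')

# Simplified Chinese: ch_sim, Traditional Chinese: ch_tra, English and digits: en
reader = easyocr.Reader(['en'], gpu=False)
# detail=0 returns only the recognized text, without boxes or confidence scores
print(reader.readtext('path to the image to recognize', detail=0))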

Cropping an image from a page

PIL (Python Imaging Library) -> Pillow

Then use the crop() method.

"""
example05 - Cropping an image from a page

Author: Lj~Asus
Date: 2021/8/23
"""
from PIL import Image as img
from PIL.Image import Image

image = img.open('resources/idcard.jpg')  # type: Image
print(image.size)
# Crop: the box is (left, upper, right, lower) in pixels
# 500, 316
head = image.crop((320, 50, 460, 235))
# Display the cropped region
head.show()
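
The crop box is a `(left, upper, right, lower)` tuple in pixel coordinates, measured from the top-left corner of the image. A small sketch of the arithmetic follows; the helper names `crop_size` and `fits_inside` are hypothetical, and it assumes the `# 500, 316` comment in the listing records the size of the source image.

```python
def crop_size(box):
    """Return the (width, height) of the region described by a crop box."""
    left, upper, right, lower = box
    return right - left, lower - upper


def fits_inside(box, image_size):
    """Check that the box is non-empty and lies entirely within the image."""
    left, upper, right, lower = box
    width, height = image_size
    return 0 <= left < right <= width and 0 <= upper < lower <= height


box = (320, 50, 460, 235)
print(crop_size(box))                # (140, 185)
print(fits_inside(box, (500, 316)))  # True
```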

Ways to speed up crawling

Concurrent programming

Multithreading

Thread(target=…, args=(…, …)) -> start()

Subclass Thread and override run() -> create an instance of the custom class -> start()

ThreadPoolExecutor() -> submit(fn, …) / map(fn, […])

"""
example08 - The first way to write multithreaded code

Author: Lj~Asus
Date: 2021/8/24
"""
import time
from threading import Thread


def output(content):
    while True:
        # stdout is buffered; flush=True writes the text immediately
        # instead of waiting for the buffer to fill up
        print(content, end='', flush=True)
        time.sleep(0.1)

# output('Ping')
Thread(target=output, args=('Ping', )).start()
Thread(target=output, args=('Pong', )).start()
output('Hello')

"""
example10 - The second way to write multithreaded code: a custom thread class

Author: Hao
Date: 2021/8/24
"""
import time
from threading import Thread


class OutputThread(Thread):
    """Custom thread class"""

    def __init__(self, content):
        super().__init__()
        self.content = content

    def run(self):
        while True:
            print(self.content, end='', flush=True)
            time.sleep(0.1)


OutputThread('Ping').start()
OutputThread('Pong').start()

"""
example11 - The third way to write multithreaded code: a thread pool

Author: Lj~Asus
Date: 2021/8/24
"""
import time
from concurrent.futures import ThreadPoolExecutor


def output(content):
    while True:
        # stdout is buffered; flush=True writes the text immediately
        # instead of waiting for the buffer to fill up
        print(content, end='', flush=True)
        time.sleep(0.1)

with ThreadPoolExecutor(max_workers=16) as pool:
    pool.submit(output, 'Ping')
    pool.submit(output, 'Pong')
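
The three listings above loop forever, which makes it hard to see what a pool buys you. Here is a finite sketch, under the assumption that each download can be simulated by a short sleep; `fake_download` is a hypothetical stand-in for a real request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Pretend to fetch a page; the sleep stands in for network latency."""
    time.sleep(0.2)
    return f'page-{page}'


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in the order of the inputs, not in completion order
    results = list(pool.map(fake_download, range(8)))
elapsed = time.perf_counter() - start

print(results)
print(f'{elapsed:.2f}s')
```

Run one after another, the eight simulated downloads would take about 1.6 seconds; with eight worker threads the waits overlap, so the total should be close to a single 0.2 second wait.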

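Multiprocessing

Process(target=…, args=(…, …)) -> start()

Subclass Process and override run() -> create an instance of the custom class -> start()

ProcessPoolExecutor() -> submit(fn, …) / map(fn, […])

Asynchronous programming (async I/O) -> cooperative concurrency: it creates the effect of concurrency by making better use of the CPU, switching to another task while the current one waits on I/O
I/O-bound tasks -> most of the work is input and output, and very little CPU computation is needed
CPU-bound tasks -> most of the work is CPU computation, and I/O interruptions rarely occur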
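
To make the async I/O idea concrete, here is a minimal sketch using the standard-library `asyncio` module. `fake_fetch` and `crawl` are hypothetical names, and `asyncio.sleep` stands in for a real network wait.

```python
import asyncio
import time


async def fake_fetch(page):
    """Simulate an I/O wait; control returns to the event loop during the sleep."""
    await asyncio.sleep(0.2)
    return f'page-{page}'


async def crawl(pages):
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*(fake_fetch(page) for page in pages))


start = time.perf_counter()
results = asyncio.run(crawl(range(5)))
elapsed = time.perf_counter() - start

print(results)
print(f'{elapsed:.2f}s')
```

Everything here runs on one thread. The five waits overlap because each coroutine hands control back at `await`, which is why this style suits I/O-bound tasks and does little for CPU-bound ones.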

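Distributed crawlers

Key point: a Redis database (a key-value store) is usually deployed to hold the pages waiting to be crawled, the pages already crawled, and possibly some of the scraped data. With this shared store, the machines running the crawler can coordinate with each other and work toward a common goal.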
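
The coordination described above can be sketched as a "seen" set plus a "to-do" queue. This is an illustration only: `InMemoryStore` is a hypothetical in-process stand-in that borrows the names of the Redis commands SADD, LPUSH and RPOP, and the key names `crawler:seen` and `crawler:todo` are assumptions. A real deployment would send the same commands to a shared Redis server through a client library.

```python
from collections import deque


class InMemoryStore:
    """Hypothetical stand-in that mimics a few Redis commands in one process."""

    def __init__(self):
        self._sets = {}
        self._lists = {}

    def sadd(self, key, member):
        """Add member to a set; return 1 if it was new, 0 if already present."""
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        """Push value onto the left end of a list."""
        self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        """Pop from the right end of a list, or return None if it is empty."""
        items = self._lists.get(key)
        return items.pop() if items else None


def schedule(store, url):
    """Queue url only if no worker has scheduled it before."""
    if store.sadd('crawler:seen', url):
        store.lpush('crawler:todo', url)
        return True
    return False


def next_url(store):
    """Take the oldest pending URL, or None when the queue is empty."""
    return store.rpop('crawler:todo')


store = InMemoryStore()
for url in ['https://example.com/a', 'https://example.com/b', 'https://example.com/a']:
    schedule(store, url)

print(next_url(store))  # https://example.com/a
print(next_url(store))  # https://example.com/b
print(next_url(store))  # None
```

Because every worker checks the same set before queueing, a page discovered by two machines is still crawled only once.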

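Using multiple processes and process pools

Because of the GIL, multithreading cannot take advantage of multiple CPU cores, so CPU-bound tasks should use multiple processes instead.
Run the following in a terminal:

Run the code below with the thread pool (enable the commented-out thread-pool block in main())
python example08.py

Run the code below with the process pool (you can check how many cores your computer has in Task Manager)
python example08.py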

"""
example08 - Using multiple processes and process pools
Because of the GIL, multithreading cannot take advantage of multiple CPU cores,
so CPU-bound tasks should use multiple processes instead.

time python example08.py ---> run the code and measure how long it takes

Author: Hao
Date: 2021/8/23
"""
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor

# Check whether each number in the list is prime (a CPU-bound task)
PRIMES = [
    1116281,
    1297337,
    104395303,
    472882027,
    533000389,
    817504243,
    982451653,
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419,
    1099726899285421
] * 5


def is_prime(num):
    """Return True if num is prime."""
    for i in range(2, int(num ** 0.5) + 1):
        if num % i == 0:
            return False
    return num > 1


def main():
    """Entry point"""
    # # Run with multiple threads
    # with ThreadPoolExecutor(max_workers=4) as pool:
    #     for number, result in zip(PRIMES, pool.map(is_prime, PRIMES)):
    #         print(f'{number} is prime: {result}')
    # Run with multiple processes (the speed-up depends on how many cores your computer has)
    with ProcessPoolExecutor(max_workers=4) as pool:
        for number, result in zip(PRIMES, pool.map(is_prime, PRIMES)):
            print(f'{number} is prime: {result}')


if __name__ == '__main__':
    main()


Generators

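"""
example12 - Generators

Author: Lj~Asus
Date: 2021/8/24
"""

# Literal syntax for creating a generator (a generator expression)
nums = (num for num in range(1, 10))

# Take a value from the generator with next()
print(next(nums))

# The loop resumes where next() stopped, so it prints 2 through 9
for num in nums:
    print(num, end=' ')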

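"""
example13 - Generators

Once a function contains yield, it is no longer an ordinary function but a generator function.
Calling it does not produce a return value; it produces a generator object.

Author: Lj~Asus
Date: 2021/8/24
"""


def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
        # return a
        yield a


gen_obj = fib(20)
print(next(gen_obj))
print(next(gen_obj))

# The loop continues with the remaining 18 values
for i in gen_obj:
    print(i, end=' ')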
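
In a crawler, the same laziness is handy for producing page URLs on demand rather than building the whole list up front. A small sketch follows; the `page_urls` helper, the `?start=` query parameter and the example.com address are all assumptions made for illustration.

```python
from itertools import islice


def page_urls(base_url, page_size=25):
    """Yield listing-page URLs without end; the caller decides how many to take."""
    start = 0
    while True:
        yield f'{base_url}?start={start}'
        start += page_size


# islice() stops after three items, so the infinite generator is safe to use
first_three = list(islice(page_urls('https://example.com/top'), 3))
print(first_three)
```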

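Using a crawler framework

Framework: a framework packages up the common features and boilerplate code of project development, so you can focus on the core problem instead of rewriting the same boilerplate and re-implementing features that have already been implemented countless times.

Scrapy -> command-line tool -> create a crawler project

Install Scrapy (note: remember to do this in a command prompt window): pip install scrapy
Create a Scrapy project: scrapy startproject demo
Create a spider: scrapy genspider douban movie.douban.com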

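Once the project has been created, drag it into PyCharm to open it, then work through the following items:

    - Edit the configuration file (find the relevant entries in `settings.py`):
        - USER_AGENT
        - DOWNLOAD_DELAY
        - CONCURRENT_REQUESTS
    - Run a spider: scrapy crawl douban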
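
As a rough illustration of the three settings listed above, a `settings.py` fragment might look like the following. The values are assumptions chosen for the example, not recommendations from the original article.

```python
# settings.py (fragment); all values below are illustrative assumptions
USER_AGENT = 'Mozilla/5.0 (compatible; demo-crawler)'  # hypothetical User-Agent string
DOWNLOAD_DELAY = 2        # seconds to wait between consecutive requests to the same site
CONCURRENT_REQUESTS = 4   # upper limit on requests Scrapy performs at the same time
```

A longer delay and a lower concurrency limit make the crawl slower but put less load on the target site.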
